"Modern" metrics collections (ala Netdata)

Good day. Years ago (in 2014, when I first got to administer a Mendix stack), Munin was the thing, and though it's not bad per se, I've been busy deploying https://Netdata.cloud as the next big thing. One of the "fun" parts for me is the Mendix stacks I manage, so I started to consider importing the stats into Netdata as discussed here https://learn.netdata.cloud/docs/agent/collectors/python.d.plugin, but then I started to think, and these are my questions:

Is there a "simple" method to generically extract the stats directly from the Mendix admin socket (i.e. without the Python)? This might be the "beterer" method, as I can then point my GoLang colleague in the direction of the Admin port's stats documentation (which I haven't yet found), and life would be great again, especially as we could make this a Prometheus-type endpoint as an encore. (Actually, if Mendix itself were a Prometheus endpoint, that would've been even more awesome ;) )

Failing that "simple"/documented/etc. API to the stats, I see the need for a plugin, but that plugin would then be either: (a) a major hack to parse Munin output of m2ee (okay, that might be a general help for more Munin people, but it feels like yet another trailer-at-the-back, can-fail type of solution), or (b) I integrate it as part of m2ee and provide a PR for you to integrate into m2ee-tools.

Which route would Mendix etc. advise me to follow?

In the case of (b), any specific advice to follow?

Hi, thanks for your question. It took a while to get back to you. :|

Is there a "simple" method to generically extract the stats directly from the Mendix admin socket (i.e. without the Python)? This might be the "beterer" method, as I can then point my GoLang colleague in the direction of the Admin port's stats documentation (which I haven't yet found), and life would be great again [...]

There's no real documentation about the Runtime Admin API because it's an "internal" API. That more or less means: "if you're doing anything with it, don't ask Mendix Support when the API changes, etc." But it's also a very simple API that has been stable for many years already. If you run the m2ee CLI program with -vvv (or dump the traffic to/from the port), you can see the requests and responses flying by.

E.g. if the Admin API password were 'hevisko' (and the port 9000), then the following would already give you part of the stats:

curl -X POST http://localhost:9000 -H "X-M2EE-Authentication: aGV2aXNrbw==" -H "Content-Type: application/json" --data '{"action": "runtime_statistics", "params": {}}'

where the seemingly random string is simply the base64 encoding of the password. In src/m2ee/client.py you can see the Python function that does the same, and the rest of client.py shows all the actions that are implemented. This should be fairly trivial to reimplement in Go; it's really nothing more than a simple JSON POST request. (Hint: look at the develop branch, not master.)
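For comparison, here is a minimal Python sketch of the same POST using only the standard library. The 'hevisko' password and port 9000 are the example values from the curl call; build_admin_request and call_admin_api are hypothetical helper names, not part of m2ee-tools, and the sketch assumes the Admin API replies with a JSON body, which it returns undecorated.

```python
import base64
import json
import urllib.request


def build_admin_request(action, params=None, password="hevisko",
                        url="http://localhost:9000"):
    """Build a Runtime Admin API request (hypothetical helper)."""
    # The auth header carries the base64-encoded admin password.
    token = base64.b64encode(password.encode("utf-8")).decode("ascii")
    body = json.dumps({"action": action, "params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "X-M2EE-Authentication": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_admin_api(action, params=None, password="hevisko",
                   url="http://localhost:9000", timeout=10):
    """POST one action and decode the reply (assumed to be JSON)."""
    request = build_admin_request(action, params, password, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running application, call_admin_api("server_statistics") would then be the Python equivalent of swapping the action name in the curl command.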

Failing that "simple"/documented/etc. API to the stats, I see the need for a plugin, but that plugin would then be either:

(a) a major hack to parse Munin output of m2ee

Lol, no.

(b) I integrate it as part of m2ee and provide a PR for you to integrate into m2ee-tools

So, basically, all you need to do to get and pretty-print all the stats (when running this as the same system user the application runs as) is:

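#!/usr/bin/python3
# Run as the same system user as the Mendix application,
# so the default m2ee configuration can be found.
import pprint

import m2ee
import m2ee.munin  # load the munin submodule explicitly

m2 = m2ee.M2EE()
# get_stats returns the statistics plus the detected Java version.
stats, java_version = m2ee.munin.get_stats('values', m2)
pprint.pprint(stats)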

So you can import m2ee from any other Python program, get the stats, look at what's in there, and then do whatever you need to do. (The munin plugin in here is just a bunch of print statements that print the values.)
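Most collectors (Netdata, Prometheus, InfluxDB) want flat name/value pairs rather than a nested structure. A minimal sketch, assuming stats is a nested dict of numbers; flatten_stats is a hypothetical helper, and the sample keys below are made up for illustration, not the real get_stats output.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten_stats(stats: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, float]]:
    """Yield (dotted.name, value) pairs for every numeric leaf in a nested dict."""
    for key, value in stats.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted name.
            yield from flatten_stats(value, name)
        elif isinstance(value, (int, float)):
            # bool is a subclass of int, so True/False become 1.0/0.0.
            yield name, float(value)
        # Anything else (strings, lists, None) is skipped.


# Hypothetical sample; the real stats keys will differ.
sample = {"memory": {"used_heap": 123, "committed_heap": 456},
          "sessions": {"named_users": 3},
          "version": "x.y"}

for metric, value in flatten_stats(sample):
    print(metric, value)
```

Dotted names map naturally onto chart/dimension ids or metric names; skipping non-numeric leaves keeps version strings and similar metadata out of the time series.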

[Note: The difference between the result of this get_stats and calling the Admin API actions directly is that get_stats has post-processing code which deals with Mendix versions that are too old to provide certain information, and which hotfixes stats from some old versions that have bugs putting values in the wrong place. But I actually think all of that fixup code is for Mendix versions that are older than what's officially supported now...]

Which route would Mendix etc. advise me to follow?

I'd say: simply 'reverse engineer' the simple Admin API calls for runtime_statistics and server_statistics and work with the results directly.
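A sketch of that route, deliberately limited to the two statistics actions. collect_stats and the injected call function are hypothetical names; call could be any function that POSTs an action to the Admin port as in the curl example earlier in the thread. The fake caller only exists so the shape can be tried without a running application.

```python
import json
from typing import Any, Callable, Dict

# A caller takes (action, params) and returns the decoded JSON reply.
Caller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Only the two statistics actions, nothing heavier.
STATS_ACTIONS = ("runtime_statistics", "server_statistics")


def collect_stats(call: Caller) -> Dict[str, Dict[str, Any]]:
    """Run the statistics actions and key the raw replies by action name."""
    return {action: call(action, {}) for action in STATS_ACTIONS}


def fake_call(action: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a real HTTP caller, echoing what it was asked."""
    return {"action_seen": action, "params_seen": params}


print(json.dumps(collect_stats(fake_call), indent=2))
```

Injecting the caller keeps the transport (Python here, Go for your colleague) separate from the decision of which actions to poll.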

[Note: get_stats also uses the get_all_thread_stack_traces API call in the process. You want to avoid using this if you gather stats frequently (Munin polls only once every 5 minutes), because I suspect it causes a stop-the-world in the JVM.]
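If you still want the occasional thread dump while polling the cheap statistics often, one option is to rate-limit just that call. A minimal sketch; the Throttle class is a hypothetical helper, and the 300-second interval simply mirrors Munin's 5-minute pace rather than any tested safe value.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Allow an expensive action at most once per `interval` seconds."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable, so tests can fake time
        self._last: Optional[float] = None

    def ready(self) -> bool:
        """Return True (and start a new window) if the action may run now."""
        now = self.clock()
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False


# Poll statistics every few seconds, but only request
# get_all_thread_stack_traces when thread_dumps.ready() says so.
thread_dumps = Throttle(interval=300)
```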

In the case of (b), any specific advice to follow?

No, not really. I would really recommend starting with a single-purpose proof of concept for yourself.

I mean, if you know you're running a standardized combination of JVM and Mendix versions for your own apps, then you only have to look once at what kind of stats you get, and tetris them into place so that your Netdata system starts showing useful graphs.
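One way to "tetris" a few values into Netdata without the python.d framework is an external plugin that writes Netdata's plugins.d text protocol (CHART, DIMENSION, BEGIN, SET, END) to stdout. The sketch below follows that protocol as I understand it, so check it against the Netdata external plugins documentation; netdata_chart and the mendix.sessions chart are hypothetical names.

```python
from typing import Dict


def netdata_chart(chart_id: str, title: str, units: str,
                  dims: Dict[str, int]) -> str:
    """Format a chart definition plus one data sample in Netdata's
    external-plugin text protocol."""
    # An empty chart name ("") lets Netdata fall back to the chart id.
    lines = [f'CHART {chart_id} "" "{title}" "{units}"']
    # Each dimension: id, display name, algorithm, multiplier, divisor.
    lines += [f"DIMENSION {name} {name} absolute 1 1" for name in dims]
    # One collection cycle: BEGIN, a SET per dimension, END.
    lines.append(f"BEGIN {chart_id}")
    lines += [f"SET {name} = {int(value)}" for name, value in dims.items()]
    lines.append("END")
    return "\n".join(lines)


print(netdata_chart("mendix.sessions", "Named user sessions", "sessions",
                    {"named_users": 3}))
```

A real plugin would print the CHART/DIMENSION part once at startup and then repeat only the BEGIN/SET/END part on every update; they are combined here to keep the sketch short.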

Hans

Haven't forgotten about this :)

Busy looking at methods to push into InfluxDB. Looking at the admin API's responses, since they're JSON, only a few bits need fixing, and it should theoretically just be a "simple" pull from Telegraf into InfluxDB.
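If you end up pushing points yourself instead of letting Telegraf parse the JSON, InfluxDB's line protocol is simple enough to emit by hand. A minimal sketch; to_line_protocol is a hypothetical helper, it writes every field as a float, and it only covers the basic escaping rules (commas and spaces in measurement names; commas, equals signs and spaces in tag and field keys and tag values).

```python
from typing import Dict, Optional


def _escape(text: str, chars: str) -> str:
    # Line protocol escapes special characters with a backslash.
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, float],
                     timestamp_ns: Optional[int] = None) -> str:
    """Render one InfluxDB line-protocol point (all fields as floats)."""
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += f",{_escape(key, ',= ')}={_escape(value, ',= ')}"
    body = ",".join(f"{_escape(k, ',= ')}={float(v)!r}"
                    for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(to_line_protocol("mendix", {"app": "demo app"}, {"sessions": 3}))
```

Combined with flattened stats names, each poll becomes one line per application, tagged with whatever identifies the app in your setup.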

The one question that is "bugging" me is the admin password, or a method to set/have a read-only/stats admin password (one that can't do things like restart, etc.).

Oh! Now that you mention it... You can totally do that!

There's actually something for that: an alternative kind of admin password whose privileges are limited to executing read-only monitoring actions. When designing the Runtime Admin API some 11 or 12 (13?) years ago, we already put that in, for obvious reasons indeed. Some monitoring plugin asking for statistics should not be able to reset the password of the administrator user defined in the application model (duh).

I just browsed around a bit in the latest version of the Mendix Runtime code that I have here to refresh my memory.

This library even has it implemented, but apparently it never made it into full-documented-m2ee.yaml. You can additionally provide a monitoring_pass option in the m2ee section of the yaml config. It ends up being passed to the JVM process as the M2EE_MONITORING_PASS environment variable when the process is started.

While using that password, you can call the following actions:

echo (which you call with a 'ping' param and which should answer 'pong', but says 'krak' (Dutch for broken) when there has been any log message on CRITICAL level)
about
runtime status
runtime statistics
server statistics (when I'll ever be writing the "where the m2ee name originates from" page, it will explain why there are two separate calls)
license information
actions currently being executed
logged in user names
cache statistics (I think that's obsolete now and just for Mendix <7)
and the call to run the health check microflow (an often-under-appreciated way to build a (light-weight while executing!) check that can alert based on business logic or anything customer/application specific that should be able to page someone out of bed)
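A monitoring check can turn the echo answer into an alert state. A minimal sketch: the {"echo": "ping"} params shape is an assumption (confirm the exact params and where the reply string sits in the response with m2ee -vvv), interpret_echo is a hypothetical helper, and the OK/CRITICAL/UNKNOWN labels are just the usual Nagios-style convention.

```python
from typing import Any, Dict, Tuple

# Assumption: echo takes its 'ping' as {"echo": "ping"}; verify with m2ee -vvv.
ECHO_PARAMS: Dict[str, Any] = {"echo": "ping"}


def interpret_echo(reply_text: str) -> Tuple[str, str]:
    """Map the echoed string to a (status, message) pair for alerting."""
    if reply_text == "pong":
        return "OK", "runtime answered pong"
    if reply_text == "krak":
        # 'krak' means a CRITICAL-level log message has been written.
        return "CRITICAL", "runtime logged a CRITICAL message (krak)"
    return "UNKNOWN", f"unexpected echo reply: {reply_text!r}"
```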

To see how the individual actions are called with params and to figure out what result they provide, you can play around with m2ee -vvv to see the communication happen. It's all very straightforward, and yes, it's a stable API.

Now, the interesting question is...

Are all the calls made by the get_stats thing covered by this, or will it explode halfway? I can help test that.
When writing a monitoring plugin that reuses some of the statistics-gathering code in here, can we use some sort of super-minimal m2ee.yaml (it doesn't have to be a written file; you can also inject all needed config dynamically(!)) that is sufficient to just call monitoring actions, and where loading the config will not lead to complaints or exceptions because some of the code can't deal with this scenario?
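A quick way to frame the first question is to compare the actions get_stats is said to use in this thread with the monitoring-password list. This is only a sketch built from the names mentioned here: the identifiers for "about" and "runtime status" are hypothetical spellings, and the real list in the Runtime may differ, so testing against an actual app remains the real answer.

```python
# Actions that get_stats is mentioned to use in this thread.
GET_STATS_ACTIONS = {"runtime_statistics", "server_statistics",
                     "get_all_thread_stack_traces"}

# Monitoring-password actions from the list above, limited to a few names;
# "about" and "runtime_status" are assumed identifier spellings.
MONITORING_ACTIONS = {"echo", "about", "runtime_status",
                      "runtime_statistics", "server_statistics"}


def uncovered(needed, allowed):
    """Return the actions that would likely be rejected with the monitoring password."""
    return sorted(set(needed) - set(allowed))


print(uncovered(GET_STATS_ACTIONS, MONITORING_ACTIONS))
```

Based on these lists alone, get_all_thread_stack_traces is the first call worth testing with the monitoring password.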

Hans

(when I'll ever be writing the "where the m2ee name originates from" page, it will explain why there are two separate calls)

Perhaps just explain it in a Vimeo/YouTube video ;)

mendix / m2ee-tools #54