rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Provide detailed memory metrics via prometheus plugin #11743

Closed der-eismann closed 3 months ago

der-eismann commented 3 months ago

Is your feature request related to a problem? Please describe.

Hey everyone, we are currently working on replacing the soon-to-be EOL https://github.com/kbudde/rabbitmq_exporter with the built-in prometheus plugin. With that exporter it was possible to get detailed memory statistics from the management plugin, which have helped us debug issues: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbitmq_management/priv/www/js/tmpl/memory.ejs#L9-L31

Unfortunately, I was unable to get these metrics from the prometheus plugin. The only thing that came close was process_resident_memory_bytes (https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbitmq_prometheus/src/collectors/prometheus_rabbitmq_core_metrics_collector.erl#L72).

Describe the solution you'd like

Provide all memory metrics from the management UI via prometheus plugin

Describe alternatives you've considered

No response

Additional context

No response

michaelklishin commented 3 months ago

This is open source software, you are welcome to contribute what you find missing.

The data comes from rabbit_vm:memory/0 and the metrics belong to this group.

der-eismann commented 3 months ago

Wow, I wasn't even able to finish my Erlang introductory course in that short time. Thanks for adding these metrics so quickly!

michaelklishin commented 3 months ago

These metrics are fairly expensive with many queues and streams, so we will limit this to 4.0 and look for ways to optimize this or make this opt-in.

mkuratczyk commented 3 months ago

@der-eismann We have now merged this into main/4.0 (but not 3.13). There's a dedicated endpoint for these metrics: https://www.rabbitmq.com/docs/next/prometheus#memory-breakdown-endpoint

However, I struggle to find a nice Grafana visualization for these metrics. There are quite a few of them, and once you multiply that by the number of nodes in the cluster, you get a lot of data points. Are you currently visualizing these metrics from the exporter? Can you share what that looks like? Ideally, if you could contribute a panel for them, that'd be great.

The RabbitMQ Overview dashboard JSON source file is here if you want to give it a try: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbitmq_prometheus/docker/grafana/dashboards/RabbitMQ-Overview.json
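
To get a feel for how many series the dedicated endpoint mentioned above returns before designing a panel, a minimal Python sketch like the following could fetch and group them. The port (15692, the plugin's default) and the `/metrics/memory-breakdown` path are assumptions taken from the linked docs section, and the helper names are hypothetical. The parser is deliberately simplified: it assumes samples carry no trailing timestamp.

```python
from collections import defaultdict
from urllib.request import urlopen

# Assumption: default rabbitmq_prometheus port and the endpoint path from the
# linked docs section; adjust both for your deployment.
MEMORY_BREAKDOWN_URL = "http://localhost:15692/metrics/memory-breakdown"


def parse_samples(text):
    """Parse Prometheus text exposition into {metric_name: [(labels, value), ...]}.

    Simplification: assumes each sample line ends with the value
    (no optional timestamp after it).
    """
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; label values may
        # themselves contain spaces, so split from the right.
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[name].append((labels.rstrip("}"), float(value)))
    return dict(samples)


def fetch_memory_breakdown(url=MEMORY_BREAKDOWN_URL):
    """Fetch the endpoint and return the parsed samples (hypothetical helper)."""
    with urlopen(url, timeout=5) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling `fetch_memory_breakdown()` against each node and counting the keys of the result gives a rough idea of how many series a per-node panel would have to show.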

der-eismann commented 3 months ago

Hey @mkuratczyk, we used these metrics to figure out why the memory consumption is so high, and with them we noticed that a huge chunk is allocated but unused. The visualization is more of a quick and dirty kind, but I can try to polish it and contribute it. But these are from the old exporter, we don't have the 4.0 beta running yet. I need to invest some time for that, and I'm not sure I can find it in the next two weeks.

[Screenshot: screenshot-20240723-153120]

mkuratczyk commented 3 months ago

That's ok, no rush. Seems like the external exporter provided fewer metrics and you still presented them separately for each node (which totally makes sense). As usual, the problem for us is that when we provide something, users expect it to "just work everywhere", and some users have 9 or more nodes in the cluster, so that's suddenly quite a few new panels. Perhaps a separate dashboard would be useful. Then we can just do it per node and use Grafana's repeat option.
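
The per-node idea with Grafana's repeat option could be sketched roughly as below: one panel definition, repeated for every value of a dashboard variable. This is a hypothetical sketch, not the dashboard that ships with RabbitMQ. The function names, the `node` variable, the use of `rabbitmq_identity_info` and the `rabbitmq_node` label to list nodes, and the example query are all assumptions to adapt to the metrics you actually scrape.

```python
import json


def node_variable(name="node", series="rabbitmq_identity_info", label="rabbitmq_node"):
    """Dashboard variable listing cluster nodes (series and label are assumptions)."""
    return {
        "name": name,
        "type": "query",
        "query": f"label_values({series}, {label})",
        "multi": True,
        "includeAll": True,
    }


def per_node_panel(expr, variable="node"):
    """A single panel that Grafana repeats once per selected value of `variable`."""
    return {
        "type": "timeseries",
        "title": f"Memory breakdown: ${variable}",
        "repeat": variable,        # repeat this panel for each variable value
        "repeatDirection": "h",    # lay the copies out horizontally
        "maxPerRow": 3,
        "targets": [{"expr": expr}],
    }


# Hypothetical query: the metric name pattern is an assumption, not taken from the thread.
example_expr = '{__name__=~"rabbitmq_memory_.+", rabbitmq_node="$node"}'

dashboard = {
    "title": "RabbitMQ memory breakdown (sketch)",
    "templating": {"list": [node_variable()]},
    "panels": [per_node_panel(example_expr)],
}

dashboard_json = json.dumps(dashboard, indent=2)
```

With this layout a 9-node cluster still needs only one panel definition in the JSON, since Grafana expands it at render time.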