rabbitmq / rabbitmq-prometheus

A minimalistic Prometheus exporter of core RabbitMQ metrics

Expose individual queue metrics #9

Closed · Nyoroon closed this issue 4 years ago

Nyoroon commented 4 years ago

Hello!

It would be nice to have more queue metrics exported, like in the separate rabbitmq_exporter: https://github.com/kbudde/rabbitmq_exporter#queues---counter

gerhard commented 4 years ago

Yes, adding more queue metrics would be nice. In the first release we wanted to limit what is exposed so that the impact on RabbitMQ is minimal. This is an example of what is currently exposed:

rabbitmq_queue_messages_published_total{channel="<0.937.0>",queue_vhost="/",queue="greedy-consumer",exchange_vhost="/",exchange="direct"} 142884
rabbitmq_queue_messages_ready{vhost="/",queue="greedy-consumer"} 0
rabbitmq_queue_messages_unacked{vhost="/",queue="greedy-consumer"} 1193
rabbitmq_queue_messages{vhost="/",queue="greedy-consumer"} 1193
rabbitmq_queue_process_reductions_total{vhost="/",queue="greedy-consumer"} 88316199
rabbitmq_queue_disk_reads_total{vhost="/",queue="greedy-consumer"} 0
rabbitmq_queue_disk_writes_total{vhost="/",queue="greedy-consumer"} 0

To help prioritise the queue metrics that will be exposed next, which ones do you find most useful from the list that you linked to?
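
For reference, a quick way to see which of these a node already exposes: a minimal Python sketch, assuming the plugin's default metrics endpoint on localhost:15692 (the host and filtering are just for illustration).

# Sketch: list the per-queue samples currently exposed by rabbitmq-prometheus.
# Assumes the default endpoint on localhost:15692; adjust METRICS_URL if needed.
from urllib.request import urlopen

METRICS_URL = "http://localhost:15692/metrics"

with urlopen(METRICS_URL) as response:
    for line in response.read().decode("utf-8").splitlines():
        # Skip the # HELP / # TYPE comment lines, keep only queue samples
        if line.startswith("rabbitmq_queue_"):
            print(line)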

Nyoroon commented 4 years ago

Currently the most useful metrics for us are the counters of delivered messages (message_stats.get/message_stats.ack) and the counter of redelivered messages (message_stats.redeliver). We use them to monitor whether there is any problem with consumers.

gerhard commented 4 years ago

All these metrics (and more) are on channels, not queues.

This is a good example of what to expect from the new RabbitMQ-Overview Grafana dashboard (will become available on grafana.com shortly):

(screenshot: RabbitMQ-Overview Grafana dashboard)

You can see this dashboard live here: https://grafana.gcp.rabbitmq.com/d/Kn5xm-gZk/rabbitmq-overview?orgId=1&from=now-3h&to=now&refresh=15s

You can also try it out locally: https://www.rabbitmq.com/prometheus.html#quick-start

Which queue metrics do you find most useful?

Nyoroon commented 4 years ago

Oh, thanks!

Also, it looks like our instance(s) don't have any samples for the messages published metric, even though messages were published:

$ curl -s localhost:15692/metrics | grep rabbitmq_channel_messages_published_total
# TYPE rabbitmq_channel_messages_published_total counter
# HELP rabbitmq_channel_messages_published_total Total number of messages published into an exchange on a channel

gerhard commented 4 years ago

That is interesting.

Can you share a snapshot of your RabbitMQ-Overview dashboard which shows all your metrics? Like this:

(screenshot: sharing a Grafana dashboard snapshot)

Resulting snapshot that auto-expires in 1h: https://snapshot.raintank.io/dashboard/snapshot/OzWQOWl1wUNGm5OXV6R7sKQk17Qz0AUP?orgId=2

Nyoroon commented 4 years ago

There it is: http://grafana.condenast.ru/dashboard/snapshot/eKO3ZVU1F3k8VaBWpa4Rc416b1kiv2uG

gerhard commented 4 years ago

It's not loading for me. Do you have a firewall in front?

I've tried connecting from 128.90.153.63 (Russia) as well as 178.208.168.141 (UK) and it doesn't load for either:

curl --connect-timeout 10 http://grafana.condenast.ru/dashboard/snapshot/eKO3ZVU1F3k8VaBWpa4Rc416b1kiv2uG
curl: (28) Connection timed out after 10000 milliseconds

Can you try sharing the snapshot via raintank.io?

Nyoroon commented 4 years ago

Oh, it should be https 😅 https://grafana.condenast.ru/dashboard/snapshot/D4NsIwEaQ8TxFksY4TlxRRIo0IZrnr2k

gerhard commented 4 years ago

Got it!

According to your snapshot, there are 5 consumers all connected to node A, but no publishers.

I can see that you have connection & channel churn all on node A, which suggests that your publishers open a connection, publish some messages, then close the connection. Because this happens quickly, there are no publishing channels alive long enough to get message metrics from them.

Can you confirm?

Nyoroon commented 4 years ago

Yes, that's the case. We have a group of short-lived clients, mostly publishers.

gerhard commented 4 years ago

Short-lived clients are a bad practice; this is why:

While you have low connection churn and are unlikely to experience the problems outlined in the guides that I linked to above, it is not recommended to continue using RabbitMQ like this. If it is impractical to change your publishers, consider putting something like https://github.com/cloudamqp/amqproxy in front of RabbitMQ.
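
To illustrate the proxy option: the publishers themselves would only need to point at the proxy's local listener instead of at RabbitMQ directly. A rough sketch with the Python pika client, where the localhost:5673 listener and the queue name are hypothetical example values:

# Sketch: publishing through a local AMQP proxy (e.g. amqproxy) instead of
# connecting straight to RabbitMQ. The proxy keeps its upstream connection to
# the broker open, so short-lived client connections no longer show up as
# connection/channel churn on the RabbitMQ node.
import pika

# Hypothetical values: proxy listening on localhost:5673, queue "some-queue".
params = pika.ConnectionParameters(host="localhost", port=5673)

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.basic_publish(exchange="", routing_key="some-queue", body=b"hello")
connection.close()  # only the client-to-proxy connection is torn down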

Nyoroon commented 4 years ago

Thanks! But metrics from short-lived channels that can't be re-used will be lost anyway :(

gerhard commented 4 years ago

I see what you mean.

How do your clients publish messages? If they are using basic.publish, then channels will remain open.

Ideally, your publishers will be long-lived, same as your consumers are currently.
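
For comparison, a minimal sketch of a long-lived publisher using the Python pika client (queue name and interval are illustrative): the connection and channel are opened once and reused, so the publishing channel stays alive long enough for its counters to be scraped.

# Sketch: a long-lived publisher that reuses one connection and one channel,
# instead of the open-publish-close pattern that causes churn.
import time
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

try:
    while True:
        # Publishing on the same channel means counters such as
        # rabbitmq_channel_messages_published_total keep accumulating
        # on a channel that Prometheus can actually observe.
        channel.basic_publish(exchange="", routing_key="greedy-consumer", body=b"payload")
        time.sleep(1)
finally:
    connection.close()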

Nyoroon commented 4 years ago

I think we mostly use basic.publish, but I need to check the code.

Thanks for the tips, but migrating to proxy sidecars will still take time.

gerhard commented 4 years ago

I can see how exposing more queue metrics would help this specific situation.

I plan on going over all https://github.com/kbudde/rabbitmq_exporter metrics with the goal of closing the gap between the built-in metrics and this external exporter.

Thanks for highlighting the diff!

Nyoroon commented 4 years ago

Maybe a total connections counter would be a good metric for detecting connection churn problems 😁

gerhard commented 4 years ago

See Channels & Connections at the bottom of your RabbitMQ Overview dashboard (screenshot from your earlier snapshot):

(screenshot: Channels & Connections panels from the earlier snapshot)

Btw, if you decide to add numbers (e.g. 1,2,3) to your nodes instead of letters (e.g. a,b,c) you will get this for free: https://www.rabbitmq.com/prometheus.html#graph-colour-labelling
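
If you want a number rather than a graph, the same endpoint exposes global churn counters (rabbitmq_connections_opened_total and friends, if I remember the names correctly). A rough sketch that samples them twice and prints a per-second rate; verify the counter names against your own /metrics output:

# Sketch: sample the connection/channel churn counters twice and print the rate.
# The counter names are assumed from the plugin's churn metrics; double-check
# them against your own /metrics output.
import time
from urllib.request import urlopen

METRICS_URL = "http://localhost:15692/metrics"
COUNTERS = ("rabbitmq_connections_opened_total", "rabbitmq_channels_opened_total")
INTERVAL = 60  # seconds between samples

def sample():
    totals = dict.fromkeys(COUNTERS, 0.0)
    with urlopen(METRICS_URL) as response:
        for line in response.read().decode("utf-8").splitlines():
            for name in COUNTERS:
                if line.startswith(name):
                    totals[name] += float(line.rsplit(" ", 1)[1])
    return totals

before = sample()
time.sleep(INTERVAL)
after = sample()
for name in COUNTERS:
    print(f"{name}: {(after[name] - before[name]) / INTERVAL:.2f}/s")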

Nyoroon commented 4 years ago

Thanks, missed that 😅

murarustefaan commented 4 years ago

Hi! It would also be nice to have the previous rabbitmq_queue_consumer_utilisation metric in this plugin. Any chance of that happening?

gerhard commented 4 years ago

It would also be nice to have the previous rabbitmq_queue_consumer_utilisation metric in this plugin. Any chance of that happening?

Yes.

The first step was to expose all metrics from the RabbitMQ Management Overview, then work our way through what was left. When we start on the RabbitMQ Queue Grafana dashboard, the rabbitmq_queue_consumer_utilisation metric will come up.

FWIW, all open-source dashboards will be available here: https://grafana.com/orgs/rabbitmq/dashboards. Three more are done but not uploaded yet, and another one is to come before RabbitMQ Queue gets tackled (including this missing metric).

Thanks for another nudge re: missing metrics. They can't come soon enough, I know!

rickloveslamp commented 4 years ago

If I could add to this: the gauge "queue_head_message_timestamp" from kbudde's exporter is one of the most important for us. I've been trying to move my customers over to using the timestamp for alerting rather than the number of messages.
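
The idea, roughly, is to alert on the age of the oldest (head) message rather than on queue depth. A sketch assuming a per-queue gauge shaped like kbudde's rabbitmq_queue_head_message_timestamp (a Unix timestamp), which this plugin does not expose yet:

# Sketch: flag queues whose head (oldest) message is older than a threshold.
# Assumes a per-queue gauge named rabbitmq_queue_head_message_timestamp holding
# a Unix timestamp, as in kbudde's exporter; not exposed by this plugin yet.
import time
from urllib.request import urlopen

METRICS_URL = "http://localhost:15692/metrics"
MAX_AGE_SECONDS = 300

now = time.time()
with urlopen(METRICS_URL) as response:
    for line in response.read().decode("utf-8").splitlines():
        if line.startswith("rabbitmq_queue_head_message_timestamp{"):
            labels, value = line.rsplit(" ", 1)
            age = now - float(value)
            if age > MAX_AGE_SECONDS:
                print(f"head message is {age:.0f}s old: {labels}")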

gerhard commented 4 years ago

That is good to know, thanks for the input @rickloveslamp. I will try to work queue_head_message_timestamp into the next series of queue-specific metrics 👍

michaelklishin commented 4 years ago

We have previously asked developers not to depend on that metric, as it may go away (as can a few other queue implementation details). I guess if it hasn't gone away some three years later, then it can be considered for this exporter.

aldobongio commented 4 years ago

Hi! It would be nice to have the kbudde rabbitmq_partitions metric in the Prometheus plugin. Thanks for the great work!

gerhard commented 4 years ago

It would be nice to have the kbudde rabbitmq_partitions metric in the Prometheus plugin.

I was thinking something similar earlier today: https://github.com/rabbitmq/rabbitmq-prometheus/commit/b5de702965b9d0869fdc217f5924e5f39815b03c (see the singlestat panels section).

👍

hrobertson commented 4 years ago

The rabbitmq_queue_consumers metric is missing a value on my 3.8.1 instance. The TYPE and HELP lines are present, but there is no actual value line.

# TYPE rabbitmq_queue_consumers gauge
# HELP rabbitmq_queue_consumers Consumers on a queue 

The management UI shows numerous consumers on various channels.

gerhard commented 4 years ago

@hrobertson that implies that no consumers are consuming from any of the queues.

From the queues page, can you enable the Consumers count column and confirm that your queues have consumers bound to them?

hrobertson commented 4 years ago

@gerhard Yes there are many consumers.

gerhard commented 4 years ago

The only explanation that I can think of is that the data stored in queue_stats diverged from the data stored in queue_metrics. Since this plugin simply reports what is in queue_metrics, any divergence from queue_stats would show up as the differences that you are describing.

The best that I can do is recommend you open a new issue against rabbitmq/rabbitmq-management. Provide the following output from the node where the queues are running on:

  1. rabbitmqctl status
  2. rabbitmqctl list_queues name,consumers --formatter json
  3. rabbitmqctl eval 'io:format("QUEUE STATS:~n~p~n~nQUEUE METRICS:~n~p~n~n", [ets:tab2list(queue_stats), ets:tab2list(queue_metrics)]).'

hrobertson commented 4 years ago

@gerhard In the output of that last command I can see that queue_metrics has the correct non-zero number of consumers for the various queues. I'm guessing therefore that I shouldn't post it to rabbitmq-management.

I created a new classic queue "test" in the default vhost and subscribed a consumer to it and even that does not show in the Prometheus metrics.

Here's the output from that command for just the "test" queue:

PS C:\Program Files\RabbitMQ Server\rabbitmq_server-3.8.1\sbin> .\rabbitmqctl.bat eval 'io:format(""QUEUE STATS:~n~p~n~nQUEUE METRICS:~n~p~n~n"", [ets:tab2list(queue_stats), ets:tab2list(queue_metrics)]).'
QUEUE STATS:
[{{resource,<<"/">>,queue,<<"test">>},
  [{idle_since,<<"2019-11-25 16:05:12">>},
   {consumer_utilisation,''},
   {policy,''},
   {operator_policy,''},
   {effective_policy_definition,#{}},
   {exclusive_consumer_tag,''},
   {single_active_consumer_tag,''},
   {consumers,1},
   {memory,55600},
   {recoverable_slaves,''},
   {state,running},
   {garbage_collection,[{max_heap_size,0},
                        {min_bin_vheap_size,46422},
                        {min_heap_size,233},
                        {fullsweep_after,65535},
                        {minor_gcs,13}]},
   {messages_ram,0},
   {messages_ready_ram,0},
   {messages_unacknowledged_ram,0},
   {messages_persistent,0},
   {message_bytes,0},
   {message_bytes_ready,0},
   {message_bytes_unacknowledged,0},
   {message_bytes_ram,0},
   {message_bytes_persistent,0},
   {head_message_timestamp,''},
   {backing_queue_status,#{avg_ack_egress_rate => 0.0,
                           avg_ack_ingress_rate => 0.0,avg_egress_rate => 0.0,
                           avg_ingress_rate => 0.0,
                           delta => [delta,undefined,0,0,undefined],
                           len => 0,mode => default,next_seq_id => 0,q1 => 0,
                           q2 => 0,q3 => 0,q4 => 0,
                           target_ram_count => infinity}},
   {messages_paged_out,0},
   {message_bytes_paged_out,0}]}]

QUEUE METRICS:
[{{resource,<<"/">>,queue,<<"test">>},
  [{idle_since,1574697912405},
   {consumer_utilisation,''},
   {policy,''},
   {operator_policy,''},
   {effective_policy_definition,[]},
   {exclusive_consumer_pid,''},
   {exclusive_consumer_tag,''},
   {single_active_consumer_pid,''},
   {single_active_consumer_tag,''},
   {consumers,1},
   {memory,55600},
   {slave_pids,''},
   {synchronised_slave_pids,''},
   {recoverable_slaves,''},
   {state,running},
   {garbage_collection,[{max_heap_size,0},
                        {min_bin_vheap_size,46422},
                        {min_heap_size,233},
                        {fullsweep_after,65535},
                        {minor_gcs,13}]},
   {messages_ram,0},
   {messages_ready_ram,0},
   {messages_unacknowledged_ram,0},
   {messages_persistent,0},
   {message_bytes,0},
   {message_bytes_ready,0},
   {message_bytes_unacknowledged,0},
   {message_bytes_ram,0},
   {message_bytes_persistent,0},
   {head_message_timestamp,''},
   {disk_reads,0},
   {disk_writes,0},
   {backing_queue_status,[{mode,default},
                          {q1,0},
                          {q2,0},
                          {delta,{delta,undefined,0,0,undefined}},
                          {q3,0},
                          {q4,0},
                          {len,0},
                          {target_ram_count,infinity},
                          {next_seq_id,0},
                          {avg_ingress_rate,0.0},
                          {avg_egress_rate,0.0},
                          {avg_ack_ingress_rate,0.0},
                          {avg_ack_egress_rate,0.0}]},
   {messages_paged_out,0},
   {message_bytes_paged_out,0}],
  0}]

ok

There are a load of other metrics which have no values. I've just copy-pasted this chunk from the metrics page:

# TYPE rabbitmq_queue_consumers gauge
# HELP rabbitmq_queue_consumers Consumers on a queue
# TYPE rabbitmq_queue_process_memory_bytes gauge
# HELP rabbitmq_queue_process_memory_bytes Memory in bytes used by the Erlang queue process
# TYPE rabbitmq_queue_messages_bytes gauge
# HELP rabbitmq_queue_messages_bytes Size in bytes of ready and unacknowledged messages
# TYPE rabbitmq_queue_messages_ram gauge
# HELP rabbitmq_queue_messages_ram Ready and unacknowledged messages stored in memory
# TYPE rabbitmq_queue_messages_ready_ram gauge
# HELP rabbitmq_queue_messages_ready_ram Ready messages stored in memory
# TYPE rabbitmq_queue_messages_ready_bytes gauge
# HELP rabbitmq_queue_messages_ready_bytes Size in bytes of ready messages
# TYPE rabbitmq_queue_messages_unacked_ram gauge
# HELP rabbitmq_queue_messages_unacked_ram Unacknowledged messages stored in memory
# TYPE rabbitmq_queue_messages_unacked_bytes gauge
# HELP rabbitmq_queue_messages_unacked_bytes Size in bytes of all unacknowledged messages
# TYPE rabbitmq_queue_messages_persistent gauge
# HELP rabbitmq_queue_messages_persistent Persistent messages
# TYPE rabbitmq_queue_messages_persistent_bytes gauge
# HELP rabbitmq_queue_messages_persistent_bytes Size in bytes of persistent messages
# TYPE rabbitmq_queue_messages_paged_out gauge
# HELP rabbitmq_queue_messages_paged_out Messages paged out to disk
# TYPE rabbitmq_queue_messages_paged_out_bytes gauge
# HELP rabbitmq_queue_messages_paged_out_bytes Size in bytes of messages paged out to disk

gerhard commented 4 years ago

You are right: this commit https://github.com/rabbitmq/rabbitmq-prometheus/commit/0057f07d1e870e9b2f000ad25865a89b09938c04#diff-fa0286c717bf744481489229d1e4ab5eR156 introduced a wrong property reference. The plugin reads queue_consumers from the queue_metrics ETS table, while the property is actually named consumers, as shown in your output.

Moving this to a new issue; a fix will follow shortly. Thanks for spotting this, bug confirmed 👍

gerhard commented 4 years ago

@murarustefaan #20 adds the metric that you were missing.

espenwa commented 4 years ago

It would be nice to have the kbudde rabbitmq_partitions metric in the Prometheus plugin.

I was thinking something similar earlier today: b5de702 (see the singlestat panels section).

👍

Do you know when it might be possible to get rabbitmq_partitions exposed? It would make alerting on split-brain situations much easier.

michaelklishin commented 4 years ago

This plugin exposes some individual queue metrics but also aggregates them by default, for reasons well demonstrated in #24. If some metrics do not make sense or are missing, we should use new individual issues for them.

@gerhard I assume that this can be closed, even if some metrics are not exposed.

michaelklishin commented 4 years ago

@espenwa when someone contributes it. This is open source software.

espenwa commented 4 years ago

@espenwa when someone contributes it. This is open source software.

Sure, and I really appreciate the great work you all are doing. Since Gerhard had mentioned it, I just wanted to check whether he intended to implement it. It's not my intention to be demanding.