Monitor queues using wildcard

optimistic5 commented 4 years ago

Now I have this alert for Prometheus:

- alert: RabbitmqTooManyMessagesInQueue
  expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Rabbitmq too many messages in queue (instance {{ $labels.instance }})"
    description: "Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Notice that I have: {queue="my-queue"}

To monitor all queues I can use this line, right? rabbitmq_queue_messages_ready

But I want to monitor queues using a wildcard. For example: {queue="guest_*"}

Right now I have to add each queue separately, but there are a lot of them and new queues are created often. This feature would help a lot. Thank you.

michaelklishin commented 4 years ago

This plugin emits metrics for all queues unconditionally. @gerhard this sounds like something other tools should handle.

gerhard commented 4 years ago

You can download the official RabbitMQ-Overview Grafana dashboard to see how we aggregate metrics across multiple objects (queues, connections, channels, etc.). This is the query that will work in your case:

sum(rabbitmq_queue_messages_ready{queue =~ "^guest_.+"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster"}) by(rabbitmq_node)

From v3.8.3 onwards, all metrics are aggregated by default. To enable per-object metrics, you need to add the following line to your rabbitmq.conf file:

prometheus.return_per_object_metrics = true

If the above helps, please close the issue.

optimistic5 commented 4 years ago

This query should work for Prometheus alerting, correct?

gerhard commented 4 years ago

This expression will work for Prometheus alerting if you enable metrics per object (see config from previous comment):

rabbitmq_queue_messages_ready{queue =~ "^guest_.+"} > 1000

If you need to know the cluster name (because you are running more than 1 RabbitMQ deployment that you want to have alerting for), your alerting query will become:

(rabbitmq_queue_messages_ready{queue =~ "^guest_.+"} * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info) > 1000
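
For reference, here is a minimal sketch of how the expression above could be dropped into a complete Prometheus rule file, reusing the alert from the original question. The group name `rabbitmq-queues` is a placeholder, and the sketch assumes per-object metrics are enabled and that the `queue` and `rabbitmq_cluster` labels are present on the result of the join. Prometheus anchors label regexes on both ends, so the leading `^` is harmless but optional.

```yaml
groups:
  - name: rabbitmq-queues  # placeholder group name, pick your own
    rules:
      - alert: RabbitmqTooManyMessagesInQueue
        # Matches every queue whose name starts with "guest_" and joins in the
        # cluster name from rabbitmq_identity_info (requires per-object metrics).
        expr: (rabbitmq_queue_messages_ready{queue=~"^guest_.+"} * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rabbitmq too many messages in queue {{ $labels.queue }} (cluster {{ $labels.rabbitmq_cluster }})"
          description: "Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

Because the expression returns one series per matching queue, this should fire a separate alert for each queue over the threshold, and new `guest_` queues are picked up without editing the rule.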

michaelklishin commented 4 years ago

FTR, per-object metric collection now has a brief documentation section.

lechen26 commented 4 years ago

I see that it's not recommended to use per-object metrics. The thing is, most of our alerts and use cases are based on queues. We usually want to get all metrics and detect when a specific queue has too many ready messages. We use this specifically for HPA (scaling out pods based on queue size).

Until now we've used prometheus_rabbitmq_exporter, but I saw that this exporter now comes built in, so I thought it would be the better choice, especially since the Grafana dashboard relies on it. But how do you monitor the RabbitMQ cluster as a whole? Is viewing per-queue metrics a use case only we have? Is there anything else we should do besides enabling the "not for production" flag? Thanks!

michaelklishin commented 4 years ago

@lechen26 FYI, this is not a support forum, and the RabbitMQ core team would greatly appreciate it if you moved all questions, about this plugin or otherwise, to the mailing list.

Metric aggregation is a practical necessity in environments with a lot of objects. See some payload sizes and generation times mentioned in #24. It is not realistic to alert on "metric X in queue Q" when you have 200K queues, each with 35-40 metrics. A scrape response of that size simply cannot be generated and transferred within a practical response time. In that case you alert on the overall state of your system and then humans narrow it down using other available tools.

Again, unfortunately, N objects by M metrics each and 2-3 lines (including metadata/comments) per metric can produce a very large response. It's an output format issue, not a plugin implementation one, so any Prometheus exporter would face it at some point and either do what we did or end up in the scenario outlined in #24.
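
As a rough worked example, taking the figures from this thread at face value (200K queues, 35-40 metrics each, 2-3 lines per metric; these are illustrative inputs, not measurements), the number of lines in a single scrape response would be on the order of:

$$
\begin{aligned}
\text{low end: } & 200{,}000 \times 35 \times 2 = 14{,}000{,}000 \text{ lines} \\
\text{high end: } & 200{,}000 \times 40 \times 3 = 24{,}000{,}000 \text{ lines}
\end{aligned}
$$

That is the response for one scrape, and it is produced again at every scrape interval.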

michaelklishin commented 4 years ago

I'm not sure what this "not for production" flag is. This plugin is recommended for production. It even has two modes of operation now, one for those who want an efficient and compact overview (aggregated metrics) and another for those who want the best per-object fidelity. The original plugin does not give you much choice.

michaelklishin commented 4 years ago

OK, I think I understand what the "not for production" comment was about. We will try to edit the docs to explain the options and recommendations.

michaelklishin commented 4 years ago

@lechen26 let me know if the updated doc section makes more sense to you.

rabbitmq / rabbitmq-prometheus

Monitor queues using wildcard #31