rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

[Prometheus] Detailed per-queue metrics for incoming/deliver/ack rate #8691

Open ralbertazzi opened 1 year ago

ralbertazzi commented 1 year ago

Is your feature request related to a problem? Please describe.

Hi there!

I'd like to be able to scrape the equivalent of the per-queue incoming/deliver/ack rates that are visible on the RabbitMQ Management page. As far as I understand, this is currently possible by scraping the detailed metrics under channel_queue_metrics and channel_queue_exchange_metrics. However, my setup consists of a limited set of queues (< 50) and a very high number of connections/channels (10k to 100k), so I cannot afford to scrape those per-channel metrics.
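For context, this is roughly how I could get those numbers today from the detailed endpoint (a sketch: the default rabbitmq_prometheus port 15692 and the family selection syntax come from the plugin's documentation, and the host name is a placeholder):

```shell
# Scrape only the per-channel/per-queue families from the detailed endpoint.
# With 10k-100k channels this response becomes huge, which is exactly the problem described above.
curl -s 'http://my-rabbit-host:15692/metrics/detailed?family=channel_queue_metrics&family=channel_queue_exchange_metrics'
```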

Describe the solution you'd like

I'd like those metrics to be exposed per queue, inside queue_coarse_metrics or queue_metrics depending on how expensive their computation is.
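Purely as an illustration of the shape I have in mind (these metric names and labels are hypothetical, not existing RabbitMQ metrics), the per-queue samples could look like:

```
# HYPOTHETICAL sample output; metric names, queue name and values are illustrative only
rabbitmq_queue_messages_published_total{vhost="/",queue="orders"} 123456
rabbitmq_queue_messages_delivered_total{vhost="/",queue="orders"} 123100
rabbitmq_queue_messages_acked_total{vhost="/",queue="orders"} 123050
```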

Describe alternatives you've considered

No response

Additional context

No response

Dentxinho commented 1 year ago

I totally support this. Scraping such metrics becomes slow because the channel label values keep increasing indefinitely. queue_messages_published_total could live inside queue_coarse_metrics.

We stayed on 3.7 and kept using prometheus_rabbitmq_exporter because of this issue, but now we have to perform a mandatory upgrade. Losing these metrics will cause problems for our server monitoring and alerting.

lukebakken commented 1 year ago

@ralbertazzi @Dentxinho @elonzh @JanEggers @berkanteber @ronaldovcjr @GabrielHenriqueS @atilafsantos @eloisamorim

Any interest in providing assistance in implementing this feature? Assistance could be...

Dentxinho commented 1 year ago

@lukebakken I can test this feature before going live.

ikavgo commented 1 year ago

@Dentxinho I'm probably the author of the plugin you still use. Maybe I can help.

ikavgo commented 1 year ago

@ralbertazzi why do you want per-queue metrics? Will you alert based on queue name? Please tell me more about how they will actually be used - graphed, alerted on, or something else?

ralbertazzi commented 1 year ago

I'd say it would be mostly used for graphing rather than alerting.

Right now we graph the number of ready and unacked messages for each queue, but that is not enough to understand why messages start queuing up (did incoming traffic increase? Did consumer capacity decrease? Is there an issue in our consumers? Has our routing logic changed?).

Furthermore, that information could be useful to fine-tune consumer auto-scaling (whose metrics are fetched from Prometheus): for example, if the incoming message rate is twice the ack rate, we might want to double the number of consumers.
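As a sketch of that scaling signal, assuming per-queue publish and ack counters existed under hypothetical names like the ones below, the ratio is a single PromQL expression:

```promql
# Incoming rate divided by ack rate, per queue (metric names are assumed/hypothetical).
# A sustained value around 2 would suggest roughly doubling the consumers for that queue.
sum by (queue) (rate(rabbitmq_queue_messages_published_total[5m]))
  /
sum by (queue) (rate(rabbitmq_queue_messages_acked_total[5m]))
```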

ikavgo commented 1 year ago

Thanks,

Messages can start queuing up for a host of reasons: some are Rabbit-specific, some are environment-specific, and some are in your code. Having the per-queue queue length gives no information about the reason, just as you write.

If you still have your SLO/SLA honored by the API, why would you want to know queue lengths? If your user-facing load misbehaves you will know it anyway from app-specific metrics, and then it's a matter of investigating the whole stack, isn't it? Say your queue length grows because your consumers are OOMed: you will probably know that from the respective OS/k8s/process metrics.

Why can't you tune your consumers' auto-scaling based on their own metrics? That is, there are metrics emitted by producers and metrics emitted by consumers; a simple expression lets you compare them and alert. Then, upon an alert, you can investigate why it is happening, which is not necessarily because of Rabbit.

So I wonder what kind of app metrics you have (or don't have) that makes you want to rely on Rabbit's.

ralbertazzi commented 1 year ago

Your reasoning relies on the following assumptions:

  1. Applications do expose metrics
  2. Application metrics are reliable
  3. One has visibility over the entire stack of applications that interact with the instance

It's very hard to achieve all three points, especially the third one. We have a big RabbitMQ instance that is shared among multiple user-facing applications managed by different teams and so having visibility over everything is impractical.

When dealing with databases and message queues (PostgreSQL, Redis, and GCP Pub/Sub, to name a few), I've always found it better to have server-side metrics, which ideally should be reliable and considered a single source of truth, rather than reinventing the wheel every time I develop an application that interfaces with such services. RabbitMQ should do the same IMO, and it already does it great! It's just missing a few useful pieces of information :)

From a scalability point of view, it's also better to have centralized metrics emitted by a single component (the RabbitMQ instance in this case) rather than aggregating application metrics that are potentially emitted by thousands or hundreds of thousands of components. This is exactly why we can't fetch this information from the per-channel metrics: we have a high number of consumers.

if you still have your SLO/SLA honored by the API, why would you want to know queue lengths?

We don't have an SLO/SLA and I don't want to know the queue length - I do actually, but that is not the point of this issue. I want to know queue rates. Having this information would increase the observability of the tool. To me this feels like asking "why would you want to know the RAM usage of your app if the SLO is honored?" - I generally don't, but if there's an outage I want to investigate what's happening, and then I do want to know what the RAM usage of my app is.

Forgive me, I'm really failing to understand why there's such pushback on this feature (at least that's my feeling). This information is actually already available in three different places (the RabbitMQ Management UI, the RabbitMQ Management HTTP API, and the very same Prometheus metrics, just per-channel instead of per-queue), one of them being rabbitmq-prometheus itself. I would find it really ugly to scrape these metrics from the Management HTTP API when there's such an awesome plugin already available that is very close to implementing this :)

ikavgo commented 1 year ago

There is no pushback, at least not from me. I personally just try to understand what is going on.

From a simple RabbitMQ maintainer POV I have this concern: you have a scenario with a few queues and many, many connections, so exposing per-queue metrics is not a performance concern. Others have the opposite - many, many queues. So adding new metrics is actually like adding an API, right? We commit ourselves to maintaining it. Now, suppose we added a new set of per-queue metrics and you are happy. Then the next user comes along and complains that it crashes or slows down Rabbit because they have thousands of queues.

I didn't make any assumptions about where your applications run (as in server vs non-server). From Rabbit's perspective as middleware, almost everything is an application.

If you have thousands or hundreds of thousands of components, how do you debug starting from an alert on queue length (rate)? Frankly speaking, from here it looks like you want us to implement something to fill the gaps in your observability stack. Again, there is nothing wrong with having as many RabbitMQ metrics as needed to deal with Rabbit. I personally don't think Rabbit's metrics should be used for monitoring apps.

So this is a friendly chat where I try to understand your case better and come up with an improvement that benefits not only you and won't give us headaches down the road with other use cases.

ralbertazzi commented 1 year ago

There is no pushback, at least not from me.

All right, thanks for making that explicit :)

Others have opposite - many-many queues. [...] Then next user comes and complains it crashes or slows down Rabbit because they have 1000s queues.

That's exactly the reason why the /metrics/detailed endpoint exists, right? In the same way that I don't scrape per-channel metrics because I have a ton of consumers, users with a ton of queues should refrain from querying per-queue metrics. As you say, it depends on the use case, and it would be ideal for RabbitMQ to support both (right now it only supports the case of a reasonable number of queues and a limited number of consumers).

Frankly speaking, from here it looks like you want us to implement something to fill the gaps in your observability stack.

It's not, I think.

how do you debug starting from an alert on queue length (rate)?

I wouldn't create alerts on queue rates, as I've written above. They would be additional information that can be used to understand what's going on.

ikavgo commented 1 year ago

I obviously don't have the advantage of knowing your stack fully :-) It is a weird thing for me that folks want RabbitMQ at the center of their observability world. Of course, the fact that data goes through it in a centralized way kind of nudges you in that direction, yeah?

If you won't alert on these metrics, why isn't it enough to go to the Management UI when you want to understand what's going on?

ralbertazzi commented 1 year ago

Because

  1. It's yet another observability tool. We strive to have everything in Prometheus and Grafana.
  2. We use RabbitMQ deployed on CloudAMQP. To access the Management UI one needs both CloudAMQP permissions and IP whitelisting. That's tedious and limits the visibility of this information for anyone who may want to understand what's going on.
  3. Metrics in the RabbitMQ Management UI have a maximum retention of 1h AFAIK, whereas it would be great to retain them for longer.

michaelklishin commented 1 year ago

@ralbertazzi then you should contribute what you need or bug CloudAMQP to contribute what their customers need. CloudAMQP doesn't pay a dime for RabbitMQ, despite directly making money off of it, and we are being asked to do more to accommodate their users? Sure, sounds reasonable!

ralbertazzi commented 1 year ago

I can understand the frustration, and I'm sorry for that. I have zero Erlang experience, but I can definitely ask CloudAMQP for a contribution. Nevertheless, I believe this issue should be taken into consideration by the community (as it currently is) regardless of whether I'm using CloudAMQP or some other way of running RabbitMQ - isn't that reasonable too?

ralbertazzi commented 1 year ago

(I am an active contributor to open source projects too; I don't receive money from companies that use my contributions, but I also don't prioritize incoming feature requests based on whether the request comes from the corporate world or from an individual developer.)

michaelklishin commented 1 year ago

@ralbertazzi it's not the only factor we consider, of course, but when someone uses CloudAMQP's limitations as an argument, I find it very reasonable to recommend that CloudAMQP solve this. They get all the RabbitMQ improvements without contributing much, especially recently.

Some RabbitMQ-as-a-Service-related improvements benefit all providers, others do not. If this were an obvious improvement, this conversation would have been much shorter.

ralbertazzi commented 1 year ago

Please also note that of the three points I listed, only one can be considered a CloudAMQP limitation. I would have opened the issue even if point 2 had been completely solved.

Dentxinho commented 1 year ago

Just sharing our use case here:

Like I said, we have been using v3.7 with plugin https://github.com/deadtrickster/prometheus_rabbitmq_exporter

This plugin exposes rabbitmq_queue_messages_published_total like this:

[screenshot]

We rely on these metrics for monitoring / alerting:

PromQL:

published: sum(rate(rabbitmq_queue_messages_published_total{queue="my-queue", name="my-rabbit"}[30s]))
delivered: sum(rate(rabbitmq_queue_messages_delivered_total{queue="my-queue", name="my-rabbit"}[30s]))

[screenshot] (This is just one of our flow dashboards, but there are many more)

We can identify whether some application flow is producing more, fewer, or no messages and act to remediate. Sometimes we need to scale consumers manually; sometimes the producer has some sort of problem and stops or slows down. You can see that rabbitmq_queue_messages_ready/unacked doesn't help with identifying publish/deliver issues.
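For example, one of those remediation triggers could be sketched as a PromQL alert expression like the following (it uses the metric name exposed by prometheus_rabbitmq_exporter as shown above; the queue name, window, and threshold are just placeholders):

```promql
# Fires when a flow that normally publishes has produced nothing for 10 minutes
# (queue/name label values and the 10m window are illustrative).
sum(rate(rabbitmq_queue_messages_published_total{queue="my-queue", name="my-rabbit"}[10m])) == 0
```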

From the built-in Prometheus plugin, the aggregated metrics don't help much:

[screenshot]

Detailed metrics bring per-queue aggregation and can be used to run the PromQL queries above. The main problem is that they bring aggregation by channel too. Publisher channels can be opened and closed, and the channel label values keep increasing indefinitely:

[screenshot]

With many, many queues and many, many producer channels, this slows down the metrics endpoint a lot. IMO the whole point of https://github.com/rabbitmq/rabbitmq-prometheus/issues/24 was this "group by channel label" thing and not "which metrics are being exposed".

The OP suggests per-queue publish/deliver metrics inside queue_coarse_metrics, without the channel labels. This mimics the prometheus_rabbitmq_exporter plugin's behaviour for those metrics.

Edit, just to clarify: I support the OP's suggestion :)

ikavgo commented 1 year ago

Queue metrics in that plugin are also configurable, via queue_messages_stat.

Dentxinho commented 1 year ago

@lukebakken @ikvmw I provided more info on how we would be using these metrics.

Will you consider implementing such a change?

gomoripeti commented 1 year ago

I am happy to take a stab at the implementation (sooner or later) if there is confirmation from the Core Team that this can be added and exactly what. (Unfortunately the burden of maintenance will still lie on the Core Team.)

Dentxinho commented 1 year ago

I am happy to take a stab at the implementation (sooner or later) if there is confirmation from the Core Team that this can be added and exactly what. (Unfortunately the burden of maintenance will still lie on the Core Team.)

@lukebakken @michaelklishin

michaelklishin commented 1 year ago

@gomoripeti you are welcome to try. I cannot guarantee that such a change would be accepted because I haven't touched metric storage in a long time. If it's a matter of exposing existing metrics, I'd say the risk of your PR being rejected is low.

But then again, it's a decision that @ikvmw @dcorbacho @mkuratczyk would be in a much better position to make.

cracking-dudes commented 1 month ago

Hey all, has this request been taken into consideration for development, or have any patches been made to include incoming/deliver/ack rates in the Prometheus plugin?

gomoripeti commented 1 month ago

Hi @cracking-dudes, 4.0 introduced new detailed metric groups (similar to the per-channel groups, but without the channel label). Would the metrics in the queue_delivery_metrics group be sufficient for your needs?
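If so, pulling just that family from the detailed endpoint on a 4.0 node would presumably look something like this (a sketch: /metrics/detailed and the default plugin port 15692 are standard, the family name is the one mentioned above, and the host is a placeholder):

```shell
# Request only the per-queue delivery metrics group introduced in 4.0.
curl -s 'http://my-rabbit-host:15692/metrics/detailed?family=queue_delivery_metrics'
```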

Maybe this issue can be marked as resolved with https://github.com/rabbitmq/rabbitmq-server/pull/11559

cracking-dudes commented 1 month ago

Hi @gomoripeti, since we use version 3.13.3 I can't comment on whether the issue is resolved,

but as far as I understand, queue_delivery_metrics will solve the problem. Someone who has installed version 4.0 can comment on resolving this issue. Thanks.

gomoripeti commented 1 month ago

sorry, I meant to address the second part of my comment to the Core Team.

ralbertazzi commented 1 month ago

As the original creator of the issue, I think - by just looking at the documentation ⚠️ - that the new metrics group does indeed cover some of the per-queue metrics that can be viewed in the Management page. Thanks a lot for adding them!

I think, though, that the incoming messages metric is still missing. It would be great to have both input (incoming) and output (delivery, ack) metrics, since you could then observe them or build alerts when you want the two rates to stay roughly the same.