ralbertazzi opened this issue 1 year ago
I totally support this.
Scraping such metrics becomes slow because channel label values keep increasing indefinitely. queue_messages_published_total could be inside queue_coarse_metrics.
We didn't upgrade from 3.7 and kept using prometheus_rabbitmq_exporter
because of this issue, but now we have to make a mandatory upgrade. Losing such metrics will cause problems with our server monitoring and alerting.
@ralbertazzi @Dentxinho @elonzh @JanEggers @berkanteber @ronaldovcjr @GabrielHenriqueS @atilafsantos @eloisamorim
Any interest in providing assistance in implementing this feature? Assistance could be...
@lukebakken I can test this feature before going live.
@Dentxinho probably I'm the author of the plugin you still use. Maybe I can help.
@ralbertazzi why do you want per-queue metrics? Will you alert depending on queue name? Please tell me more about how it will actually be used - graphed, alerted, or something else?
I'd say it would be mostly used for graphing rather than alerting.
Right now we graph the number of ready and unacked messages for each queue, but that is not enough to understand why messages can start getting queued up (did incoming traffic increase? Did consumer capacity decrease? Is there an issue in our consumers? Has our routing logic changed?).
Furthermore, that information could be useful to fine tune consumer auto scaling (whose metrics are fetched from Prometheus): for example, if the incoming message rate is twice the ack rate we might want to double the number of consumers.
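As a rough sketch of that autoscaling signal - the per-queue counter names below are hypothetical, just for illustration, since the built-in plugin does not expose per-queue equivalents today:

```promql
# Ratio of publish rate to delivery rate for one queue over 5 minutes;
# a value around 2 would suggest roughly doubling the number of consumers.
# (Metric names are illustrative placeholders.)
sum(rate(rabbitmq_queue_messages_published_total{queue="my-queue"}[5m]))
/
sum(rate(rabbitmq_queue_messages_delivered_total{queue="my-queue"}[5m]))
```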
Thanks,
messages can start getting queued up for a host of reasons; some of them are Rabbit-specific, some env-specific, and some are in your code. Having per-queue queue length gives no info about the reason, just as you wrote.
if you still have your SLO/SLA honored by the API, why would you want to know queue lengths? If your user-facing load misbehaves, you will know it anyway from app-specific metrics, and it's a matter of investigating the whole stack, isn't it? Say, if your queue length grows because your consumers are OOM-killed, you will probably know that from the respective OS/k8s/process metrics.
why can't you tune your consumers' auto-scaling based on their own metrics? I.e. there are metrics emitted by producers and metrics emitted by consumers. A simple expression lets you compare them and alert. And then, upon alert, you can investigate why it is happening, which is not necessarily because of Rabbit.
So I wonder what kind of app metrics you have (or don't have) that make you want to rely on Rabbit's.
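For example, a minimal sketch of such a comparison, using made-up application-side metric names (nothing RabbitMQ-specific here):

```promql
# Hypothetical app counters: producers count what they publish,
# consumers count what they ack. Alert when producers outpace
# consumers by more than 50% over 10 minutes.
sum(rate(myapp_messages_published_total[10m]))
  > 1.5 * sum(rate(myapp_messages_acked_total[10m]))
```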
Your reasoning relies on the fact that:
It's very hard to achieve all three points, especially the third one. We have a big RabbitMQ instance that is shared among multiple user-facing applications managed by different teams and so having visibility over everything is impractical.
When dealing with databases and message queues (to name a few: PostgreSQL, Redis, GCP Pub/Sub), I've always found it better to have server-side metrics - which ideally should be reliable and be considered a single source of truth - rather than reinventing the wheel every time I develop an application that interfaces with such services. RabbitMQ should do the same IMO, and it already does it great! It's just missing a few useful pieces of information :)
From a scalability point of view, it's also better to have centralized metrics emitted by a single component (the RabbitMQ instance in this case) rather than aggregating application metrics that are potentially emitted by thousands or hundreds of thousands of components. This is exactly the reason why we can't fetch this information from the per-channel metrics: we have a high number of consumers.
if you still have your SLO/SLA honored by the API, why would you want to know queue lengths?
We don't have SLO/SLA, and I don't want to know queue length - I do actually, but that is not the point of this issue. I want to know queue rates. Having such information would increase the observability of the tool. This feels to me like "why would you want to know the RAM usage of your app if the SLO is honored?" - I generally don't, but if there's an outage I want to investigate what's happening, and then I do want to know what the RAM usage of my app is.
Forgive me, I'm really failing to understand why there's such pushback on this feature (at least that's my feeling). This information is actually already available in three different places (the RabbitMQ Management UI, the RabbitMQ Management HTTP API, and the very same Prometheus metrics, but per-channel instead of per-queue), one of them being rabbitmq-prometheus itself. I would find it really ugly to scrape these metrics from the Management HTTP API when there's such an awesome plugin already available and very close to implementing this :)
There is no pushback, at least not from me. I personally just try to understand what is going on.
From a RabbitMQ maintainer's point of view I have this concern: you have a scenario with a few queues and many, many connections, where exposing per-queue metrics is not a performance concern. Others have the opposite - many, many queues. So adding new metrics is actually like adding an API, right? We commit ourselves to maintaining it. Now, suppose we added a new set of per-queue metrics and you are happy. Then the next user comes along and complains that it crashes or slows down Rabbit because they have thousands of queues.
I didn't make any assumptions about where your applications run (as in server vs non-server). From Rabbit's perspective as middleware, almost everything is an application.
If you have thousands or hundreds of thousands of components, how do you debug, starting from receiving an alert on queue length (or rate)? Frankly speaking, from here it looks like you want us to implement something to fill the gaps in your observability stack. Again, nothing wrong with having as many RabbitMQ metrics as needed to deal with Rabbit itself. I, personally, don't think Rabbit's metrics should be used for monitoring apps.
So this is a friendly chat where I try to understand your case better and come up with an improvement that is beneficial not only to you and won't give us headaches down the road with other use cases.
There is no pushback, at least not from me.
All right, thanks for making that explicit :)
Others have the opposite - many, many queues. [...] Then the next user comes along and complains that it crashes or slows down Rabbit because they have thousands of queues.
That's exactly the reason why the /metrics/detailed endpoint exists, right? In the same way that I don't scrape per-channel metrics because I have a ton of consumers, users with a ton of queues should refrain from querying per-queue metrics. As you say, it depends on the use case, and it would be ideal for RabbitMQ to support both (right now it only supports the case of a reasonable number of queues and a limited number of consumers).
Frankly speaking, from here it looks like you want us to implement something to fill the gaps in your observability stack.
It's not, I think.
how do you debug, starting from receiving an alert on queue length (or rate)?
I wouldn't create alerts on queue rates, as I've written above. They would be additional information that can be used to understand what's going on.
I obviously don't have the advantage of knowing your stack fully :-) It is a weird thing for me that folks want RabbitMQ at the center of their observability world. Of course, the fact that data flows through it in a centralized way kind of nudges you that way, yeah?
If you won't alert on these metrics, why isn't it enough to go to the Management UI when you want to understand what's going on?
Because
@ralbertazzi then you should contribute what you need or bug CloudAMQP to contribute what their customers need. CloudAMQP don't pay a dime for RabbitMQ, despite directly making money off of it, and we are being asked to do more to accommodate their users? Sure, sounds reasonable!
I can understand the frustration, and I'm sorry for that. I have zero Erlang experience, but I can definitely ask CloudAMQP for a contribution. Nevertheless, I believe this issue should be taken into consideration (as it currently is) by the community, regardless of the fact that I'm using CloudAMQP or other options to run RabbitMQ - isn't that reasonable too?
(I am an active contributor to open source projects too, and I don't receive money from companies that use my contributions, but I also don't prioritize incoming feature requests based on whether the request comes from the corporate world or from an individual developer.)
@ralbertazzi it's not the only factor we consider, of course, but when someone uses CloudAMQP's limitations as an argument, I find it very reasonable to recommend that CloudAMQP solve this. They get all the RabbitMQ improvements without contributing much, especially recently.
Some RabbitMQ-as-a-Service-related improvements benefit all providers, others do not. If this were an obvious improvement, this conversation would have been much shorter.
Please also note that out of the three points that I listed, only one can be considered a CloudAMQP limitation. I would have opened the issue even if point 2 had been completely solved.
Just sharing our use case here:
Like I said, we have been using v3.7 with plugin https://github.com/deadtrickster/prometheus_rabbitmq_exporter
This plugin exposes rabbitmq_queue_messages_published_total and related counters labelled per queue, without any channel label.
We rely on these metrics for monitoring / alerting:
PromQL:
published: sum(rate(rabbitmq_queue_messages_published_total{queue="my-queue", name="my-rabbit"}[30s]))
delivered: sum(rate(rabbitmq_queue_messages_delivered_total{queue="my-queue", name="my-rabbit"}[30s]))
(This is just one of our flow dashboards, but there are many more)
We can identify whether some application flow is producing more, fewer, or no messages, and act to remediate. Sometimes we need to scale consumers manually; sometimes the producer has some sort of problem and stops / slows down.
You can see that rabbitmq_queue_messages_ready / _unacked don't help with identifying publish/deliver issues.
From the built-in prometheus plugin, the aggregated metrics don't help much.
Detailed metrics bring aggregation by queue, and can be used to run the PromQL queries above. The main problem is that they bring aggregation by channel too. Publisher channels can be opened and closed, and channel label values keep increasing indefinitely.
With many, many queues and many, many producer channels, this slows down the metrics endpoint a lot. IMO the whole point of https://github.com/rabbitmq/rabbitmq-prometheus/issues/24 was this "group by channel label" thing and not "which metrics are being exposed".
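For context, this is roughly the per-queue aggregation we have to run over the detailed per-channel metrics today (the rabbitmq_detailed_* metric name below is an example and may not be exact) - the query hides the channel label, but Prometheus still has to scrape and store one series per channel/queue pair:

```promql
# Per-queue delivery rate derived from per-channel detailed metrics.
# The channel label is aggregated away here, yet every (channel, queue)
# combination is still a separate scraped series.
sum by (queue) (
  rate(rabbitmq_detailed_channel_messages_delivered_total[5m])
)
```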
OP suggests per-queue publish/deliver metrics inside queue_coarse_metrics, without the channel labels. This mimics the prometheus_rabbitmq_exporter plugin's behaviour for those metrics.
Edit, just to clarify: I support OP's suggestion :)
Queue metrics in that plugin are also configurable via queue_messages_stat.
@lukebakken @ikvmw I provided more info on how we would be using these metrics.
Will you consider implementing such a change?
I am happy to take a stab at the implementation (sooner or later) if there is confirmation from the Core Team that this can be added, and exactly what. (Unfortunately the burden of maintenance will still lie on the Core Team.)
@lukebakken @michaelklishin
@gomoripeti you are welcome to try. I cannot guarantee that such a change would be accepted because I haven't touched metric storage in a long time. If it's a matter of exposing existing metrics, I'd say the risk of your PR being rejected should be low.
But then again, it's a decision that @ikvmw @dcorbacho @mkuratczyk would be in a much better position to make.
Hey all, is this request being considered for development, or are there any patches that include the incoming/deliver/ack rates in the Prometheus plugin?
hi @cracking-dudes, 4.0 introduced new detailed metric groups (similar to the per-channel groups but without the channel label). Would the metrics in the queue_delivery_metrics group be sufficient for your needs?
Maybe this issue can be marked as resolved with https://github.com/rabbitmq/rabbitmq-server/pull/11559
hi @gomoripeti, since we use version 3.13.3 I can't comment on whether the issue is resolved, but as far as I understand, queue_delivery_metrics should solve the problem. Someone who has installed 4.0 can confirm whether this issue is resolved. Thanks.
sorry, I meant to address the second part of my comment to the Core Team.
As the original creator of the issue, I think - by just looking at the documentation ⚠️ - that the new metrics group does indeed cover some of the per-queue metrics that can be viewed in the Management page. Thanks a lot for adding them!
I think, though, that the incoming messages metric is still missing. It would be great to have both input (incoming) and output (delivery, ack) metrics, since you could then observe them or build alerts in case you want the two rates to stay roughly the same.
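For example - with placeholder metric names, since I haven't checked exactly what the new groups expose - the kind of alert I have in mind would look roughly like this:

```promql
# Fire when a queue's incoming rate exceeds its delivery rate by more
# than 20% over 10 minutes. Metric names are illustrative only.
sum by (queue) (rate(rabbitmq_detailed_queue_messages_published_total[10m]))
  > 1.2 *
sum by (queue) (rate(rabbitmq_detailed_queue_messages_delivered_total[10m]))
```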
Is your feature request related to a problem? Please describe.
Hi there!
I'd like to be able to scrape the equivalent of the per-queue incoming/deliver/ack rates that are visible on the RabbitMQ Management page. As far as I understand, this would currently be possible by scraping the detailed metrics under the channel_queue_metrics and channel_queue_exchange_metrics groups. However, my setup is composed of a limited set of queues (< 50) and a really high number of connections/channels (10k to 100k), therefore I cannot afford to scrape those per-channel metrics.
Describe the solution you'd like
I'd like those metrics to be per-queue, and to find them inside queue_coarse_metrics or queue_metrics, depending on how expensive their computation is.
Describe alternatives you've considered
No response
Additional context
No response