rabbitmq / rabbitmq-prometheus

A minimalistic Prometheus exporter of core RabbitMQ metrics

A way to reduce the number of metric lines emitted #26

Closed michaelklishin closed 4 years ago

michaelklishin commented 4 years ago

See #24 and #25 for the background.

The scraping endpoint serves a good dozen metrics per stats-emitting entity, and not all of them may be used. For example, with hundreds of thousands of queues and connections, the number of lines in the response runs into many millions and the data transfer into hundreds of MiB, which is beyond what the default settings of this plugin and Prometheus can realistically handle.
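As a rough back-of-the-envelope illustration (the per-entity metric count and average line length below are assumptions for the sake of the arithmetic, not measured values):

entities = 200_000            # e.g. 100k queues plus 100k connections
metrics_per_entity = 12       # "a good dozen" metrics per stats-emitting entity
avg_bytes_per_line = 60       # assumed average exposition line length

lines = entities * metrics_per_entity              # 2,400,000 lines
payload_mib = lines * avg_bytes_per_line / 2**20   # roughly 137 MiB uncompressed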

One way would be to allow excluding groups, something like this:

prometheus.collectors.opt-out.groups.1 = channel_exchange_metrics
prometheus.collectors.opt-out.groups.2 = channel_queue_metrics

but perhaps a more practical way would be to just delist specific metrics:

prometheus.collectors.opt-out.metrics.1 = queue_messages_paged_out_bytes
prometheus.collectors.opt-out.metrics.2 = queue_messages_ready_bytes

Worth noting that such a fine-grained approach would certainly break some Grafana dashboard assumptions šŸ¤·ā€ā™‚ļøšŸ¤·ā€ā™€ļø, but it would allow the key metrics to be kept while halving the amount of data that has to be rendered, compressed and transferred.
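To make the delisting idea concrete, here is a minimal sketch of how a collector could honour such an opt-out list when rendering the response. This is Python pseudocode rather than the plugin's actual Erlang implementation, and the function and variable names are hypothetical:

# Hypothetical sketch: skip metric families whose names have been opted out
OPTED_OUT = {
    "queue_messages_paged_out_bytes",
    "queue_messages_ready_bytes",
}

def render_exposition(metric_families):
    # metric_families: iterable of (name, help_text, samples) tuples
    for name, help_text, samples in metric_families:
        if name in OPTED_OUT:
            continue                      # delisted: emit nothing for this metric
        yield "# HELP %s %s" % (name, help_text)
        for labels, value in samples:
            yield "%s%s %s" % (name, labels, value)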

michaelklishin commented 4 years ago

Just to add a data point: a test run with 50K queues produced a roughly 60 MiB response with over 1M lines.

lukebakken commented 4 years ago

Should we automatically drop data when the number of queues exceeds a certain limit, rather than leaving it up to users to figure out what to opt out of?

Or make certain statistics opt-in?

michaelklishin commented 4 years ago

One obvious item that we could potentially exclude is the HELP comments in the output, but I don't know whether that would violate the response spec. It should cut roughly a third of the bandwidth.
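For reference, comment lines in the text exposition format all start with "#", so dropping them is mechanical; whether consumers are happy without the HELP and TYPE lines is the part that would need checking against the format spec. A sketch:

def strip_comment_lines(exposition_text):
    # Remove "# HELP ..." and "# TYPE ..." lines, keeping only the sample lines.
    return "\n".join(
        line for line in exposition_text.splitlines()
        if not line.startswith("#")
    )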

gerhard commented 4 years ago

Following on https://github.com/rabbitmq/rabbitmq-prometheus/issues/24#issuecomment-569951271, aggregating metrics would make the most sense.

Rather than returning 24 metrics per queue, as we currently do, we should return 24 metrics for all queues. We have hints that this is the right thing to do in both Prometheus & Grafana: Prometheus uses a lot of CPU & memory to render RabbitMQ Overview for durations that span more than a few days, and Grafana is slow to render it, e.g. rmq-gcp-38-qq - This Month. This is referred to as the high cardinality problem within the wider Prometheus community.
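To illustrate what aggregation means for the response (the metric and label names below are representative examples, not necessarily the plugin's exact output): today every queue contributes its own labelled time series,

rabbitmq_queue_messages_ready{vhost="/",queue="orders-1"} 12
rabbitmq_queue_messages_ready{vhost="/",queue="orders-2"} 7

and so on for every queue, whereas the aggregated form collapses them into a single series per metric:

rabbitmq_queue_messages_ready 19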

The only config option that I think we should introduce is a toggle for metric aggregation. In RabbitMQ 3.8.x, we should keep the current behaviour and give users that have a high number of objects the option to enable metric aggregation. In RabbitMQ 3.9.x, we may want to enable metric aggregation by default, and make it possible to disable it on the fly, without requiring a restart. If we decide to go down this path, there will be:

  • 24 metrics for all queues
  • 18 metrics for all channels
  • 7 metrics for all connections

Currently, for 1k queues, 1k channels & 1k connections, there will be 49k metrics. Aggregating will result in 49 metrics.
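Spelling out the arithmetic behind those numbers:

per_queue, per_channel, per_connection = 24, 18, 7   # metrics per object type
objects_per_type = 1_000                             # 1k queues, channels and connections

unaggregated = (per_queue + per_channel + per_connection) * objects_per_type   # 49,000 series
aggregated = per_queue + per_channel + per_connection                          # 49 series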

What do you think @michaelklishin @dcorbacho?

michaelklishin commented 4 years ago

OK, that sounds like a great long term solution.

dcorbacho commented 4 years ago

@gerhard @michaelklishin agreed, I might start with it