rabbitmq / rabbitmq-prometheus

A minimalistic Prometheus exporter of core RabbitMQ metrics

Option to aggregate channel, queue and connection metrics #28

Closed by dcorbacho 4 years ago

dcorbacho commented 4 years ago

prometheus.return_per_object_metrics = false

Closes #26, see #24 and #25 for the background.

gerhard commented 4 years ago

Picking this one up now.

gerhard commented 4 years ago

Given 1k queues on rmq2, scrape duration is 2-3s. Running rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, enable_metric_aggregation, true).' on rmq2 brings it down to 250ms, similar to what the other nodes are doing:

[Image: Screenshot 2020-01-15 at 18 57 08]

The RabbitMQ-Overview dashboard is not affected by these changes.
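As a rough illustration of why aggregation shrinks the scrape: per-object metrics emit one sample per queue, connection or channel, so the payload grows with the number of objects, while aggregated output collapses them by metric name. The sketch below sums hypothetical per-queue samples into a single value with awk. It only illustrates the idea and is not how the plugin aggregates internally; the sample metric names, labels and the `aggregate_metrics` helper are assumptions, and summing is not the right way to combine every kind of metric.

```shell
# aggregate_metrics: read Prometheus text-format samples on stdin and
# print one summed sample per metric name (labels are dropped).
# Assumes each sample line is "name{labels} value" with no timestamp.
aggregate_metrics() {
  awk '
    /^#/ || NF == 0 { next }          # skip HELP/TYPE comments and blank lines
    {
      name = $0
      sub(/[{ ].*/, "", name)         # keep only the text before "{" or the first space
      sum[name] += $NF                # the last field is the sample value
    }
    END { for (n in sum) print n, sum[n] }
  ' | sort
}

# Hypothetical per-queue samples collapsed into one line.
printf '%s\n' \
  '# TYPE rabbitmq_queue_messages_ready gauge' \
  'rabbitmq_queue_messages_ready{queue="q1"} 3' \
  'rabbitmq_queue_messages_ready{queue="q2"} 4' \
  | aggregate_metrics
```

With 1k queues the per-object form carries roughly 1k lines for each queue metric, whereas the aggregated form carries one, which is consistent with the drop in scrape duration reported above.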

I will test RabbitMQ-Quorum-Queues-Raft tomorrow - I expect a few changes will be needed there.

I will increase the number of queues all the way to 80k and see if this still holds. The last phase is to increase the number of connections & channels to 80k each and see if this optimisation still holds.

gerhard commented 4 years ago

I am picking this one up again, deploying 50k queues, 50k connections & 50k channels.

gerhard commented 4 years ago

Tested on:

gcloud compute instances create-with-container tgir-s01e01-gerhard-rmq1-server \
  --public-dns --boot-disk-type=pd-ssd --labels=namespace=rabbitmq-prometheus-28-gerhard --container-stdin --container-tty \
  --machine-type=n1-standard-32 \
  --create-disk=name=rabbitmq-prometheus-28-gerhard-rmq1-server-persistent,size=200GB,type=pd-ssd,auto-delete=yes \
  --container-mount-disk=name=rabbitmq-prometheus-28-gerhard-rmq1-server-persistent,mount-path=/var/lib/rabbitmq \
  --container-env RABBITMQ_ERLANG_COOKIE=rabbitmq-prometheus-28-gerhard \
  --container-image=pivotalrabbitmq/rabbitmq-prometheus:3.9.0-alpha.203-2020.02.03

with curl -s -o /dev/null -w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' http://34.89.10.130:15692/metrics

50k queues, first without & then with metric aggregation enabled:

200 time_total:60.395880 size_bytes:62322787
200 time_total:1.023527 size_bytes:347759
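The improvement in the two lines above can be worked out directly. This is a minimal sketch, assuming the first line is the run without aggregation (its 62 MB payload suggests per-object metrics) and the second is the run with it; the helper names `field` and `ratio` are made up for illustration.

```shell
# field NAME LINE: print the value of "NAME:value" from one curl -w output line.
field() {
  printf '%s\n' "$2" | tr ' ' '\n' | awk -F: -v k="$1" '$1 == k { print $2 }'
}

# ratio A B: print A / B rounded to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# The two result lines, copied from the output above.
agg_off='200 time_total:60.395880 size_bytes:62322787'
agg_on='200 time_total:1.023527 size_bytes:347759'

echo "time: $(ratio "$(field time_total "$agg_off")" "$(field time_total "$agg_on")")x faster"
echo "size: $(ratio "$(field size_bytes "$agg_off")" "$(field size_bytes "$agg_on")")x smaller"
# prints:
#   time: 59.0x faster
#   size: 179.2x smaller
```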

When I had 50k connections on top of the 50k queues, the metrics request would time out after 60s:

000 time_total:60.105977 size_bytes:0

With metric aggregation enabled, & then also with -H "Accept-Encoding: gzip":

200 time_total:1.705673 size_bytes:348528
200 time_total:1.607329 size_bytes:21323
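To repeat this comparison, the curl invocation used above can be wrapped so the gzip header is optional. A minimal sketch: the `scrape_stats` helper and the `localhost` URL in the usage comment are assumptions, while the curl flags are the ones already shown in this thread. Note that curl is only sent the header here, so `size_download` reports the compressed bytes on the wire (348528 vs 21323 bytes above, roughly 16x smaller).

```shell
# scrape_stats URL [gzip]: print HTTP status, total time and downloaded bytes
# for a single scrape, optionally asking the server for a gzip-encoded body.
scrape_stats() {
  url=$1
  if [ "${2:-}" = "gzip" ]; then
    curl -s -o /dev/null -H 'Accept-Encoding: gzip' \
      -w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' "$url"
  else
    curl -s -o /dev/null \
      -w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' "$url"
  fi
}

# Usage (hypothetical local node):
#   scrape_stats http://localhost:15692/metrics
#   scrape_stats http://localhost:15692/metrics gzip
```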

:shipit:

gerhard commented 4 years ago

@dcorbacho can we pair up on this tomorrow? https://github.com/rabbitmq/rabbitmq-prometheus/commit/5caa4198b17099d87df5e7ce5faa0b8ae6edd42d

gerhard commented 4 years ago

FWIW, https://github.com/rabbitmq/rabbitmq-prometheus/commit/378da2f7c32a03712e5f6b2181e102bce3c402a3 enables metrics aggregation by default. The reasoning is captured in the README. This is the follow-up commit https://github.com/rabbitmq/rabbitmq-prometheus/commit/8b0c7c4f4e01ad0d7d2f39ec478add059da5c112.