Closed bradleyd closed 4 years ago
It can be a matter of endpoint and client timeouts. Simply rendering metrics for, say, 200K entities will take a while to render and compress.
@bradleyd we could use some data about your setup. How can we reproduce? On what kind of hardware and how many nodes? Do you have management plugin enabled as well? How many other stats-emitting entities (e.g. channels and connections) are there? What is the request timeout used by the client (Prometheus or curl)?
Please help us help you.
prometheus.tcp.idle_timeout = 120000
prometheus.tcp.inactivity_timeout = 120000
prometheus.tcp.request_timeout = 120000
would significantly bump all of the embedded HTTP server's per-connection timeouts.
About 22K queues with the above timeouts on a fairly unimpressive machine from 2013 under some load (a tight loop declaring queues) produce a successful response in about 24 seconds. The response contains well over 400K lines, which makes sense given that each queue has over a dozen metrics total.
So much higher timeouts do work (at least with curl, which has fairly high timeouts I can control easily), but we need a way to exclude certain metrics, as the number of lines per response can quickly go into the millions as the number of queues goes into the hundreds of thousands.
@gerhard
See #26.
Instead of querying 169K queues, maybe using a pagination query could help.
Something like:
http://localhost:15672/api/queues?page=1&page_size=100
does it make sense?
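To illustrate the idea, here is a minimal sketch of walking that paginated endpoint with curl. The credentials, host, and the grep-based item count are all assumptions for illustration; the real response is a JSON object with an items array, so a JSON parser such as jq would be more robust:

```shell
# Hypothetical pagination walk over the management API's queue listing,
# 100 queues per request instead of one huge response.
# Assumes default guest:guest credentials on localhost:15672.
page=1
page_size=100
while :; do
  body=$(curl -sf -u guest:guest \
    "http://localhost:15672/api/queues?page=${page}&page_size=${page_size}") || break
  # Rough per-page item count; a JSON parser would be more precise.
  count=$(printf '%s' "$body" | grep -o '"name":' | wc -l)
  echo "page ${page}: ~${count} queues"
  [ "$count" -lt "$page_size" ] && break
  page=$(( page + 1 ))
done
```

Pagination bounds the size of each response, at the cost of more round trips.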
> @bradleyd we could use some data about your setup. How can we reproduce? On what kind of hardware and how many nodes? Do you have management plugin enabled as well? How many other stats-emitting entities (e.g. channels and connections) are there? What is the request timeout used by the client (Prometheus or curl)?
Thanks for the quick response.
We have about 169k queues with, on average, 100K Web STOMP connections. We are running a 5-node RabbitMQ 3.8.2 cluster on AWS/Kubernetes with m5.4xlarge EC2 instances. We pin one RabbitMQ pod per node, so each has full use of the node's resources. We do have the management plugin enabled. We use a 60s Prometheus scrape interval, and stats collection in RabbitMQ is set to 60s as well.
I will be more than happy to provide anything else you need.
FWIW we have a testing env we can help with if you need more testing.
> Instead of querying 169K queues, maybe using a pagination query could help. Something like:
> http://localhost:15672/api/queues?page=1&page_size=100
> does it make sense?
We ended up creating our own exporter that does the same thing, as the other exporters out there would never complete.
Since this plugin only uses node-local ETS, I would expect it to be able to return a few million entries within a few seconds. If that is not the case, we need to optimise.
@bradleyd I would like to take you up on your test env offer. I propose we get together for ~30 minutes over a zoom.us call; pick the time slot that best works for you.
In preparation, please have this Grafana dashboard set up: run make rabbitmq-exporter_vs_rabbitmq-prometheus.json in the root of this repo to get an import-friendly version.
I continued digging into this issue; this is what I found:
First of all, I would like to clarify that even though the 5 node cluster mentioned here has many queues, connections & channels, only the objects running on the node that is being scraped will be returned. In other words, if there are 169k queues in total across the entire cluster, but only 30k running on a specific node, when that node's metrics are requested, only the metrics for those 30k queues will be returned. Same applies to connections & channels.
Let's run a few benchmarks locally, against a local RabbitMQ dev node, step-by-step:
# Start a local dev RabbitMQ node
git clone https://github.com/rabbitmq/rabbitmq-public-umbrella
cd rabbitmq-public-umbrella
make up
cd deps/rabbitmq_prometheus
make run-broker
# Get the RabbitMQ benchmarking tool:
git clone https://github.com/rabbitmq/rabbitmq-perf-test
cd rabbitmq-perf-test
# Create 10k queues, then stop all load - no connections or channels, only queues
make run ARGS="-r 1 -z 2 -qp 'q%d' -qpf 1 -qpt 10000 -ad false"
# GET :15692/metrics returns in ~5.8s & transfers ~12MB
for _ in {1..10}; do curl -s -o /dev/null -w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' 127.0.0.1:15692/metrics; done
200 time_total:5.740556 size_bytes:12321318
200 time_total:5.673923 size_bytes:12321320
200 time_total:6.074273 size_bytes:12321390
200 time_total:5.730781 size_bytes:12321372
200 time_total:5.755640 size_bytes:12321375
200 time_total:5.760754 size_bytes:12321379
200 time_total:5.756736 size_bytes:12321376
200 time_total:5.726672 size_bytes:12321378
200 time_total:5.709239 size_bytes:12321382
200 time_total:5.707628 size_bytes:12321379
# Repeat for 20k, 40k, 80k & 160k queues
This paints the following O(n)-ish picture:

| Queues on the node | /metrics response time | /metrics response size |
| --- | --- | --- |
| 10k | 6s | 12MB |
| 20k | 12s | 24MB |
| 40k | 24s | 48MB |
| 80k | 58s | 96MB |
In summary, GET :15692/metrics for 80k queues takes about a minute to complete and transfers ~100MB. This definitely needs improving, as we want this plugin to return in a timely manner even when there are 100k queues running on a node.
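Assuming the scaling really is linear, a quick back-of-the-envelope sketch (numbers taken from the table above) shows why 160k queues blows past the default 60s timeout:

```shell
# Linear extrapolation from the measured 80k-queue data point (58s).
# 160k queues would take roughly twice as long, i.e. well past the
# default 60s cowboy idle_timeout.
base_queues=80000
base_secs=58
for queues in 100000 160000; do
  secs=$(( base_secs * queues / base_queues ))
  echo "projected /metrics time for ${queues} queues: ~${secs}s"
done
```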
Out of curiosity, I was wondering what happens when there are 80k queues and we use gzip for the response body:
for _ in {1..3}
do
- curl -s -o /dev/null \
+ curl -H "Accept-Encoding: gzip" -s -o /dev/null \
-w '%{http_code} time_total:%{time_total} size_bytes:%{size_download}\n' \
127.0.0.1:15692/metrics
done
- 200 time_total:50.538919 size_bytes:99407634
+ 200 time_total:58.899451 size_bytes:6001394
- 200 time_total:57.976542 size_bytes:99407912
+ 200 time_total:51.974838 size_bytes:6003424
- 200 time_total:52.324881 size_bytes:99411941
+ 200 time_total:58.893436 size_bytes:6003428
Response size goes from 100MB to 6MB 😲 I am wondering if Prometheus is using gzip by default 🤔
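As far as I can tell, Prometheus's scraper does send Accept-Encoding: gzip by default. The big win here comes from the fact that metric names and label keys repeat on every line. A self-contained sketch with synthetic data (the metric name and label values below are illustrative, not real plugin output):

```shell
# Generate 100k Prometheus-style metric lines that differ only in the
# queue number, then compare the raw size against the gzip'd size.
seq 1 100000 \
  | awk '{printf "rabbitmq_queue_messages_ready{queue=\"q%d\",vhost=\"/\"} 0\n", $1}' \
  > /tmp/metrics.txt
raw_bytes=$(wc -c < /tmp/metrics.txt)
gzip -kf /tmp/metrics.txt
gz_bytes=$(wc -c < /tmp/metrics.txt.gz)
echo "raw=${raw_bytes} gzip=${gz_bytes}"
```

The repetitive structure is why the real 100MB body shrinks to ~6MB.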
Lastly, it's worth pointing out that when there are 160k queues, after 60s, there is no response. This matches the default cowboy_http idle_timeout and therefore is the expected behaviour. As @michaelklishin pointed out earlier, this can be adjusted using config properties.
In my opinion, the goal should be to get all metric requests served within 10s, as this enables a polling frequency of 15s, which results in a minimum rate() interval of 60s, as captured here. Increasing the minimum rate interval beyond 60s makes metrics less useful for alerting, since multiple rule violations are required to trigger an alert. This 60s limit gives us alerts that can be set to trigger at a minimum of 2-3 minutes after the actual event, which is a short enough delay to stay informed when a certain aspect of the system is unhealthy.
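For reference, this is roughly what such a scrape configuration would look like in prometheus.yml (the target address is an assumption):

```yaml
scrape_configs:
  - job_name: rabbitmq
    scrape_interval: 15s   # 4 samples per 60s rate() window
    scrape_timeout: 10s    # matches the 10s response-time goal above
    static_configs:
      - targets: ['rabbitmq-host:15692']
```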
> Since this plugin only uses node local ETS, I would expect it to be able to return a few million entries within a few seconds. If that is not the case, we need to optimise.
>
> @bradleyd I would like to take you up on your test env offer. I propose we get together for ~30 minutes over a zoom.us call; pick the time slot that best works for you.
>
> In preparation, please have this Grafana dashboard set up: run make rabbitmq-exporter_vs_rabbitmq-prometheus.json in the root of this repo to get an import-friendly version.
@gerhard Unfortunately I am OOO until Friday. Do you have any slots available that day?
@bradleyd try the time slots link again 👍
@gerhard The calendar link shows nothing for Friday, only Monday. Is this OK?
@bradleyd tomorrow - Friday - won't work, Monday is the first day that I can make it happen.
The upcoming v3.8.3-rc.1 will be based on this alpha package, which has the final fix: https://github.com/rabbitmq/rabbitmq-server-release/wiki/Changes-in-RabbitMQ-3.8.3-alpha.94#rabbitmq-prometheus
It can be consumed as this Docker image: https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq-prometheus/3.8.3-alpha.93-2020.02.12/images/sha256-03e2941aed557f560a4288968cadbbe3a7fc5b11e579f8defd69e8d7f1412849?context=explore
Within a few days of v3.8.3-rc.1 becoming available on GitHub, we expect a rabbitmq image with this tag to appear here: https://hub.docker.com/_/rabbitmq?tags
FWIW, we have decided to aggregate metrics by default, since per-object metrics have a significant impact on nodes with many objects. This means that as soon as you update the Docker image to >=3.8.3, you will get metrics aggregated by default, no extra steps required.
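If you still need per-object metrics after upgrading, my understanding is that 3.8.3 exposes a switch to restore the old behaviour; check the rabbitmq_prometheus docs for your version before relying on it:

```ini
# rabbitmq.conf: opt back in to per-object metrics,
# accepting the larger responses measured above
prometheus.return_per_object_metrics = true
```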
We are using RabbitMQ 3.8.2 with about 169k queues, and when hitting /metrics it times out. Maybe add a way to ignore queue metrics for large setups? I assume the queue metrics are why nothing is returned.
Any thoughts would be appreciated.