redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it will meet your requirements.

Excessive memory use on concurrent scrape requests #215

Open solsson opened 1 year ago

solsson commented 1 year ago

Kminion is a great tool, and I'd like to report performance numbers and possibly receive comments.

We have a 3 broker kafka cluster with ~900 topics and ~2700 consumer groups. Below is scrape duration and container CPU consumption (fraction of 1 core):

[Screenshots: scrape duration and container CPU consumption, 2023-06-22]

The left part shows a single Prometheus instance with a 30s scrape interval. At around 15:14 I started a curl loop, i.e. additional scrape requests back to back, to simulate what happens if we add more Prometheus instances.
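For reference, the same load pattern can be reproduced with any client that scrapes back to back; here is a minimal Go sketch of what the curl loop did, assuming kminion's metrics endpoint is reachable on localhost:8080 (the port is an assumption):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Scrape kminion back to back, like the curl loop: each request starts as
	// soon as the previous one finishes. The endpoint and port are assumptions.
	for {
		start := time.Now()
		resp, err := http.Get("http://localhost:8080/metrics")
		if err != nil {
			log.Printf("scrape failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		io.Copy(io.Discard, resp.Body) // read the full body, like Prometheus would
		resp.Body.Close()
		log.Printf("scrape took %s", time.Since(start))
	}
}
```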

Two questions:

I would have guessed that the cost of aggregating the state depends only on the Kafka contents, while an additional scrape only adds the cost of serializing to the Prometheus text format. Apparently that's not how it works.

weeco commented 1 year ago

Hey @solsson , long time no see 👋 !

I have never really benchmarked KMinion's CPU performance and honestly have never run it against a cluster with so many consumer groups. However, if I understand correctly it's consuming only 200m CPU, is that right? That would sound reasonable to me.

I'm sure the CPU usage heavily depends on the scrape mode too. You can configure it to scrape consumer groups either from the __consumer_offsets topic or by querying the Kafka API. The overhead of adding more Prometheus instances / requests should be relatively low for KMinion itself; however, as you run more requests, KMinion will also send more Kafka API requests to gather all the information it exposes, which may put additional strain on your Kafka cluster.

solsson commented 1 year ago

> Hey @solsson , long time no see 👋 !

That's probably because things are so stable in the streaming part of the infra I'm managing 😄

The reason I started to investigate scrape duration is that we are evaluating autoscaling using KEDA with Prometheus metrics. To increase responsiveness I wanted to reduce the scrape interval, and obviously the interval must be greater than the scrape duration.

For a mission-critical autoscaling stack I have to replicate Prometheus, which means concurrent requests to kminion for the same data.

Surprisingly, an increase in guaranteed CPU from 0.25 to 0.5 core per kminion pod makes no difference to scrape duration. Is that because the bottleneck is the Kafka cluster? I noticed one broker's CPU go up by 10-15% after doubling the requests to kminion.

Also, with a scrape interval of 5 seconds (below the scrape duration) kminion gets OOMKilled even if I raise the memory limit from 250Mi to 500Mi. That could be a backlog building up in kminion.

Edit: regarding broker CPU I realized it was at least a fivefold increase in kminion scrape requests.

solsson commented 1 year ago

Switching the consumer group scrape mode from adminApi to the offsets topic had a significant positive effect (below is CPU).

[Screenshots: kminion container CPU before and after the switch]

It's clear where I did the switch; then at around 20:59 I started the same back-to-back curl loop as before. Its impact is less dramatic now.

I can also exclude topics and consumer groups. That's great, so I don't have to start thinking about caching 😄

solsson commented 1 year ago

@weeco I'm rebranding this as an issue now, because I'm struggling to produce a reliable setup for multiple Prometheus instances. I have yet to look at the code, and I'd be grateful for any hypothesis you can contribute.

The issue is that kminion's memory usage triples under certain circumstances. It's typically well below 200Mi, but it might start crashlooping with OOMKilled at any time.

[Screenshot: kminion memory usage]

My plan was to run 2+ replicas of kminion with guaranteed resources behind a regular Kubernetes service, scaling up when I add more Prometheus instances. I currently have two instances with scrapeInterval 5s and one with 30s (we do want to stress test our stack now, before we start to depend on it). Given the scrape durations above this means, in theory, constant load on both kminion instances.

Design challenges:

Current behavior: with both replicas ready, occasionally one of them gets OOMKilled. Shortly after, the remaining replica, as expected, also starts crashlooping. Stack down.

Preferred behavior: kminion would drop requests that it is unable to handle. Occasional scrape errors are fine, while a crashlooping target isn't.

Even if I were to scale up to more replicas, the condition might still occur under some random load factor and then escalate.

weeco commented 1 year ago

I think I'd actually look at heap snapshots, or analyze where all the allocations are going with some profiling tool. Some increased memory usage is expected, as KMinion has to gather a bunch of information from the cluster for each new active metrics request, aggregate the results, etc., and thus the memory usage scales with the number of topics, consumer groups, and concurrent active requests.

It's likely that I'm doing some expensive, duplicated, or unnecessary allocations, as I've never tuned it to be performant - it has never been an issue for me in the past.

So the first thing I'd do is probably look at the memory usage with pprof.
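For anyone following along: whether kminion already exposes pprof endpoints would need checking, but wiring heap profiling into a Go service is small. A generic sketch (not kminion's actual code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a separate local-only port so that profiling traffic
	// never competes with the /metrics endpoint Prometheus scrapes.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... exporter setup would go here ...
	select {}
}
```

A heap snapshot can then be pulled while the scrape load is running, e.g. with `go tool pprof http://localhost:6060/debug/pprof/heap`, which shows which call sites hold the most live memory.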

solsson commented 1 year ago

There are no parameters to the scrape requests. What do you think about queueing up concurrent requests and delivering the next available response to all of them? I have no clue how that would be done, but it seems like a waste to do the same kind of processing more than once at any given time.

weeco commented 1 year ago

I'm not sure what to think about response caching in kminion, as the Prometheus responses are supposed to be fresh. Some caching is already done under the hood (in KMinion as well as in the Kafka client). I do not plan to spend much time on KMinion in the near future. It seems cheaper to increase the memory limit a little bit, as it should then be stable even with multiple replicas scraping it, no?

solsson commented 1 year ago

I've tried increasing the memory limit to 5x the regular need, but it still gets OOMKilled. I suspect that's because once kminion fails to produce a response, the requests pile up.

I agree that caching is undesirable. What I propose is more like having a single worker: once a response is ready, it is sent to all requests that have been received since the last result.
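Not something kminion does today, but as an illustration of the proposal: Go's golang.org/x/sync/singleflight package coalesces concurrent callers so that the expensive gathering runs at most once at a time and every waiting request gets the same result. A rough sketch; the handler and gatherMetrics names are made up:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/sync/singleflight"
)

// gatherMetrics stands in for the expensive part of a scrape (Kafka API
// requests, aggregation, rendering the Prometheus text format). It is an
// illustrative placeholder, not kminion's real code.
func gatherMetrics() (string, error) {
	return "# HELP ...\n", nil
}

var group singleflight.Group

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	// Requests that arrive while a gather is already in flight block until it
	// finishes and then all receive the same result, so the expensive work
	// runs at most once at any given time.
	v, err, _ := group.Do("metrics", func() (interface{}, error) {
		return gatherMetrics()
	})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprint(w, v.(string))
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

One caveat: callers that arrive while a gather is in flight receive that gather's result, so under constant back-to-back load the data each scraper sees is at most one scrape duration old.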

In the short term I might be able to increase stability with a simple proxy sidecar that accepts at most one HTTP connection at a time.
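A minimal sketch of such a sidecar, assuming kminion listens on localhost:8080 and the sidecar is what gets exposed to Prometheus (both ports are placeholders). Anything beyond one in-flight request is shed with a 503, which also matches the "drop requests it cannot handle" behavior described above:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// kminion is assumed to listen on localhost:8080; the sidecar is exposed
	// on :9090 instead. Both ports are placeholders.
	target, err := url.Parse("http://localhost:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	// A buffered channel of size 1 acts as a semaphore: at most one request
	// is forwarded to kminion at a time; everything else is shed with a 503.
	slot := make(chan struct{}, 1)

	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		select {
		case slot <- struct{}{}:
			defer func() { <-slot }()
			proxy.ServeHTTP(w, r)
		default:
			http.Error(w, "scrape already in progress", http.StatusServiceUnavailable)
		}
	})

	log.Fatal(http.ListenAndServe(":9090", nil))
}
```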