micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Poor performance of DefaultLongTaskTimer#takeSnapshot #3993

Open Hartigan opened 1 year ago

Hartigan commented 1 year ago

Poor performance of MicrometerCollector.collect(), which uses DefaultLongTaskTimer. During histogram creation, DefaultLongTaskTimer makes a lot of calls to ConcurrentLinkedDeque.size(), which has O(n) complexity.

Micrometer version: 1.10.3

It might be worth caching the size of the deque, or taking a full snapshot of it, at the beginning of DefaultLongTaskTimer.takeSnapshot(). The method may also contain data races between histogram creation and updates of activeTasks.
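
A minimal sketch of the copy-once idea (illustrative only; the class and field names here are simplified stand-ins, not the actual DefaultLongTaskTimer internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative sketch only, not the actual DefaultLongTaskTimer internals:
// copy the active tasks once at the start of the snapshot instead of calling
// ConcurrentLinkedDeque.size() (O(n)) on every iteration.
class SnapshotSketch {

    private final ConcurrentLinkedDeque<Long> activeTaskStartTimes = new ConcurrentLinkedDeque<>();

    double[] takeSnapshot() {
        // One O(n) copy up front; size() and random access on the copy are O(1),
        // and the view cannot change underneath us while the histogram is built.
        List<Long> stableView = new ArrayList<>(activeTaskStartTimes);
        long now = System.nanoTime();
        double[] activeDurations = new double[stableView.size()];
        for (int i = 0; i < stableView.size(); i++) {
            activeDurations[i] = now - stableView.get(i);
        }
        return activeDurations;
    }
}
```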

jonatan-ivanov commented 1 year ago

Thank you for the issue!

Do you have benchmarks that show the bottleneck, or is this a theoretical issue? I would assume that you would run into problems because of the huge number of parallel tasks before you would see performance degradation because of size(). This assumption might be incorrect; some benchmarks could demonstrate whether it holds.

The method may also contain data races between histogram creation and updates of activeTasks.

Could you please elaborate on this? Which method is affected, and where can a race condition arise?

Hartigan commented 1 year ago

True, I have a service with a huge number of parallel requests, and that load is expected (not an issue). ConcurrentLinkedDeque.size() consumes 8% of the flamegraph, which is the actual issue.

[Flamegraph screenshot, 2023-07-25]

I calculated the algorithmic complexity of this class: DefaultLongTaskTimer stores tasks in a ConcurrentLinkedDeque and provides start() in O(1) to append a task to the deque, SampleImpl.stop() in O(n) to remove an arbitrary element from the deque, and takeSnapshot() in O(n^2) because it iterates over the deque and calls size() on each iteration.
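
A hypothetical reduction of that O(n^2) pattern (not the actual DefaultLongTaskTimer source, just the shape of the problem):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical reduction of the O(n^2) pattern described above (not the actual
// DefaultLongTaskTimer source). ConcurrentLinkedDeque.size() traverses every node,
// so evaluating it once per element turns a single pass into n full traversals.
class ComplexitySketch {

    private final ConcurrentLinkedDeque<Long> activeTasks = new ConcurrentLinkedDeque<>();

    double meanActiveDurationNanos() {
        long now = System.nanoTime();
        double sum = 0;
        for (Long startTime : activeTasks) {                        // n iterations
            sum += (now - startTime) / (double) activeTasks.size(); // O(n) size() per iteration
        }
        return sum;
    }
}
```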

Could you please elaborate on this? Which method is affected, and where can a race condition arise?

It's theoretical, but you use ConcurrentLinkedDeque, which implies that timers can be added and removed while the histogram is being calculated (calls to start()/SampleImpl.stop() during a call to takeSnapshot()).

shakuzen commented 1 year ago

Thank you for the analysis. The DefaultLongTaskTimer's takeSnapshot implementation currently does not scale well with many active tasks. Out of curiosity, what order of magnitude of active tasks are you expecting at a time? 1,000? 1,000,000? I have some benchmarks locally and will try improving the efficiency, but if you have a suggestion, feel free to propose it in a pull request.

Hartigan commented 1 year ago

I created a pull request: https://github.com/micrometer-metrics/micrometer/pull/4001

Hartigan commented 1 year ago

If you can sacrifice DefaultLongTaskTimer#max() performance, I can replace the Deque with a HashMap and improve the performance of the stop operation. For cases with a huge number of active tasks, I see 3-5% of the profile in ConcurrentLinkedDeque#remove(Object), because the tasks are not located at the beginning of the deque. Example: there are 100 active long requests while the timer starts/stops new short requests. start() just appends a new timer for each short request, but stop() must iterate over the 100 long tasks every time to delete a short task.
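
A rough sketch of that trade-off (hypothetical, not the actual patch): track active samples in a concurrent map keyed by the sample itself, so stop() no longer has to traverse the deque.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the proposed trade-off (hypothetical, not the actual patch):
// track active samples in a ConcurrentHashMap keyed by the sample itself, making
// stop() O(1) at the cost of max() having to scan all active tasks.
class MapBasedTaskTracker {

    private final Map<Object, Long> activeTasks = new ConcurrentHashMap<>();

    Object start() {
        Object sample = new Object();                // stands in for SampleImpl
        activeTasks.put(sample, System.nanoTime());  // O(1)
        return sample;
    }

    void stop(Object sample) {
        activeTasks.remove(sample);                  // O(1), no deque traversal
    }

    long maxDurationNanos() {
        long now = System.nanoTime();
        // Insertion order is lost, so the longest-running task must be found by scanning: O(n).
        return activeTasks.values().stream().mapToLong(start -> now - start).max().orElse(0L);
    }
}
```

With the current deque-based implementation the oldest sample sits at the head, so max() is effectively O(1); that ordering is what gets traded away here.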

fdulger commented 8 months ago

Is this planned to be taken up anytime soon? We're also experiencing similar behavior when we let our application run for a long time. We're using Spring Boot and Micrometer with the default configuration, and we scrape the metrics with a Prometheus server once a minute. We notice that the CPU usage of the system goes up continuously; if we increase the scraping frequency, the CPU usage goes up faster. First the Micrometer endpoint starts to respond more slowly, and eventually the entire application is affected.

Sipaha commented 2 months ago

@fdulger We had the same issue in our system. The problem started after we updated Spring, which now registers its own DefaultMeterObservationHandler. We already had one handler registered, so this caused a conflict. Each handler registers a sample in its onStart method, but only the sample from the last registered handler is properly stopped in onStop. The others stay in a 'started' state forever, leading to a memory leak and increased CPU usage, just like what you're seeing.
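
A hypothetical illustration of that kind of double registration, using the plain Micrometer API as a stand-in for what Spring Boot's auto-configuration plus a manual registration would produce:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.observation.DefaultMeterObservationHandler;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import io.micrometer.observation.ObservationRegistry;

// Hypothetical illustration of the conflict described above: two
// DefaultMeterObservationHandlers on the same registry mean every Observation
// starts two LongTaskTimer samples, only one of which gets stopped.
class DuplicateHandlerExample {

    public static void main(String[] args) {
        MeterRegistry meterRegistry = new SimpleMeterRegistry();
        ObservationRegistry observationRegistry = ObservationRegistry.create();
        // One handler, e.g. the one Spring Boot auto-configures...
        observationRegistry.observationConfig()
            .observationHandler(new DefaultMeterObservationHandler(meterRegistry));
        // ...plus a second, manually registered one: the duplicated samples from onStart
        // are the ones that can stay "started" forever.
        observationRegistry.observationConfig()
            .observationHandler(new DefaultMeterObservationHandler(meterRegistry));
    }
}
```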

jonatan-ivanov commented 2 months ago

FYI, since 1.13.0 you can disable LongTaskTimer creation: https://github.com/micrometer-metrics/micrometer/issues/4604. If you are using Spring Boot, there is a property for that: management.observations.long-task-timer.enabled.
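
For reference, a programmatic sketch of disabling it with plain Micrometer 1.13+; the IgnoredMeters constructor overload here is an assumption based on the change in #4604, so double-check the exact signature against your version:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.observation.DefaultMeterObservationHandler;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import io.micrometer.observation.ObservationRegistry;

// Sketch only: construct the handler so it skips LongTaskTimer creation.
// The IgnoredMeters overload is assumed from issue #4604 (Micrometer 1.13+);
// verify the exact constructor signature for your version.
class DisableLongTaskTimerExample {

    public static void main(String[] args) {
        MeterRegistry meterRegistry = new SimpleMeterRegistry();
        ObservationRegistry observationRegistry = ObservationRegistry.create();
        observationRegistry.observationConfig()
            .observationHandler(new DefaultMeterObservationHandler(meterRegistry,
                    DefaultMeterObservationHandler.IgnoredMeters.LONG_TASK_TIMER));
    }
}
```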

Also, I would really look into why some of your Observations are never stopped; it's likely a problem with your instrumentation rather than with Micrometer. Not stopping Observations or LongTaskTimers is in fact a memory leak, and it's a separate problem from this issue.