Open Hartigan opened 1 year ago
Thank you for the issue!
Do you have benchmarks that show the bottleneck, or is this a theoretical issue? I would assume that you would run into problems from the huge number of parallel tasks themselves before you would see performance degradation from `size()`. That assumption might be incorrect, though; some benchmarks could confirm or refute it.
> Also method may contain data races between histogram creation and updates of `activeTasks`.

Could you please elaborate on this? In which method, and where, can a race condition arise?
True, I have a service with a huge number of parallel requests, and that load is expected (not an issue). And `ConcurrentLinkedDeque.size()` consumes 8% of the flamegraph -- an actual issue.
I calculated the algorithmic complexity of this class: `DefaultLongTaskTimer` stores tasks in a `ConcurrentLinkedDeque` and provides:

- `start()` -- O(1) to append a task to the deque
- `SampleImpl.stop()` -- O(n) to remove an arbitrary element from the deque
- `takeSnapshot()` -- O(n^2) to iterate over the deque with a `size()` call on each iteration
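To illustrate the quadratic term, here is a minimal sketch (not Micrometer's actual code) of a snapshot-style loop that calls `size()` on every iteration. `ConcurrentLinkedDeque.size()` is not a constant-time operation -- it traverses every node -- so n iterations each paying O(n) gives O(n^2) node visits:

```java
import java.util.concurrent.ConcurrentLinkedDeque;

public class SnapshotSketch {
    public static void main(String[] args) {
        ConcurrentLinkedDeque<Long> activeTasks = new ConcurrentLinkedDeque<>();
        for (long i = 0; i < 10_000; i++) {
            activeTasks.add(i); // start(): O(1) append
        }

        // takeSnapshot()-style loop: size() walks the whole list on each call.
        long traversedNodes = 0;
        for (Long task : activeTasks) {
            traversedNodes += activeTasks.size(); // O(n) per iteration
        }
        // 10_000 iterations * 10_000 nodes visited per size() call
        System.out.println(traversedNodes); // prints 100000000
    }
}
```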
> Could you please elaborate on this? In which method, and where, can a race condition arise?

It's theoretical, but you use `ConcurrentLinkedDeque`. This implies that timers can be added and removed while the histogram is being calculated (calls to `start()`/`SampleImpl.stop()` during a call to `takeSnapshot()`).
Thank you for the analysis. The `DefaultLongTaskTimer`'s `takeSnapshot` implementation does not currently scale well with many active tasks. Out of curiosity, what order of magnitude of active tasks are you expecting at a time? 1,000? 1,000,000? I have some benchmarks locally and will try messing around with improving the efficiency, but if you have a suggestion, feel free to propose it in a pull request.
I created pull request https://github.com/micrometer-metrics/micrometer/pull/4001
If sacrificing `DefaultLongTaskTimer#max()` performance is acceptable, I can replace the `Deque` with a `HashMap` and improve the performance of the `stop` operation. For cases with a huge number of active tasks I see 3-5% of the profile in `ConcurrentLinkedDeque#remove(Object)`, because tasks are not located at the beginning of the deque. Example: 100 long-running requests are active while the timer starts and stops new short requests. `start()` just appends a new timer for each short request, but each `stop()` must iterate past the 100 long tasks to delete the short task.
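A hypothetical sketch of that trade-off (not the actual PR): keep active samples in a `ConcurrentHashMap` keyed by the sample object itself, so `stop()` becomes an O(1) removal, while `max()` degrades to a full scan of the map:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: hash map instead of a deque, so stop() is O(1).
// Class and method names are illustrative, not Micrometer's API.
public class ActiveTaskRegistry {

    // Maps each active sample (stand-in for SampleImpl) to its start time.
    private final ConcurrentHashMap<Object, Long> activeTasks = new ConcurrentHashMap<>();

    public Object start(long startNanos) {
        Object sample = new Object();        // stand-in for SampleImpl
        activeTasks.put(sample, startNanos); // O(1) insert
        return sample;
    }

    public void stop(Object sample) {
        activeTasks.remove(sample);          // O(1), no deque traversal
    }

    // max()-style query now needs a full scan: the sacrificed operation.
    public long oldestStartNanos() {
        long min = Long.MAX_VALUE;
        for (long start : activeTasks.values()) {
            min = Math.min(min, start);
        }
        return min;
    }
}
```

The duration of the longest-running task (what `max()` reports) goes from a cheap peek at the head of an ordered deque to an O(n) scan, which is exactly the performance sacrifice mentioned above.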
Is this planned to be taken up anytime soon? We're also experiencing similar behavior when we let our application run for a long time. We're using Spring Boot and Micrometer with the default configuration, and we scrape the metrics with a Prometheus server once a minute. We notice that the CPU usage of the system goes up continuously; if we increase the scraping frequency, the CPU usage goes up faster. First the Micrometer endpoint starts to respond more slowly, and eventually the entire application is affected.
@fdulger We had the same issue in our system. The problem started after we updated Spring, which now registers its own `DefaultMeterObservationHandler`. We already had one handler registered, so this caused a conflict. Each handler registers a sample in the `onStart` method, but only the last registered sample is properly stopped in `onStop`. The others stay in a 'started' state forever, leading to a memory leak and increased CPU usage, just like what you're seeing.
FYI, since 1.13.0 you can disable `LongTaskTimer` creation: https://github.com/micrometer-metrics/micrometer/issues/4604. If you are using Spring Boot, there is a property for that: `actuator.management.observations.long-task-timer.enabled`.
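In a Spring Boot `application.properties` file that would look like the following (property name taken from the comment above; verify it against the property list documented for your Boot version):

```properties
# Disable LongTaskTimer creation for observations
actuator.management.observations.long-task-timer.enabled=false
```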
Also, I would really look into why some of your `Observation`s are never stopped; it's likely a problem with your instrumentation rather than with Micrometer. Never-stopped `Observation`s or `LongTaskTimer`s are in fact memory leaks, and that is a separate problem from this issue.
Poor performance of `MicrometerCollector.collect()`, which uses `DefaultLongTaskTimer`. `DefaultLongTaskTimer` makes many calls to `ConcurrentLinkedDeque.size()`, each with O(n) complexity, during histogram creation.

Micrometer version: 1.10.3

It might be worth saving the size of the deque, or a full snapshot, at the beginning of `DefaultLongTaskTimer.takeSnapshot()` execution. Also, the method may contain data races between histogram creation and updates of `activeTasks`.
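The simplest form of that suggestion is to hoist the `size()` call out of the iteration, so the snapshot does O(n) total work instead of O(n^2). A sketch under that assumption (not Micrometer's code):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

public class TakeSnapshotSketch {
    public static void main(String[] args) {
        ConcurrentLinkedDeque<Long> activeTasks = new ConcurrentLinkedDeque<>();
        for (long i = 0; i < 1_000; i++) {
            activeTasks.add(i);
        }

        // Read size() once before iterating instead of on every pass.
        int sizeAtStart = activeTasks.size();
        long visited = 0;
        for (Long task : activeTasks) {
            visited++; // per-task histogram work would happen here
        }
        System.out.println(sizeAtStart + " tasks, " + visited + " visited");
    }
}
```

Note that under concurrent `start()`/`stop()` calls the cached size and the actual iteration can disagree, because the deque's iterator is only weakly consistent -- that is precisely the data race flagged in the report.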