micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.45k stars 980 forks source link

MeterRegistry.meterMapLock contention with meters with high dimensionality #4629

Open DanielThomas opened 7 months ago

DanielThomas commented 7 months ago

This very coarse lock on the ConcurrentMap meterMap can lead to excessive blocking in high RPS environment that create meters at a high rate, due to high dimensionality in their tags:

https://github.com/micrometer-metrics/micrometer/blob/d6fc0b8fdcb10bd90707855b67c8d27adf83bee4/micrometer-core/src/main/java/io/micrometer/core/instrument/MeterRegistry.java#L608-L639

From a 30 second profile of one of our applications, at around 200 rps:

gusto-dgs-error-metric

Environment

Confirmed this is an issue at current mainline.

To Reproduce

Concurrently create meters with unique Ids. Note the monitor contention.

Expected behavior

The MeterRegistry should optimize for highly concurrent environments, and avoid coarse locking, particularly on the meter creation hot path. ConcurrentHashMap.computeIfAbsent is atomic, and at a glance, the need for syntheticAssociations isn't clear to me, as it looks to me that the key on the meterMap will have the association, so the locking shouldn't be necessary there either?

DanielThomas commented 7 months ago

Experimenting with this locally, this isn't as straight forward as it first appears, because synthetic meters are recursively registered:

    java.lang.IllegalStateException: Recursive update
        at io.micrometer.core.instrument.MeterRegistry.getOrCreateMeter(MeterRegistry.java:609)
        at io.micrometer.core.instrument.MeterRegistry.registerMeterIfNecessary(MeterRegistry.java:585)
        at io.micrometer.core.instrument.MeterRegistry.registerMeterIfNecessary(MeterRegistry.java:578)
        at io.micrometer.core.instrument.MeterRegistry.gauge(MeterRegistry.java:319)
        at io.micrometer.core.instrument.Gauge$Builder.register(Gauge.java:195)
        at io.micrometer.core.instrument.distribution.HistogramGauges.<init>(HistogramGauges.java:141)
        at io.micrometer.core.instrument.distribution.HistogramGauges.register(HistogramGauges.java:89)
        at io.micrometer.core.instrument.distribution.HistogramGauges.getHistogramGauges(HistogramGauges.java:53)
        at io.micrometer.core.instrument.distribution.HistogramGauges.registerWithCommonFormat(HistogramGauges.java:48)
        at io.micrometer.core.instrument.step.StepMeterRegistry.newTimer(StepMeterRegistry.java:83)
        at io.micrometer.core.instrument.MeterRegistry.lambda$timer$5(MeterRegistry.java:333)
        at io.micrometer.core.instrument.MeterRegistry.lambda$getOrCreateMeter$10(MeterRegistry.java:624)
        at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1708)
        at io.micrometer.core.instrument.MeterRegistry.getOrCreateMeter(MeterRegistry.java:609)
        at io.micrometer.core.instrument.MeterRegistry.registerMeterIfNecessary(MeterRegistry.java:585)
        at io.micrometer.core.instrument.MeterRegistry.timer(MeterRegistry.java:331)
        at io.micrometer.core.instrument.Timer$Builder.register(Timer.java:459)
        at io.micrometer.core.instrument.Timer$Builder.register(Timer.java:453)

So these would have to be lifted out to avoid being recursive, and ideally, also avoid a separate locks for synthetics.

DanielThomas commented 7 months ago

While investigating https://github.com/micrometer-metrics/micrometer/issues/4747 I looked closer at the excessive dimensionality of the metrics for this class of service. It turns out that https://github.com/Netflix/dgs-framework/ tags internally can lead to very high numbers of distinct meters, and due to the lack of expiry, essentially cause a memory leak:

gql.resolver: 39829
dgs.req.passport: 78941
gql.query: 6764525

Those dimensions are excessive even without this monitor to deal with, but lock-free atomic registration and removal of meters by key (and expiry!) would close the gap between plain Spectator and Micrometer significantly.

shakuzen commented 7 months ago

Thanks for taking a look at this. I think there are improvements that can be made here but it's going to take some care due to the very central nature of this code. We've tended toward being overly cautious about correctness over performance. The last relevant change in this code was probably the move off of PCollections to standard Java collections in https://github.com/micrometer-metrics/micrometer/commit/47912661796ff9ae30d9f3326c5921b65bddde5d. I'll have to spend some more time looking at the synthetic meters implementation, and maybe we can add some jcstress tests for the specific scenarios about which we want to guarantee correctness.