micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Sample.stop causing hangs / loss of metrics #1232

Closed ieure closed 9 months ago

ieure commented 5 years ago

We've got a system which is instrumented with Micrometer and is experiencing some issues.

We're using micrometer 1.1.1, and putting metrics into statsd on localhost, which is then pushing them to Datadog. The program is a Dropwizard HTTP service, but also runs a batch process in a background thread multiple times a day.
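
For reference, the wiring described above boils down to the standard StatsdMeterRegistry setup for the Datadog flavor, added to the global (composite) registry -- a simplified sketch, not our exact code, relying on the default localhost:8125 target:

    import io.micrometer.core.instrument.Clock;
    import io.micrometer.core.instrument.Metrics;
    import io.micrometer.statsd.StatsdConfig;
    import io.micrometer.statsd.StatsdFlavor;
    import io.micrometer.statsd.StatsdMeterRegistry;

    public class MetricsSetup {
        public static StatsdMeterRegistry createRegistry() {
            StatsdConfig config = new StatsdConfig() {
                @Override
                public String get(String key) {
                    return null; // accept the defaults (localhost:8125)
                }

                @Override
                public StatsdFlavor flavor() {
                    return StatsdFlavor.DATADOG; // local statsd agent forwards to Datadog
                }
            };

            StatsdMeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);
            // The global registry is a composite, which matches the CompositeTimer frames below.
            Metrics.addRegistry(registry);
            return registry;
        }
    }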

What we've observed is that the background process grinds to a halt, and the service consumes 100% of a core's CPU. Restarting the service and batch job allows it to complete. The interactive parts of the service continue to respond normally.

We have general metrics around database and HTTP time/rate, but I also put in metrics to diagnose why the batch job is getting wedged -- there's a counter for the number of work items left to complete, the type of item, rate of completion, and so forth.
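
Concretely, those diagnostics follow roughly this shape -- a simplified sketch with illustrative meter names, assuming the remaining-items figure is exposed as a gauge over an atomic value:

    import java.util.concurrent.atomic.AtomicInteger;
    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Tags;

    public class BatchJobMetrics {
        private final AtomicInteger itemsRemaining;
        private final Counter itemsCompleted;

        public BatchJobMetrics(MeterRegistry registry, String itemType) {
            // Remaining work, exposed as a gauge over an atomic value the job decrements.
            this.itemsRemaining = registry.gauge("batch.items.remaining",
                    Tags.of("type", itemType), new AtomicInteger(0));
            // Completion rate, tagged by item type.
            this.itemsCompleted = registry.counter("batch.items.completed", "type", itemType);
        }

        public void start(int totalItems) {
            itemsRemaining.set(totalItems);
        }

        public void itemCompleted() {
            itemsRemaining.decrementAndGet();
            itemsCompleted.increment();
        }
    }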

When it gets wedged, we've seen multiple thread dumps where it appears to be stuck in micrometer code. Here's one from earlier today:

"cron4j::scheduler[105cdac80af9dd344869240500000168e3aee0fc0ddb3009]::executor[105cdac80af9dd347994bc2200000168ecd7eb4178588822]" id=14370 state=RUNNABLE
    at io.micrometer.shaded.reactor.core.publisher.FluxFlatMap$FlatMapMain.tryEmit(FluxFlatMap.java:501)
    at io.micrometer.shaded.reactor.core.publisher.FluxFlatMap$FlatMapInner.onNext(FluxFlatMap.java:943)
    at io.micrometer.shaded.reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:192)
    at io.micrometer.shaded.reactor.core.publisher.UnicastProcessor.drainRegular(UnicastProcessor.java:240)
    at io.micrometer.shaded.reactor.core.publisher.UnicastProcessor.drain(UnicastProcessor.java:312)
    at io.micrometer.shaded.reactor.core.publisher.UnicastProcessor.onNext(UnicastProcessor.java:386)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor.lambda$onNext$0(LogbackMetricsSuppressingUnicastProcessor.java:43)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor$$Lambda$257/409546368.run(Unknown Source)
    at io.micrometer.core.instrument.binder.logging.LogbackMetrics.ignoreMetrics(LogbackMetrics.java:77)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor.onNext(LogbackMetricsSuppressingUnicastProcessor.java:43)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor.onNext(LogbackMetricsSuppressingUnicastProcessor.java:24)
    at io.micrometer.statsd.StatsdTimer.recordNonNegative(StatsdTimer.java:57)
    at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:171)
    at io.micrometer.core.instrument.composite.CompositeTimer.lambda$record$0(CompositeTimer.java:45)
    at io.micrometer.core.instrument.composite.CompositeTimer$$Lambda$246/581318464.accept(Unknown Source)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at io.micrometer.core.instrument.composite.AbstractCompositeMeter.forEachChild(AbstractCompositeMeter.java:47)
    at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:45)
    at io.micrometer.core.instrument.Timer$Sample.stop(Timer.java:254)

And this is one from last week:

"cron4j::scheduler[26174ea747ffdbba0ba0018e00000168a0354e4519fa4ad6]::executor[26174ea747ffdbba6832980f00000168bf237ac1378c26b1]" id=52400 state=RUNNABLE
    at sun.nio.cs.US_ASCII$Encoder.encode(US_ASCII.java:248)
    at java.lang.StringCoding.encode(StringCoding.java:364)
    at java.lang.String.getBytes(String.java:941)
    at io.micrometer.shaded.reactor.netty.NettyOutbound.lambda$sendString$5(NettyOutbound.java:281)
    at io.micrometer.shaded.reactor.netty.NettyOutbound$$Lambda$176/481276971.apply(Unknown Source)
    at io.micrometer.shaded.reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100)
    at io.micrometer.shaded.reactor.core.publisher.FluxOnBackpressureLatest$LatestSubscriber.drain(FluxOnBackpressureLatest.java:176)
    at io.micrometer.shaded.reactor.core.publisher.FluxOnBackpressureLatest$LatestSubscriber.onNext(FluxOnBackpressureLatest.java:127)
    at io.micrometer.shaded.reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114)
    at io.micrometer.shaded.reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.emit(FluxBufferPredicate.java:290)
    at io.micrometer.shaded.reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.onNextNewBuffer(FluxBufferPredicate.java:257)
    at io.micrometer.shaded.reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.tryOnNext(FluxBufferPredicate.java:214)
    at io.micrometer.shaded.reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.onNext(FluxBufferPredicate.java:184)
    at io.micrometer.shaded.reactor.core.publisher.FluxFlatMap$FlatMapMain.tryEmit(FluxFlatMap.java:501)
    at io.micrometer.shaded.reactor.core.publisher.FluxFlatMap$FlatMapInner.onNext(FluxFlatMap.java:943)
    at io.micrometer.shaded.reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:192)
    at io.micrometer.shaded.reactor.core.publisher.UnicastProcessor.drainRegular(UnicastProcessor.java:240)
    at io.micrometer.shaded.reactor.core.publisher.UnicastProcessor.drain(UnicastProcessor.java:312)
    at io.micrometer.shaded.reactor.core.publisher.UnicastProcessor.onNext(UnicastProcessor.java:386)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor.lambda$onNext$0(LogbackMetricsSuppressingUnicastProcessor.java:43)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor$$Lambda$257/784934449.run(Unknown Source)
    at io.micrometer.core.instrument.binder.logging.LogbackMetrics.ignoreMetrics(LogbackMetrics.java:77)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor.onNext(LogbackMetricsSuppressingUnicastProcessor.java:43)
    at io.micrometer.statsd.internal.LogbackMetricsSuppressingUnicastProcessor.onNext(LogbackMetricsSuppressingUnicastProcessor.java:24)
    at io.micrometer.statsd.StatsdTimer.recordNonNegative(StatsdTimer.java:57)
    at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:171)
    at io.micrometer.core.instrument.composite.CompositeTimer.lambda$record$0(CompositeTimer.java:45)
    at io.micrometer.core.instrument.composite.CompositeTimer$$Lambda$246/1590075830.accept(Unknown Source)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at io.micrometer.core.instrument.composite.AbstractCompositeMeter.forEachChild(AbstractCompositeMeter.java:47)
    at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:45)
    at io.micrometer.core.instrument.Timer$Sample.stop(Timer.java:254)

The specific code that was being measured is different -- the first is for a DB query which supports a work item (this metric has been in place for ~6 months, since we moved from Dropwizard Metrics + Graphite to Micrometer/statsd/Datadog) -- while the second is the completion of a work item, which I added ~3 weeks ago. In both cases, the thing being measured has completed, but execution is stuck inside the Sample.stop call at the end.
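
In both cases the measurement follows the usual sample pattern, roughly like this (simplified; names illustrative):

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;

    public class WorkItemTiming {
        private final MeterRegistry registry;

        public WorkItemTiming(MeterRegistry registry) {
            this.registry = registry;
        }

        public void runAndTime(Runnable workItem, String itemType) {
            Timer.Sample sample = Timer.start(registry);
            try {
                workItem.run(); // the measured work itself completes
            } finally {
                // The thread dumps show execution stuck here, inside the statsd publishing path.
                sample.stop(registry.timer("batch.item.duration", "type", itemType));
            }
        }
    }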

Another thing we've noticed is that there's a correlation between the job wedging and missing data from the new metrics around batch job performance. For example, there's a counter which is set to the total number of work items, then decremented when one completes. When the job wedges, none of this data is available in Datadog. Other metrics from the same host do appear to work, but the ones specifically around this job are missing.

I'm looking for some suggestions to track down what's going on here. My hypothesis is that something is preventing metrics from getting into statsd, so they're sitting around in an unbounded queue, which causes Sample.stop to hang.

It seems like it might be related to #462 and #354. There's a suggestion there that "[if] the multi thread is important you should use Queues.unboundedMultiproducer" -- however, the docs say nothing about this, and I'm not sure how I'd make that change (or whether it's likely to help in my situation).
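
For what it's worth, that suggestion seems to refer to Reactor's queue factory, something along these lines -- though in micrometer-registry-statsd the processor and its queue are created internally (and the Reactor classes are shaded), so I don't see a public way to swap it in:

    import java.util.Queue;
    import reactor.core.publisher.UnicastProcessor;
    import reactor.util.concurrent.Queues;

    public class MultiProducerQueueSketch {
        public static UnicastProcessor<Object> create() {
            // A multi-producer-safe unbounded queue, versus the default single-producer one.
            Queue<Object> queue = Queues.unboundedMultiproducer().get();
            return UnicastProcessor.create(queue);
        }
    }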

Do you have a suggestion for further diagnosing what's happening here?

jkschneider commented 5 years ago

Since statsd sends a payload for every event that happens, I would expect Micrometer's statsd code to appear in many thread dumps. But these threads are marked RUNNABLE, which indicates to me that we'd expect them to complete normally.

If the job is exceeding a resource saturation threshold and crashing, I would expect its telemetry to be unavailable along with the rest of the service.

What makes you believe that it is actually stuck in Timer.Sample#stop?

github-actions[bot] commented 9 months ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

github-actions[bot] commented 9 months ago

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.