micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.45k stars 980 forks

Pause Detection contributes to count when it should just contribute to total time. #844

Open koa opened 6 years ago

koa commented 6 years ago

Situation:

Symptoms:

Workaround: Switch off Pause Detection

@Bean
public MeterRegistryCustomizer<PrometheusMeterRegistry> registryConfig() {
    return r -> r.config().pauseDetector(new NoPauseDetector());
}
jkschneider commented 6 years ago

I'll state my belief that the amount of work that the pause detector does never changes, that it has a fixed CPU overhead governed by its sleep and pause intervals. Maybe you can demonstrate that I'm wrong in this belief ;)

CPU overhead on pause detection should be inversely proportional to the sleep and pause intervals on the detector. By default they are both 100ms. Could you try setting them higher and re-observe?

For example:

registry.config().pauseDetector(new ClockDriftPauseDetector(Duration.ofSeconds(1), Duration.ofSeconds(1)));
koa commented 6 years ago

Interesting. The CPU waste seems to be reduced, but there is still too much of an increase in the *_seconds_count counter. Before (with default settings) the increment was too high by a factor of 5k; with your settings it is only 5% too high.

jkschneider commented 6 years ago

Pause detection doesn't affect *_seconds_count; it only contributes latency to *_seconds_sum. Could you make a reproducible example?

CPU consumption is dependent on the hardware, so there is no one value we can choose that always costs 1% or something like that. We've just picked a default that is good for most modern infrastructure.

koa commented 6 years ago

How does it avoid incrementing _seconds_count? The stack trace above shows a call to record() on the timer (coming from the PauseDetector). Every call of this method increments _seconds_count by 1 (PrometheusTimer.java:63).
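The mechanics koa describes can be sketched with a simplified toy model. This is illustrative only, not Micrometer's actual implementation; the method names echo AbstractTimer, but the bodies are assumptions about the LatencyUtils-style compensation loop:

```java
// Toy model of how pause compensation inflates count: each synthetic sample
// that back-fills the pause goes through record(), which bumps the count.
public class CompensationModel {
    long count;          // what *_seconds_count exports
    long totalTimeNanos; // what *_seconds_sum exports

    void record(long nanos) {
        count++;
        totalTimeNanos += nanos;
    }

    // Back-fill one synthetic sample per expected interval of the pause.
    void recordValueWithExpectedInterval(long pauseNanos, long expectedIntervalNanos) {
        if (expectedIntervalNanos <= 0) return;
        for (long missed = pauseNanos - expectedIntervalNanos; missed > 0; missed -= expectedIntervalNanos) {
            record(missed); // every synthetic sample also increments count
        }
    }

    public static void main(String[] args) {
        CompensationModel timer = new CompensationModel();
        timer.record(1_000_000); // one real 1ms request
        // a 10ms pause with a 1ms expected interval adds 9 synthetic samples
        timer.recordValueWithExpectedInterval(10_000_000, 1_000_000);
        System.out.println(timer.count); // prints 10, though only 1 real request was recorded
    }
}
```

The point is that the compensation path and the user-facing record() path share the same counter, so a long pause over a short expected interval manufactures a huge number of synthetic "requests".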

jkschneider commented 6 years ago

Thanks. I think I finally understand. So if it's OK I'll consider the CPU discussion over, since that seems to be working as intended. This issue then becomes about ensuring that pause detection contributes to total time and not throughput.

koa commented 6 years ago

Both symptoms are a bigger problem than a minimal jitter on measured durations (in my situation).

jkschneider commented 6 years ago

@koa Of course you are free to turn off pause detection if you'd like. I wouldn't underestimate how significant that jitter can be, though. A buddy at Netflix reported a 12-second VM pause in EC2 the other day.

shakuzen commented 5 years ago

Just reporting that we were also bitten by this bug on an application having long pauses, where our dashboards showed a spike in traffic of 800,000+ requests per second on an endpoint that really had 1 request per minute.

jkschneider commented 5 years ago

Wish we could have gotten it fixed in time for 1.1.0, but it is not trivial. Will have to put it in a patch release :/

mweirauch commented 5 years ago

When looking into this, I think we should review some Timer usage in JvmGcMetrics (e.g. jvm.gc.pause), where Timers are used for convenience, I assume. I am not quite sure whether these should be enriched with pauses from the PauseDetector. (Digging into it a bit, these were at least part of the triggers, but it's beyond my knowledge whether LatencyUtils pause detection should or should not add to these timings, which we derive from JMX.)

@shakuzen did you observe some crazy jvm.gc.pause numbers during that "spike", too?

shakuzen commented 5 years ago

did you observe some crazy jvm.gc.pause numbers during that "spike", too?

@mweirauch Yes. The jvm.gc.pause rate also has a 800,000+ GCs per second spike on that instance.

jkschneider commented 5 years ago

@mweirauch Ouch, we don't currently have a way to disable pause detection on a single timer.

lixiang4543 commented 5 years ago

I also encountered this problem. How did you solve it?

mweirauch commented 5 years ago

@lixiang4543 It's currently not solved. You can mitigate this by setting a NoPauseDetector. See the first post.

lixiang4543 commented 5 years ago

@mweirauch How many versions does this problem affect? My version of Spring Boot is 2.0.2; there is no PrometheusMeterRegistry class.

lixiang4543 commented 5 years ago

How did this problem arise?

tobbco commented 5 years ago

How to reproduce:

java.lang.Thread.State: RUNNABLE
    at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:176)
    at io.micrometer.core.instrument.AbstractTimer.recordValueWithExpectedInterval(AbstractTimer.java:135)
    at io.micrometer.core.instrument.AbstractTimer.lambda$initPauseDetector$1(AbstractTimer.java:121)
    at io.micrometer.core.instrument.AbstractTimer$$Lambda$401/1892795103.handlePauseEvent(Unknown Source)
    at org.LatencyUtils.PauseDetector$PauseDetectorThread.run(PauseDetector.java:107)

595979101 commented 5 years ago

I used Spring Boot 2.1, which has Micrometer 1.1, but it still has this problem. *_seconds_count increases crazily, up to a billion. That's impossible. Now what do I do? Do we need to write our own set of rules to compute it and then override it?

jkschneider commented 5 years ago

@595979101 You can turn off pause detection altogether for now:

registry.config().pauseDetector(new NoPauseDetector());
595979101 commented 5 years ago

@jkschneider thank you very much.

595979101 commented 5 years ago

@jkschneider I'm sorry to bother you. I set NoPauseDetector, but *_service_count is still too high, and it happened after a pause. This phenomenon happens from time to time. [screenshots]

jkschneider commented 5 years ago

@595979101 How on earth does service_seconds_sum go negative?

595979101 commented 5 years ago

@jkschneider I don't know. But there was a period of time before this happened when a pause occurred. I found a daemon thread, org.LatencyUtils.SimplePauseDetector. If I kill it, do you think there will be a problem?

jkschneider commented 5 years ago

I don't see how such a daemon thread would exist if you were using NoPauseDetector everywhere. Perhaps you are using it in some places and not others?

shakuzen commented 5 years ago

We will switch the default to the no-op NoPauseDetector in 1.0.10, 1.1.4, and 1.2.0 to work around this issue (see #1324). Hopefully in the future we will resolve this and restore the functionality lost by disabling it.
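For anyone who relies on pause compensation and wants it back once the no-op default lands, it can be re-enabled explicitly, as earlier comments in this thread show. A sketch (the 100ms values match the old defaults; tune to taste):

```java
// Opt back in to pause detection after the default becomes NoPauseDetector.
// Arguments are sleep interval and pause threshold.
registry.config().pauseDetector(
        new ClockDriftPauseDetector(Duration.ofMillis(100), Duration.ofMillis(100)));
```

Note this re-enables the count inflation described in this issue until it is properly fixed.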

addozhang commented 5 years ago

I reproduced this issue in debug mode on local machine.

This happens when a pause occurs in the system. The PauseDetector detects the pause and compensates for the latency by invoking AbstractTimer#recordValueWithExpectedInterval(). It compensates time together with count.

When a long pause occurs, execution falls into the loop highlighted below. The looped operation causes the estimatedInterval evaluated by TimeCappedMovingAverageIntervalEstimator to decrease, because in the #record method the estimator records each compensation sample's interval as if it were a normal record.

During one captured latency compensation, nanoValue and expectedIntervalBetweenValueSamples were 6301897087 and 193. The first is much larger than the second, so the loop will run 6301897087 / 193 ≈ 32.7 million times.

[screenshots]

The benchmark test:

  1. jmeter: 2 threads and 20k requests in total.

All goes well: the recorded count is the same as the request total. [screenshots]

  2. At SimplePauseDetector#L124, add a 300ms stop.

The recorded count is far more than the actual request count, and it is still increasing even with no requests coming in. [screenshots]

Many minutes after jmeter shutdown: [screenshot]

addozhang commented 5 years ago

Continuing the previous comment:

reproduce

I reproduced the case in a Docker container by triggering a full GC via heap dump every second; the count kept increasing even with no incoming requests. [screenshots]

workaround solution

A customized PrometheusMeterRegistry that creates a customized Timer which compensates time only when a pause occurs, without touching the count. [screenshots]
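The workaround screenshots are lost, so here is a self-contained sketch of the idea only (not addozhang's actual code): fold a detected pause into total time as a single lump instead of manufacturing synthetic samples, so the count stays honest.

```java
// Toy timer illustrating the workaround's intent: pause compensation
// contributes to total time only, never to the sample count.
public class TimeOnlyCompensatingTimer {
    long count;
    long totalTimeNanos;

    void record(long nanos) {
        count++;
        totalTimeNanos += nanos;
    }

    // On a detected pause, add the whole pause to total time in one shot.
    void onPauseDetected(long pauseNanos) {
        totalTimeNanos += pauseNanos;
    }

    public static void main(String[] args) {
        TimeOnlyCompensatingTimer timer = new TimeOnlyCompensatingTimer();
        timer.record(5_000_000);            // one real 5ms request
        timer.onPauseDetected(300_000_000); // a 300ms pause
        System.out.println(timer.count);    // prints 1: count unchanged by the pause
    }
}
```

The trade-off is that percentile/histogram data no longer sees the pause, which is part of why a proper fix in Micrometer itself is not trivial.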

markmaxey commented 3 years ago

What is the status of this issue? I spent almost 2 days trying to figure out why my counts were off only to discover there was a background pause detector thread mutating the counts seemingly randomly.

I have a very simple way of reproducing the problem. Below is a parameterized JUnit test recording a Timer duration 10,000,000 times in another (single) thread. The one with the pause detector fails while the one without passes. Another test (not shown below) where the Timer is recorded in the same "main" thread without using another thread / executor actually passes. The only time it fails is when using pause detection in a multi-threaded context. No amount of synchronization helps because the pause detector runs in the background mutating the timer.

package test;

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.pause.NoPauseDetector;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import static junit.framework.TestCase.assertEquals;

@RunWith(Parameterized.class)
public class TestTimerCount {
    private final long LOOP_TOTAL = 10000000;
    private final boolean usePauseDetector;

    public TestTimerCount(final boolean usePauseDetector) { this.usePauseDetector = usePauseDetector; }

    @Parameterized.Parameters(name = "{index}: test timer with usePauseDetector({0})")
    public static Collection<Boolean> data() { return Arrays.asList(true, false); }

    class Runner implements Runnable {
        final Timer cut;

        Runner() {
            Timer.Builder builder = Timer.builder(Boolean.toString(usePauseDetector));
            if (!usePauseDetector) builder.pauseDetector(new NoPauseDetector());
            cut = builder.register(new SimpleMeterRegistry());
        }

        @Override public void run() { cut.record(1, TimeUnit.MILLISECONDS); }
    }

    @Test public void test() {
        final ExecutorService executorService = Executors.newSingleThreadExecutor();
        final Runner runner = new Runner();

        for(int ndx = 0; ndx < LOOP_TOTAL; ndx++) executorService.submit(runner);

        try {
            executorService.shutdown();
            executorService.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {}

        assertEquals(LOOP_TOTAL, runner.cut.count());
    }
}
shakuzen commented 3 years ago

What is the status of this issue?

See https://github.com/micrometer-metrics/micrometer/issues/844#issuecomment-476495955. It should not be an issue for you if you are on a supported version of Micrometer and are not explicitly configuring a pause detector.

markmaxey commented 3 years ago

@shakuzen - Agreed. That's what my example above shows. From my newbie perspective,

  1. I believe most people would expect the count should represent the number of samples collected by calling the Timer API. Letting the pause detector (or anything else) add additional samples causes confusion to users looking at the counts and rates. In my case, the counts were up to 3 times more than the actual count. This is not a small unnoticeable skew.
  2. At the very least, the "micrometer concepts" documentation on timer pause detectors should state that using a pause detector mutates the count as well as the values. It may be obvious once you read up on how the pause detector works or once you look at the code, but it isn't obvious from the documentation alone.
  3. I think my biggest concern is abated by #1324.

Thanks!