Open koa opened 6 years ago
I'll state my belief that the amount of work that the pause detector does never changes, that it has a fixed CPU overhead governed by its sleep and pause intervals. Maybe you can demonstrate that I'm wrong in this belief ;)
CPU overhead on pause detection should be inversely proportional to sleep and pause intervals on the detector. By default they are both 100ms. Could you try setting them higher and reobserve?
For example:
registry.config().pauseDetector(new ClockDriftPauseDetector(Duration.ofSeconds(1), Duration.ofSeconds(1))
Interesting. The CPU waste seems to be reduced, but there are still too much increase of *_seconds_count counter. Before (with default settings) the increment was too high by factor of 5k, With your settings there is only 5%.
Pause detection doesn't effect *_seconds_count
, only contributes latency to the *_seconds_sum
. Could you make a reproducible example?
CPU consumption is dependent on the hardware, so there is no one value we can choose to always use 1% or something like that. We've just picked a default that is good for most modern infrastructure.
How does it avoid the increment of _seconds_count? The stacktrace above shows a call of record() on timer (out of PauseDetector). Every call of this method increments _second_count by 1 (PrometheusTimer.java:63).
Thanks. I think I finally understand. So if it's OK I'll consider the CPU discussion over, since that seems to be working as intended. This issue then becomes about ensuring that pause detection contributes to total time and not throughput.
Both symptoms are a bigger problem than a minimal jitter on measured durations (in my situation).
@koa Of course you are free to turn off pause detection if you'd like. I wouldn't understate how significant that jitter can be though. Buddy at Netflix reported a 12 second VM pause in EC2 the other day.
Just reporting that we were also bitten by this bug on an application having long pauses, where our dashboards showed a spike in traffic of 800,000+ requests per second on an endpoint that really had 1 request per minute.
Wish we could have gotten it fixed in time for 1.1.0, but it is not trivial. Will have to put it in a patch release :/
When looking into this, I think we should review some Timer
usage in JvmGcMetrics
(e.g. jvm.gc.pause
) which are used there for convenience I assume. I am not quite sure if these should be enriched with pauses from the PauseDetector
. (When digging into it a bit, these were at least part of triggers and I am not sure they should, but it's beyond my knowledge if LatencyUtils pause detection should or should not add to these timings which we derive from JMX.)
@shakuzen did you observe some crazy jvm.gc.pause
numbers during that "spike", too?
did you observe some crazy jvm.gc.pause numbers during that "spike", too?
@mweirauch Yes. The jvm.gc.pause
rate also has a 800,000+ GCs per second spike on that instance.
@mweirauch Ouch, we don't currently have a way to disable pause detection on a single timer.
I also encountered this problem. How did you solve it?
@lixiang4543 It's currently not solved. You can mitigate this by setting a NoopPauseDetector
. See first post.
@mweirauch How many versions does this problem affect? My version of springboot is: 2.0.2; There is no PrometheusMeterRegistry class。
How did this problem arise?
How to produce : java.lang.Thread.State: RUNNABLE at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:176) at io.micrometer.core.instrument.AbstractTimer.recordValueWithExpectedInterval(AbstractTimer.java:135) at io.micrometer.core.instrument.AbstractTimer.lambda$initPauseDetector$1(AbstractTimer.java:121) at io.micrometer.core.instrument.AbstractTimer$$Lambda$401/1892795103.handlePauseEvent(Unknown Source) at org.LatencyUtils.PauseDetector$PauseDetectorThread.run(PauseDetector.java:107)
I used springboot2.1, which was version 1.1 but still had this problem.*_seconds_count increases crazily to a billion.It's impossible.Now what do I do? We need to write our own set of rules to compute it and then override it, right?
@595979101 You can turn off pause detection altogether for now:
registry.config().pauseDetector(new NoPauseDetector());
@jkschneider thank you very much.
@jkschneider I'm sorry to bother you.I set NoPauseDetector.But *_service_count is still too high.And it happened after the pause.This phenomenon happens from time to time.
@595979101 How on earth does service_seconds_sum
go negative?
@jkschneider I don't know.But there was a period of time before this happened when a pause occurred.I found a daemon thread org. LatencyUtils. SimplePauseDetector.If I kill it.Do you think there will be a problem?
I don't see how such a daemon thread would exist if you were using NoPauseDetector
everywhere. Perhaps you are using it in some places and not others?
We will switch the default to the no-op NoPauseDetector
in 1.0.10, 1.1.4, and 1.2.0 to workaround this issue (see #1324). Hopefully in the future we will resolve this and restore the functionality lost by disabling this.
I reproduced this issue in debug mode on local machine.
This should comes when pause occurs in system. The PauseDetector
detected pause and compensation latency via invoking AbstractTimer#recordValueWithExpectedInterval()
. It compensations time
together with count
.
When long pause occurs, it falls in loop highlighted below. The looped operation will cause decreasing of estimatedInterval
evaluated by TimeCappedMovingAverageIntervalEstimator
, because in #record
method the estimator will record interval as normal record.
During one captured latency compensation, nanoValue
andexpectedIntervalBetweenValueSamples
are 6301897087
and 193
.
The 1st one is much larger than 2nd one. So the loop will run 6301897087 / 193
times.
The benchmark test:
all goes well. The request count is same as request total.
SimplePauseDetector#L124
add a 300ms
stopRequest account is far more that request count. It's still increasing event no request coming.
Lots of mins after jmeter shutdown.
Continue previous comment:
Reproduce case in docker container by trigger full gc via heap dump per seconds, the kept increasing high even no incoming request.
Customized PrometheusMeterRegistery
which creates customized timer which compensate time only once pause occurred.
What is the status of this issue? I spent almost 2 days trying to figure out why my counts were off only to discover there was a background pause detector thread mutating the counts seemingly randomly.
I have a very simple way of reproducing the problem. Below is a parameterized JUnit test recording a Timer duration 10,000,000 times in another (single) thread. The one with the pause detector fails while the one without passes. Another test (not shown below) where the Timer is recorded in the same "main" thread without using another thread / executor actually passes. The only time it fails is when using pause detection in a multi-threaded context. No amount of synchronization helps because the pause detector runs in the background mutating the timer.
package test;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.pause.NoPauseDetector;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import static junit.framework.TestCase.assertEquals;
@RunWith(Parameterized.class)
public class TestTimerCount {
private final long LOOP_TOTAL = 10000000;
private final boolean usePauseDetector;
public TestTimerCount(final boolean usePauseDetector) { this.usePauseDetector = usePauseDetector; }
@Parameterized.Parameters(name = "{index}: test timer with usePauseDetector({0})")
public static Collection<Boolean> data() { return Arrays.asList(true, false); }
class Runner implements Runnable {
final Timer cut;
Runner() {
Timer.Builder builder = Timer.builder(Boolean.toString(usePauseDetector));
if (!usePauseDetector) builder.pauseDetector(new NoPauseDetector());
cut = builder.register(new SimpleMeterRegistry());
}
@Override public void run() { cut.record(1, TimeUnit.MILLISECONDS); }
}
@Test public void test() {
final ExecutorService executorService = Executors.newSingleThreadExecutor();
final Runner runner = new Runner();
for(int ndx = 0; ndx < LOOP_TOTAL; ndx++) executorService.submit(runner);
try {
executorService.shutdown();
executorService.awaitTermination(30, TimeUnit.SECONDS);
} catch (InterruptedException ex) {}
assertEquals(LOOP_TOTAL, runner.cut.count());
}
}
What is the status of this issue?
See https://github.com/micrometer-metrics/micrometer/issues/844#issuecomment-476495955. It should not be an issue for you if you are on a supported version of Micrometer and are not explicitly configuring a pause detector.
@shakuzen - Agreed. That's what my example above shows. From my newbie perspective,
Thanks!
Situation:
Symptoms:
Workaround: Switch off Pause Detection