Open gdabski opened 1 year ago
I just ran into this issue.
I think I found the root cause: it is the way durationBetweenRotatesMillis is initialized. For some reason, the expiry time from the configuration is divided there by bufferLength (the ageBuckets variable), unlike in TimeWindowMax, where the expiry time is taken directly. This means that the effective expiry time (when the metric is reset to 0 after a single request) is expiry * bufferLength for the max metric, but only expiry for percentiles.
AFAIK the documentation does not reflect this difference, and I find it confusing, so I'm assuming it is a bug.
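To make the arithmetic concrete, here is a minimal sketch of the two interpretations as described in this comment. It is not Micrometer code: the class and method names are hypothetical, and it assumes that a sample stays visible for at most one full pass over the ring buffer.

```java
// Hypothetical illustration only; none of these names exist in Micrometer.
final class ExpiryArithmetic {

    // Histogram-style interpretation: the configured expiry is divided by bufferLength.
    static long histogramRotateMillis(long expiryMillis, int bufferLength) {
        return expiryMillis / bufferLength;
    }

    // Max-style interpretation: the configured expiry is used directly as the rotation interval.
    static long maxRotateMillis(long expiryMillis) {
        return expiryMillis;
    }

    // Assumption: a sample is forgotten after the buffer has rotated bufferLength times.
    static long effectiveExpiryMillis(long rotateMillis, int bufferLength) {
        return rotateMillis * bufferLength;
    }

    public static void main(String[] args) {
        long expiryMillis = 60_000L; // default expiry of 1 minute, as stated in this thread
        int bufferLength = 3;        // default bufferLength, as stated in this thread

        long histogramRotate = histogramRotateMillis(expiryMillis, bufferLength);
        long maxRotate = maxRotateMillis(expiryMillis);

        System.out.println("percentiles: rotate every " + histogramRotate + " ms, effective expiry "
                + effectiveExpiryMillis(histogramRotate, bufferLength) + " ms");
        System.out.println("max:         rotate every " + maxRotate + " ms, effective expiry "
                + effectiveExpiryMillis(maxRotate, bufferLength) + " ms");
    }
}
```

With the defaults this gives 60000 ms for percentiles and 180000 ms for max, which matches the one-minute versus three-minute difference reported in this issue.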
For the default configuration, where expiry/step is set to 1 minute and bufferLength is set to 3, and assuming metrics are scraped every minute, in the unfortunate case where scraping occurs immediately after the metrics are rotated, requests from a 20-second window out of each minute (33% of the data) are not taken into account in the percentile metrics at all...
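The 20-second figure can be derived as follows. This is a rough sketch under an assumption of mine about the ring buffer: immediately after a rotation, the bucket being read holds only the last bufferLength - 1 rotation intervals of data. The names are hypothetical and are not taken from Micrometer.

```java
// Hypothetical illustration only; none of these names exist in Micrometer.
final class ScrapeCoverage {

    // Assumption: right after a rotation, the readable bucket spans (bufferLength - 1) intervals.
    static long coveredMillisRightAfterRotation(long expiryMillis, int bufferLength) {
        long rotateMillis = expiryMillis / bufferLength;
        return rotateMillis * (bufferLength - 1);
    }

    // Part of one scrape interval that the percentiles never see in that worst case.
    static long missedMillis(long scrapeIntervalMillis, long expiryMillis, int bufferLength) {
        long covered = coveredMillisRightAfterRotation(expiryMillis, bufferLength);
        return Math.max(0L, scrapeIntervalMillis - covered);
    }

    public static void main(String[] args) {
        long scrapeIntervalMillis = 60_000L; // metrics scraped every minute
        long expiryMillis = 60_000L;         // default expiry, as stated in this thread
        int bufferLength = 3;                // default bufferLength, as stated in this thread

        long missed = missedMillis(scrapeIntervalMillis, expiryMillis, bufferLength);
        long percent = 100L * missed / scrapeIntervalMillis; // integer division, rounds down

        System.out.println("missed " + missed + " ms per scrape interval (" + percent + "%)");
    }
}
```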
I'm planning to create a PR changing the durationBetweenRotatesMillis initialization.
Describe the bug A single recorded time (sample) affects a Timer's max() for a longer time than the percentiles produced by the Timer. With the default DistributionStatisticConfig, the effective expiry is one minute for the percentiles and three minutes for max. The issue is related to how TimeWindowMax and AbstractTimeWindowHistogram interpret the value of DistributionStatisticConfig.expiry. Both use ring buffers of the same size to implement the decay, but while the former only moves by one buffer position in intervals equal to expiry, the latter is implemented to do a full rotation of the buffer in the same time.
Environment
To Reproduce
Prints:
Expected behaviour A single sample ceases to affect percentiles and timer max at the same point in time.
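The mismatch described under "Describe the bug" can be illustrated with a small stand-alone simulation. It is not the missing reproduction and does not use Micrometer. RingMax is a hypothetical, simplified ring buffer driven by an explicit clock. It assumes every sample is written to all buckets and that each rotation clears the current bucket before advancing.

```java
// Hypothetical simulation only; RingMax is not a Micrometer class.
final class DecaySketch {

    static final class RingMax {
        private final long[] buckets;
        private final long rotateIntervalMillis;
        private int current = 0;
        private long lastRotateMillis = 0L;

        RingMax(int bufferLength, long rotateIntervalMillis) {
            this.buckets = new long[bufferLength];
            this.rotateIntervalMillis = rotateIntervalMillis;
        }

        // Assumption: a sample is recorded into every bucket of the ring.
        void record(long nowMillis, long sample) {
            rotate(nowMillis);
            for (int i = 0; i < buckets.length; i++) {
                buckets[i] = Math.max(buckets[i], sample);
            }
        }

        long poll(long nowMillis) {
            rotate(nowMillis);
            return buckets[current];
        }

        // Clear the current bucket and advance once per elapsed rotation interval.
        private void rotate(long nowMillis) {
            while (nowMillis - lastRotateMillis >= rotateIntervalMillis) {
                buckets[current] = 0L;
                current = (current + 1) % buckets.length;
                lastRotateMillis += rotateIntervalMillis;
            }
        }
    }

    // Records one sample at t = 0 and returns the first whole second at which it reads as 0.
    static long millisUntilForgotten(int bufferLength, long rotateIntervalMillis) {
        RingMax ring = new RingMax(bufferLength, rotateIntervalMillis);
        ring.record(0L, 42L);
        long t = 0L;
        while (ring.poll(t) != 0L) {
            t += 1_000L;
        }
        return t;
    }

    public static void main(String[] args) {
        long expiryMillis = 60_000L;
        int bufferLength = 3;
        // Max-style: one buffer position per expiry.
        System.out.println("max-style:       " + millisUntilForgotten(bufferLength, expiryMillis));
        // Histogram-style: a full rotation of the buffer per expiry.
        System.out.println("histogram-style: " + millisUntilForgotten(bufferLength, expiryMillis / bufferLength));
    }
}
```

Under these assumptions the same sample is forgotten after 180000 ms in the max-style configuration and after 60000 ms in the histogram-style one. The expected behaviour would be for both numbers to be equal.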
Related issues In #2751 there is a complaint about max not expiring in the expected time, but the response was that TimeWindowMax uses expiry (and bufferLength) from DistributionStatisticConfig correctly.