oetiker / rrdtool-1.x


Normalization bug #766

Open ab-it opened 7 years ago

ab-it commented 7 years ago

Can't test this with any version other than RRDtool 1.4.8 at the moment, sorry. I believe this has been a problem since sub-second precision was introduced (but I'm guessing here). When part of a second is unknown and the other part is known, RRDtool normalizes the rate by averaging it with zero. This is especially wrong when recording temperatures, for instance. The problem only becomes visible when using 'N' as the timestamp.

In the script below I only use the rate 100. Due to the nature of using 'N' it is impossible to reproduce the same numbers every time, but I have seen rates as low as 8 resulting from the input 100. Due to this bug it would be possible for the stored rate to approach zero.

xff and the CF only play a role later in the process; changing them makes no difference in reproducing this problem.

```
rrdtool create check.rrd --step 1 DS:ds:GAUGE:3:U:U RRA:AVERAGE:0:1:120
rrdtool updatev check.rrd N:100
sleep 1
rrdtool updatev check.rrd N:100
sleep 4
rrdtool updatev check.rrd N:100
sleep 1
rrdtool updatev check.rrd N:100
sleep 1
rrdtool updatev check.rrd N:100
rrdtool fetch check.rrd AVERAGE --start end-15s
```

```
1486275014: nan
1486275015: nan
1486275016: nan
1486275017: nan
1486275018: nan
1486275019: nan
1486275020: nan
1486275021: 8.3761000000e+00
1486275022: nan
1486275023: nan
1486275024: nan
1486275025: nan
1486275026: nan
1486275027: 9.7212100000e+01
1486275028: 1.0000000000e+02
1486275029: nan
```
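Presumably the unknown part of a second is being averaged in as zero: if only a fraction f of a 1-second step is known at rate 100, the stored rate becomes 100 * f. The 8.376 above would then correspond to f ≈ 0.084 and the 97.212 to f ≈ 0.972 (my interpretation of the numbers, not verified against the code).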

oetiker commented 7 years ago

And what would you expect to happen?

ab-it commented 7 years ago

Please understand that this problem did not occur before. With a step size of 1 second, the entered rate used to be included as-is.

IMHO treating unknown as zero is wrong. This is not done in RRAs either, so it surprised me to see it happen in the computation of a new data point.

Some choices which would, IMHO, make more sense:

1. Make the interval unknown. Rationale: unknown + number = unknown.
2. Make the entire interval equal to the computed rate if half or more of it is known (similar to xff=0.5 in an RRA). The full rate, not half of it!
3. As 2, but make the fraction configurable.

Sorry, but I am currently unable to work on this myself. Even worse: I probably cannot upgrade, so I would not even be able to test a change, if any.

My workaround will be to use explicit timestamps instead of "N", as soon as I have time again. Until then I have to live with the occasional missed update (high load, once a day) resulting in my freezer reporting temperatures as high as -0.91 degrees centigrade.

[Graph: ds2_week. Red line: average temperature. Gray band: from minimum seen to maximum seen.]

ab-it commented 7 years ago

Let me reword options 2 and 3: discard the rate if more than half (or a selectable fraction) of the interval is unknown; use the computed rate for the entire interval if less than half (or a selectable fraction) is unknown.

I mean: it could happen that the first third of a second has a known rate of 50, the middle third is unknown, and the last third has another known rate of 190. The normalized rate would then be 120 (or unknown), but not (50 + 0 + 190)/3 = 80.

oetiker commented 7 years ago

I don't have the code in my head at the moment ... but it seemed like a pretty problematic thing in retrospect, since the amount of storage available in the relevant part of the rrd file structure is rather limited ... but your ideas might well work, so if you get the time, a PR would be welcome!

dbaarda commented 4 months ago

The correct way to aggregate a mixture of known and unknown rates into a PDP or CDP, when there is sufficient known data (more than 50% known for a DS's PDP, or less than xff unknown for an RRA's CDP), is to use the average rate over the known intervals only. This means the unknown periods are assumed to have the same average rate as the known intervals. Treating the unknown intervals as zero introduces "drop" artifacts in any PDP or CDP aggregated from rates that include unknowns.

So IMHO if you are aggregating over an interval where the first 1/3 has rate=50, the middle 1/3 has rate=U, and the last 1/3 has rate=190, the correct aggregated value should be (50*1/3 + 190*1/3) / (2/3) = 120.
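As a minimal sketch of that rule (my own illustration in C, not rrdtool's actual implementation), assuming NAN marks an unknown sub-interval:

```c
#include <math.h>
#include <stdio.h>

/* Average over the known sub-intervals only; return unknown when
 * half or more of the step is unknown (the hard-coded xff=0.5). */
static double aggregate_step(const double *rate, const double *dur, int n)
{
    double known_area = 0.0, known_time = 0.0, total_time = 0.0;
    for (int i = 0; i < n; i++) {
        total_time += dur[i];
        if (!isnan(rate[i])) {          /* NAN marks an unknown sub-interval */
            known_area += rate[i] * dur[i];
            known_time += dur[i];
        }
    }
    if (known_time <= total_time / 2)
        return NAN;
    return known_area / known_time;     /* buggy version divides by total_time */
}

int main(void)
{
    double rate[] = { 50.0, NAN, 190.0 };
    double dur[]  = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
    printf("%g\n", aggregate_step(rate, dur, 3));  /* prints 120 */
    return 0;
}
```

For the 50/U/190 example this prints 120; the buggy behaviour described above corresponds to dividing by total_time instead of known_time.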

So IMHO ab-it@ is correct: this is a bug and a regression. The example updates he presented should have resulted in the RRA containing only rates of 100 (where the rate=100 time was more than 50% of the PDP interval) or U (where the U time was more than 50% of the PDP interval). Note that xff doesn't apply for RRAs with steps=1, because each CDP then includes only one PDP, and aggregation into PDPs has an effectively hard-coded xff=0.5.

dbaarda commented 4 months ago

I believe this is caused (and documented) here:

https://github.com/oetiker/rrdtool-1.x/blob/b39df920f0ff31a49460d9872006a2579ee4c7ed/src/rrd_update.c#L1665

The good news is that this is not a fundamental bug/regression in rrd aggregation; it's just a rounding error caused by incomplete sub-second time resolution support. It should only really be visible for very small --step settings.
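As a toy illustration with assumed numbers (not a transcription of rrd_update.c): if a fixed sub-second slice of known time gets misaccounted as unknown/zero, the relative error is huge at --step 1 but negligible at --step 300:

```c
#include <stdio.h>

int main(void)
{
    double rate = 100.0, lost = 0.916;  /* seconds of known time misaccounted */

    /* effective stored rate if the lost time is averaged in as zero */
    double step1   = rate * (1.0   - lost) / 1.0;    /*  ~8.4 for --step 1   */
    double step300 = rate * (300.0 - lost) / 300.0;  /* ~99.7 for --step 300 */

    printf("step=1: %.3f  step=300: %.3f\n", step1, step300);
    return 0;
}
```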