python-diamond / Diamond

Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others). It is capable of collecting cpu, memory, network, i/o, load and disk metrics. Additionally, it features an API for implementing custom collectors for gathering metrics from almost any source.
http://diamond.readthedocs.org/
MIT License

network.py does not handle interface resets properly #663

Closed isotopp closed 2 years ago

isotopp commented 7 years ago

The data presented in /proc/net/dev for tx/rx bytes are counters, not rates.

The code in https://github.com/python-diamond/Diamond/blob/master/src/collectors/network/network.py#L116 tries to handle this, but does so incorrectly.

The current code handles overflows by subtracting 2^64. It does not take interface resets into account, which also produce negative deltas, but not with a step size of 2^64.

Consequently, whenever an interface resets, the data sent to Graphite is mangled and shows petabyte-sized data rate peaks. These trigger alerts and distort graph scaling.
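
To make the failure mode concrete, here is a minimal sketch of what a wrap-only correction does when the counter was reset rather than wrapped. The function name, the sample values and the 10 second interval are hypothetical; the sketch adds the wrap size to a negative delta, which is the usual form of such a correction, and is not copied from the Diamond source.

```python
COUNTER_MAX = 2 ** 64  # assumed wrap size of a 64-bit counter


def wrap_corrected_delta(old, new, counter_max=COUNTER_MAX):
    """Hypothetical wrap-only correction: every negative delta is
    treated as a counter overflow, never as a reset."""
    delta = new - old
    if delta < 0:
        delta += counter_max
    return delta


# Interface reset: the counter drops from ~5 GB back to nearly zero.
before_reset = 5_000_000_000
after_reset = 1_000
interval = 10  # seconds between samples (assumed)

delta = wrap_corrected_delta(before_reset, after_reset)
rate = delta / interval

print(delta)  # close to 2**64, although only about 1 kB was transferred
print(rate)   # an absurdly large bytes-per-second value
```

A normal, increasing sample pair passes through unchanged; only the reset case is inflated, because the correction assumes the counter travelled all the way up to 2^64 before coming back.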

It would in fact be better to either report the raw counter value and use nonNegativeDerivative() in Graphite to handle this, or duplicate that logic in Diamond, since it handles resets to 0 as well as overflows (which hardly ever happen with 64-bit counters).

isotopp commented 7 years ago

Graphite code in https://github.com/graphite-project/graphite-web/blob/master/webapp/graphite/render/functions.py#L1653

This code will first check if a maxValue is present (can be None). If a downstep is observed, a value of None is produced (i.e. the measurement is marked as invalid). That's better than logging Petabytes.
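
A rough sketch of that behaviour, written from the description above rather than copied from Graphite: the function name `non_negative_deltas` and the sample values are hypothetical, and the wrap arithmetic used when a maximum is given is my assumption of how such a correction is usually done.

```python
def non_negative_deltas(samples, max_value=None):
    """Return per-sample deltas of a counter series.

    A downstep yields None (sample marked invalid) unless max_value
    is given, in which case it is treated as a counter wrap.
    """
    deltas = []
    prev = None
    for cur in samples:
        # Missing or out-of-range samples are invalid and break the chain.
        if cur is None or (max_value is not None and cur > max_value):
            deltas.append(None)
            prev = None
            continue
        if prev is None:
            deltas.append(None)  # nothing to compare against yet
        elif cur >= prev:
            deltas.append(cur - prev)  # normal counter increase
        elif max_value is not None:
            deltas.append(max_value + 1 + cur - prev)  # assumed wrap
        else:
            deltas.append(None)  # reset: drop the sample
        prev = cur
    return deltas


# Counter resets between the third and fourth sample.
print(non_negative_deltas([100, 250, 400, 10, 60]))
# [None, 150, 150, None, 50]
```

With no maximum configured, the reset costs one data point instead of producing a spike, and the series resumes normally from the next sample.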