influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Large Derivative Values on New Tag #14886

Open iridos opened 5 years ago

iridos commented 5 years ago

Bug: DERIVATIVE produces a large spike (of the total size of the counter) when an additional tag is added to a counter and derivative(counter) is grouped by that tag.

Analysis: DERIVATIVE seems to interpret a missing metric ("NULL"?) as zero and calculates a derivative between the total counter value and zero.

Fix: do not interpret a missing metric as 0; instead, do not form a derivative when one of the two values is NULL.

Long story: I just started using InfluxDB to monitor the IO bandwidth and metadata usage of compute jobs on our central Lustre system. The Lustre client stats are reported as counters, so I use non_negative_derivative(max("read_bytes"), 1s) to show the data rates.
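A minimal sketch of such a rate query (the measurement and tag names here are assumptions for illustration, not from the original setup):

```sql
-- Hypothetical measurement "lustre_client" with a "host" tag.
-- non_negative_derivative() also drops the negative value that a
-- counter reset would otherwise produce.
SELECT non_negative_derivative(max("read_bytes"), 1s) AS "read_rate"
FROM "lustre_client"
WHERE time > now() - 1h
GROUP BY time(1m), "host"
```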

I can display those per node, using a nested query to show only the nodes with the biggest bandwidth. What really interests us, though, is the usage per user. For that, the exec script determines which user is running a job on the node and adds a tag with the username to the metric. Now I can GROUP BY user instead of GROUP BY host, as sketched below.
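The per-user variant would then look roughly like this (same hypothetical names as above):

```sql
-- "user" is the tag added to the metric by the exec script.
SELECT non_negative_derivative(max("read_bytes"), 1s) AS "read_rate"
FROM "lustre_client"
WHERE time > now() - 1h
GROUP BY time(1m), "user"
```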

But there are huge peaks when a user tag is first introduced to the metric by the collecting script. What seems to happen is that the derivative function interprets the missing metric as 0 and forms the derivative (counter_value - 0)/time: for a counter that has already reached, say, 5 TB, a single 60 s interval then shows a bogus rate of over 80 GB/s.

Workaround: I have tried to drop those values with: SELECT non_negative_derivative(max("read_bytes"), 1s)/(1+10000000000000*derivative(count("read_bytes"))) … The idea is that whenever the number of points in an interval changes (a series appearing or disappearing), derivative(count("read_bytes")) is non-zero, so the huge divisor scales the result down to practically zero. This works in many cases, but fails when a user has a lot of very short jobs: one stopped job and one started job in the same minute also give derivative(count("read_bytes")) == 0, yet the difference between the counters on the two nodes is still much larger than any regular IO.
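Spelled out as a full statement (the clauses elided by "…" above are filled in with the same hypothetical names used earlier, so treat them as assumptions), the workaround reads roughly:

```sql
-- When a series appears or disappears within an interval,
-- derivative(count("read_bytes")) is non-zero, so the divisor becomes
-- large in magnitude and the spurious spike is scaled toward zero;
-- otherwise the divisor is exactly 1 and the rate passes through.
SELECT non_negative_derivative(max("read_bytes"), 1s)
       / (1 + 10000000000000 * derivative(count("read_bytes")))
FROM "lustre_client"
WHERE time > now() - 1h
GROUP BY time(1m), "user"
```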

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

iridos commented 4 years ago

This issue has been marked as stale because it has been ignored for 3 months. Yay.