grafana / promql-anomaly-detection

A framework for anomaly detection using Prometheus and PromQL
Apache License 2.0

Problem of aligning boundary strips with real traffic #5

Open floppy84 opened 3 weeks ago

floppy84 commented 3 weeks ago

Hello, and thank you for your work. I tried to reach the Slack channel without success, so I am asking here. Over a week of data, I observe that the upper and lower bands are shifted in time relative to my actual traffic. Which parameter should I use to realign the bands with my actual traffic?

Image

Image

If I add a 30-minute offset to my actual traffic, I get a good match. But I don't know what the problem is.

Image

jcreixell commented 3 weeks ago

Hi @floppy84, thank you for reporting this issue. The delay is a byproduct of using a z-score for anomaly detection. The mid-line used as the reference for drawing the baselines is calculated with a moving average, which smooths out the metric but also introduces a delay. In a way, this is by design: sudden changes fall outside the baselines and are detected as anomalies before the baselines have a chance to catch up (the algorithm is tuned for detecting short-term anomalies).
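
To illustrate where the delay comes from, here is a minimal Python sketch (not PromQL, and not the framework's actual rules). It assumes the mid-line behaves like a plain trailing mean, similar in spirit to `avg_over_time(metric[1h])`, and uses a made-up metric sampled once per minute. A trailing mean lags a steadily rising signal by roughly half its window, which would be consistent with a shift of about 30 minutes for a 1h window.

```python
def trailing_mean(values, window):
    """Mean of the last `window` samples at each position (fewer at the start)."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the sample that left the window
        out.append(running / min(i + 1, window))
    return out

# Hypothetical metric: a linear ramp, one sample per minute for 3 hours.
ramp = [float(t) for t in range(180)]
mid_line = trailing_mean(ramp, window=60)  # 60 samples = 1h window (assumed)

# With a slope of 1 per minute, the vertical gap equals the lag in minutes.
lag_minutes = ramp[-1] - mid_line[-1]
print(lag_minutes)  # 29.5
```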

If this is not acceptable in your case, you could tune the time window of the moving average (1h by default). A larger window makes the mid-line react more slowly to trend changes (which makes the algorithm more sensitive), while a shorter window makes the mid-line track your metric more closely, at the cost of reduced sensitivity and a noisier mid-line.
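
The trade-off can be sketched in Python under simplifying assumptions: the bands are taken as the trailing mean plus or minus `threshold` times the trailing standard deviation over the same window, and the metric is a synthetic series with a step change. Both the band formula and the names `flagged_after_step` and `threshold` are hypothetical, chosen for illustration rather than taken from the framework's rules.

```python
import statistics

def flagged_after_step(window, threshold=2.0):
    """Count samples after a step change that sit above the upper band."""
    # Hypothetical metric: about 100 with +/-1 jitter, stepping to about 120 at sample 300.
    metric = [(100.0 if t < 300 else 120.0) + (1.0 if t % 2 else -1.0)
              for t in range(600)]
    flagged = 0
    for t in range(300, 600):
        history = metric[t - window + 1 : t + 1]  # trailing window, current sample included
        mid = statistics.fmean(history)
        spread = statistics.pstdev(history)
        if metric[t] > mid + threshold * spread:
            flagged += 1
    return flagged

# Windows expressed in samples; compare how long the step stays outside the band.
counts = {w: flagged_after_step(w) for w in (15, 60, 240)}
print(counts)
```

In this toy setup the longer windows keep the step outside the band for more samples, while the short window absorbs it almost immediately.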

Another option could be to replace the moving average with an exponential moving average, which gives more weight to recent data points. It could be computationally expensive, though, and we have not explored this path (you could probably approximate an exponential moving average using the built-in `holt_winters` function in Prometheus, renamed `double_exponential_smoothing` in Prometheus 3.0).

I hope this helps, and please let us know if you find a solution that works for you!