twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0

Sporadic anomalies #95

Open skaurus opened 6 years ago

skaurus commented 6 years ago

Hi!

We are using this library to detect anomalies in the number of web requests - to quickly notice potential problems. Detection goes like this: res = AnomalyDetectionTs(data, max_anoms=0.005, direction='both', only_last="hr", plot=FALSE)

data is imported from CSV (data = read.csv("data.csv",head=FALSE)) and has two columns: datetime and number of requests. When it works correctly, it detects an anomaly and then reports it every 5 minutes (the script is called every 5 minutes from cron) for an hour, until the anomaly falls out of the only_last scope. But sometimes the script reports a different anomaly on every call, when there are really no anomalies. So far this has happened twice, both times on holidays. I have to temporarily comment the script out in cron to stop it.
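For reference, the setup described above could look roughly like the following sketch. The helper names load_requests and detect_last_hour, the column names timestamp and count, and the timestamp format are assumptions made for illustration; they are not taken from the real script. The AnomalyDetectionTs call uses the same arguments as the one quoted above.

```r
# Hypothetical helper: read a header-less, two-column CSV (datetime, request count)
# into a data frame with a POSIXct first column and a numeric second column.
load_requests <- function(path, tz = "UTC") {
  data <- read.csv(path, header = FALSE,
                   col.names = c("timestamp", "count"),
                   stringsAsFactors = FALSE)
  # Assumed format "YYYY-MM-DD HH:MM:SS"; adjust to match the real file.
  data$timestamp <- as.POSIXct(data$timestamp,
                               format = "%Y-%m-%d %H:%M:%S", tz = tz)
  data$count <- as.numeric(data$count)
  data
}

# Hypothetical wrapper: run detection on the last hour only.
detect_last_hour <- function(data) {
  if (!requireNamespace("AnomalyDetection", quietly = TRUE)) {
    stop("the AnomalyDetection package is not installed")
  }
  res <- AnomalyDetection::AnomalyDetectionTs(
    data, max_anoms = 0.005, direction = "both",
    only_last = "hr", plot = FALSE
  )
  res$anoms  # flagged timestamps and values
}
```

Parsing the timestamps explicitly makes it easier to rule out a datetime parsing problem before looking at the detector itself.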

I tried increasing max_anoms, and all it does is move the reported anomaly back in time until it reaches exactly the -1h mark. And these are not real anomalies either.

I have a dataset that causes this behavior: https://pastebin.com/raw/7BxkYTJZ (0.5Mb)

What can I do to fix it? I have zero experience with R, unfortunately... The script was easy enough to write, but debugging it is over my head.

skaurus commented 6 years ago

And this weekend it happened again.

addos commented 6 years ago

Hey, I don't know much about twitter anomalies, but I was trying to see if anything looked weird from the data you uploaded, and saw some of these. Not sure how accurate they are though. https://pastebin.com/4ZaUXcu2

addos commented 6 years ago

Dec 21 12:03am, Dec 23 12:27pm, Dec 23 12:37pm, and Dec 23 2:08pm might also be anomalous.

skaurus commented 6 years ago

Hey!

What do pass 1 and pass 2 mean? Let's take the first two rows from pass 1 as an example. Looking at the neighboring values, they do not look like anomalies to me. Also, as a matter of fact, there were no problems at that time.
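A quick way to do this kind of neighbor check in R is sketched below. The helper name neighbors_around and the column names timestamp and count are assumptions, and the example data is synthetic, not the dataset from this thread.

```r
# Hypothetical helper: return the rows within `window_mins` minutes
# of a flagged timestamp, so the flagged value can be compared by eye.
neighbors_around <- function(data, when, window_mins = 10) {
  delta <- abs(as.numeric(difftime(data$timestamp, when, units = "mins")))
  data[delta <= window_mins, , drop = FALSE]
}

# Synthetic one-minute series, for illustration only.
ts <- seq(as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
          by = "1 min", length.out = 60)
demo <- data.frame(timestamp = ts, count = 100)

# Rows within 5 minutes of the 30th observation.
near <- neighbors_around(demo, ts[30], window_mins = 5)

# How far each nearby value is from the local median.
near$deviation <- near$count - median(near$count)
```

If the flagged point's deviation is no larger than that of its neighbors, that supports the impression that it is a false positive.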

There are a few more reasons why this looks like a bug. First, no matter how high I set max_anoms, it still finds an anomaly somewhere in this data. Second, the reported anomaly changes every time the script runs (each run has 5 more minutes of data, due to the cron settings).
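The second point can be checked mechanically. As described earlier in the thread, a correctly detected anomaly keeps being reported at the same timestamp on every run while it stays inside the only_last window. The sketch below compares the timestamps flagged by two consecutive runs; the helper name stable_anomalies is an assumption, and the inputs are made-up timestamps rather than real output.

```r
# Hypothetical helper: keep only the timestamps flagged in both runs.
# Comparison is done on the underlying numeric epoch values.
stable_anomalies <- function(prev_anoms, curr_anoms) {
  curr_anoms[as.numeric(curr_anoms) %in% as.numeric(prev_anoms)]
}

# Made-up flagged timestamps from two consecutive cron runs.
run1 <- as.POSIXct(c("2020-01-01 00:05:00", "2020-01-01 00:10:00"), tz = "UTC")
run2 <- as.POSIXct(c("2020-01-01 00:10:00", "2020-01-01 00:15:00"), tz = "UTC")

# Only 00:10:00 appears in both runs.
stable <- stable_anomalies(run1, run2)
```

An empty result on every pair of runs would match the behavior reported here, where each call flags a different point.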

addos commented 6 years ago

Just differences in algorithms. In pass 1, there was definitely a weird dip around Dec 18 4:23am. But you also have access to the data that these numbers represent, so if you looked at it and know of nothing weird, then they are probably just false positives.