nathanielc / morgoth

Metric anomaly detection
http://docs.morgoth.io
Apache License 2.0
280 stars 31 forks

Detects every data point as anomalous. #41

Open divgwd opened 7 years ago

divgwd commented 7 years ago

Hi, I am consuming system metrics from Kafka and inserting them into InfluxDB, where the Morgoth script runs to detect anomalous system behaviour based on those metrics. The problem is that Morgoth logs every metric it receives as anomalous. I have attached a screenshot of my dataset and of the anomalous dataset, as well as my TICKscript. Any help is appreciated! Thanks in advance!

From my InfluxDB:

select VALUE,anomalyScore from anomaly_cpu where NODE_NAME='PROC-1' AND METRIC='system.cpu.idle' AND time >= 1487919960000000000 and time <=1487920500000000000;

name: anomaly_cpu
time                 VALUE       anomalyScore
1487919960000000000  87.2257319  0.95
1487920020000000000  87.2394801  0.9523809523809523
1487920080000000000  87.2379489  0.9545454545454546
1487920140000000000  87.2407403  0.9565217391304348
1487920200000000000  87.2488526  0.9583333333333334
1487920260000000000  87.2469828  0.96
1487920320000000000  87.2715053  0.9615384615384616
1487920380000000000  87.245502   0.962962962962963
1487920440000000000  87.2551249  0.9642857142857143
1487920500000000000  87.2479187  0.9655172413793104

select VALUE from system where NODE_NAME='PROC-1' AND METRIC='system.cpu.idle' AND time >= 1487919960000000000 and time <=1487920500000000000;

name: system
time                 VALUE
1487919960000000000  87.2257319
1487920020000000000  87.2394801
1487920080000000000  87.2379489
1487920140000000000  87.2407403
1487920200000000000  87.2488526
1487920260000000000  87.2469828
1487920320000000000  87.2715053
1487920380000000000  87.245502
1487920440000000000  87.2551249
1487920500000000000  87.2479187

cpu_alert_tick.docx

nathanielc commented 7 years ago

@divgwd Can you share your TICKscript as well?

divgwd commented 7 years ago

I attached it as a doc.

On 24-Feb-2017 10:11 pm, "Nathaniel Cook" notifications@github.com wrote:

@divgwd Can you share your TICKscript as well?


nathanielc commented 7 years ago

@divgwd Sorry, I missed that. It looks like your data points are 1m apart. You are windowing the data into 1m buckets, which means that each window has only a single point. Morgoth expects more points per window; typically 30-60 points is best. Since Morgoth calculates the mean and stddev of the window, it gets a stddev of 0 for windows with a single point. As a result, all points are marked as anomalous, since all points are more than 3*stddev = 0 away from the mean.
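To make the arithmetic concrete, here is a minimal Python sketch using the ten values from the query output in the first post. It is not Morgoth's code; the helper `window_stats` is a hypothetical name, and the use of the population standard deviation is an assumption made only for illustration. It shows that a window holding one point always has a stddev of 0, while a window holding all ten points does not.

```python
from statistics import mean, pstdev

# CPU idle values copied from the query output in the first post (one point per minute).
values = [87.2257319, 87.2394801, 87.2379489, 87.2407403, 87.2488526,
          87.2469828, 87.2715053, 87.245502, 87.2551249, 87.2479187]

def window_stats(points):
    """Hypothetical helper: return (mean, population stddev) for one window."""
    return mean(points), pstdev(points)

# 1m window over 1m data: every window holds exactly one point.
single_point_windows = [window_stats([v]) for v in values]

# A wider window over the same data: one window holding all ten points.
wide_window = window_stats(values)

for m, s in single_point_windows:
    # The stddev is 0 here, so a 3*stddev band around the mean has zero width.
    print(f"single-point window: mean={m:.4f} stddev={s}")

print(f"ten-point window:    mean={wide_window[0]:.4f} stddev={wide_window[1]:.4f}")
```

With a zero-width band, any difference at all between one window and the next falls outside it, which matches the behaviour described above.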

Either increase the frequency at which you are collecting data or increase the Morgoth window to 10m or something larger.
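As a side check, the anomalyScore values in the query output from the first post follow a simple pattern: they match 1 - 1/n for n = 20 through 29. The Python sketch below only verifies that numerical observation. The starting value n = 20 is inferred from the first score (0.95), and reading n as "number of windows seen so far" is an assumption, not something confirmed in this thread.

```python
# anomalyScore values copied from the anomaly_cpu query output in the first post.
scores = [0.95, 0.9523809523809523, 0.9545454545454546, 0.9565217391304348,
          0.9583333333333334, 0.96, 0.9615384615384616, 0.962962962962963,
          0.9642857142857143, 0.9655172413793104]

# Assumption: n starts at 20, chosen because 1 - 1/20 = 0.95 matches the first score.
first_n = 20

matches = []
for offset, score in enumerate(scores):
    n = first_n + offset
    expected = 1 - 1 / n
    # Compare with a small tolerance rather than exact float equality.
    ok = abs(score - expected) < 1e-12
    matches.append(ok)
    print(f"n={n} score={score} 1-1/n={expected:.16f} match={ok}")
```

If that reading is right, a score that creeps towards 1 by one step per window would be what you expect when every window looks new, which fits the single-point-window explanation.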

divgwd commented 7 years ago

Thanks, that worked! I have a question: every time I disable and enable the Morgoth task, will it be like a new task or will it continue as it previously did? My TICKscript monitors anomalous CPU idle usage for my server, but even after stressing the server manually I do not see any anomalous data detected. I have attached a screenshot of the graph plotted in Grafana. My window is 10m, error tolerance is 0.00, and I am collecting system metrics every 10s using Telegraf. (screenshot: morgoth_anomaly)

nathanielc commented 7 years ago

Perfect! Now that you have a real data set that you want to detect anomalies on, you can use the record/replay features of Kapacitor to iterate on the Morgoth-specific settings to catch the anomalies.

Something like this should work.

# Create a recording
kapacitor record query -type stream -query 'select cpu_usage_idle from telegraf.autogen.cpu where <time range of above graph>'
# Remember the recording ID that is printed by the above command

# Repeat as needed
# Update the task with new settings
kapacitor define cpu_morgoth_task -tick cpu_morgoth_task.tick
# Replay the recording against the Morgoth task
kapacitor replay -recording RECORDING_ID -task cpu_morgoth_task -rec-time

Then you can iterate on the task until it catches the anomalies you want and nothing more.

A few suggestions to get started.

Finally, be careful not to overfit your data: if you tune the task to precisely catch the anomalies in your recording, it may not generalize well to catching new anomalies later. You can test this by making a new recording and testing against it after you have the task working on your first recording.

In answer to your question above:

so everytime I disable and enable the morgoth task will it be like a new task or will it continue as it previously did??

It will start over like a new task. The plumbing is in place to preserve it but it is not quite implemented yet. I'll open an issue to track getting that finished.

divgwd commented 7 years ago

Hi, I have Morgoth running on system metrics that arrive at 10s intervals. The Morgoth window is 10 minutes, so I have 60 samples per window. The error tolerance and min support are set to their default values, and sigma is 3.3. The system is fairly static, without much activity, yet Morgoth detects anomalous data when no data is received for those metrics. I have attached snapshots of the same (conn_aborts, conn_aborts1). Is there something I have missed or overlooked? Thanks in advance :)