numenta / NAB

The Numenta Anomaly Benchmark

Issue in interpretation of labels #351

Open enthu-sh opened 5 years ago

enthu-sh commented 5 years ago

I want to use the realAWS dataset for anomaly detection, but the anomaly labels are not clear. The labels folder contains two files, combined_labels.json and combined_windows.json, and their entries for the same data file do not match. For example:

"realAWSCloudwatch/ec2_cpu_utilization_24ae8d.csv": [ [ "2014-02-26 13:45:00.000000", "2014-02-27 06:25:00.000000" ], [ "2014-02-27 08:55:00.000000", "2014-02-28 01:35:00.000000" ] ],

This is the entry in combined_windows.json, and

"realAWSCloudwatch/ec2_cpu_utilization_24ae8d.csv": [ "2014-02-26 22:05:00", "2014-02-27 17:15:00" ],

this is the entry in combined_labels.json.

Why is there such a mismatch? Which file is the correct one to use?

subutai commented 5 years ago

The labels represent individual ranges labeled by humans.

NAB uses anomaly windows for scoring because the anomalies are temporal and can span a period of time. These windows are stored in combined_windows.json and are calculated from the individual labels, as described in 'Appendix B: Label combining algorithm' of the NAB whitepaper.
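To see how the two entries quoted in the question relate, a small check like the following may help. It is only a sketch: the timestamps are copied from the question, while `parse` and `window_for` are hypothetical helper names and not part of NAB.

```python
from datetime import datetime

# Entries copied from the question above.
windows = [
    ["2014-02-26 13:45:00.000000", "2014-02-27 06:25:00.000000"],
    ["2014-02-27 08:55:00.000000", "2014-02-28 01:35:00.000000"],
]
labels = ["2014-02-26 22:05:00", "2014-02-27 17:15:00"]


def parse(ts):
    # The two files use different timestamp precision, so try both formats.
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(ts, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {ts}")


def window_for(label, windows):
    """Return (start, end) of the first window containing label, else None."""
    t = parse(label)
    for start, end in windows:
        s, e = parse(start), parse(end)
        if s <= t <= e:
            return s, e
    return None


for label in labels:
    t = parse(label)
    s, e = window_for(label, windows)
    print(label, "is inside", s, "to", e, "| before:", t - s, "| after:", e - t)
```

For this file, each label falls inside exactly one window and sits 8 hours 20 minutes from both edges, so the two files describe the same two anomalies: one as a point, the other as a range around that point.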

Sometimes if two labels are close together, they will be combined into one window, so that might have happened here.
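The merging idea can be illustrated roughly as follows. This is a simplified sketch and not NAB's actual implementation (see Appendix B of the whitepaper for that). The function name `build_windows` is hypothetical, the half-width of 8 hours 20 minutes is just the distance between each label and its window edges in the entries quoted in the question, and the "close" label in the second example is made up.

```python
from datetime import datetime, timedelta


def build_windows(label_times, half_width):
    """Center a window on each label and merge windows that overlap."""
    merged = []
    for t in sorted(label_times):
        start, end = t - half_width, t + half_width
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Assumed half-width, read off the entries quoted in the question.
half_width = timedelta(hours=8, minutes=20)

# The two labels from the question: far enough apart to stay separate.
issue_labels = [datetime(2014, 2, 26, 22, 5), datetime(2014, 2, 27, 17, 15)]
separate = build_windows(issue_labels, half_width)

# A made-up second label only about five hours after the first: the windows merge.
close_labels = [datetime(2014, 2, 26, 22, 5), datetime(2014, 2, 27, 3, 0)]
combined = build_windows(close_labels, half_width)

print(len(separate), "windows:", separate)
print(len(combined), "window:", combined)
```

With the labels from the question, the first window ends at 06:25 and the second starts at 08:55, so they do not overlap and remain two separate windows, matching the two entries in combined_windows.json.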