huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0
StreamDM-104: Anomaly/Outlier detection #108

Closed hmgomes closed 5 years ago

hmgomes commented 5 years ago

This pull request addresses #107

Summary of the changes

This pull request adds the classes needed to implement and evaluate anomaly/outlier detection algorithms in StreamDM. It introduces four new classes and two new synthetic sample datasets.

Tests

All tests use two synthetic data streams: stream2000_7anom.arff (7 outliers and 1993 normal instances) and stream2500_51anom.arff (51 outliers and 2449 normal instances). Both datasets are included in this PR.

The expected output of every test is 10 rows of statistics in the corresponding result_*.csv file.
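The figure of 10 rows matches the reader settings used in the test commands, assuming that -k is the number of instances per chunk, -i is the total number of instances read, and one row of statistics is written per chunk. These readings of the flags are assumptions, not something this PR states. Under them, the arithmetic can be sketched in Python (the function name is hypothetical):

```python
def expected_result_rows(instance_limit: int, chunk_size: int) -> int:
    """Return the number of chunks needed to cover instance_limit instances.

    Assumption: the evaluator writes one statistics row per chunk.
    """
    if instance_limit <= 0 or chunk_size <= 0:
        raise ValueError("instance_limit and chunk_size must be positive")
    # Integer ceiling division, so a final partial chunk still counts as one row.
    return (instance_limit + chunk_size - 1) // chunk_size


# Settings taken from the two test commands: (-i 2000, -k 200) and (-i 2500, -k 250).
rows_stream2000 = expected_result_rows(2000, 200)
rows_stream2500 = expected_result_rows(2500, 250)
```

Both settings divide evenly, so each run would cover its whole file in 10 full chunks.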

stream2000_7anom.arff

./spark.sh "EvaluateOutlierDetection  -o (SWNearestNeighbors -n 100) -s (FileReader -f ../data/stream2000_7anom.arff -k 200 -i 2000) -e (BasicClassificationEvaluator -m) -h -t 0.5"  1> result_stream2000_7anom_k200_i2000_SWNN_n100_t05.csv 2> log_stream2000_7anom_k200_i2000_SWNN_n100_t05.log

stream2500_51anom.arff

./spark.sh "EvaluateOutlierDetection  -o (SWNearestNeighbors -n 100) -s (FileReader -f ../data/stream2500_51anom.arff -k 250 -i 2500) -e (BasicClassificationEvaluator -m) -h -t 0.5"  1> result_stream2500_51anom_k250_i2500_SWNN_n100_t05.csv 2> log_stream2500_51anom_k250_i2500_SWNN_n100_t05.log
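One way to check a run against the expected 10 rows is to count the data rows in the redirected CSV file. The sketch below is a hypothetical helper, not part of StreamDM. It assumes that the file contains only CSV output and that the first non-empty line is a header (which the -h flag may control); pass has_header=False if that is not the case.

```python
import csv


def count_data_rows(path: str, has_header: bool = True) -> int:
    """Count non-empty CSV rows in path, optionally skipping one header row."""
    with open(path, newline="") as f:
        # csv.reader yields an empty list for blank lines; drop those.
        rows = [row for row in csv.reader(f) if row]
    if has_header and rows:
        rows = rows[1:]
    return len(rows)


# Example usage (file name taken from the first test command):
# count_data_rows("result_stream2000_7anom_k200_i2000_SWNN_n100_t05.csv")
```

If the count differs from 10, the matching log_*.log file from the same command is the first place to look.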