huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0

Adding multiclass evaluation metrics #77

Closed hmgomes closed 6 years ago

hmgomes commented 6 years ago

The core change is the addition of evaluation metrics for multiclass classification problems. A few incidental changes are also included, such as removing debug println(…) messages that were being written to the results file. Further details about the changes can be found below, along with the tests used to verify the implemented evaluation metrics.

DenseInstance.scala and SparseInstance.scala

ClusteringEvaluator.scala, Evaluator.scala and BasicClassificationEvaluator.scala

streamDMJob.scala

FileReader.scala

EvaluatePrequential.scala

BasicClassificationEvaluator.scala

Tests

These tests use the normalized cover type dataset. Instructions to obtain the dataset and prepare for the tests:

  1. Download it from here: https://github.com/hmgomes/AdaptiveRandomForest/blob/master/COVT.arff.zip
  2. Unzip it and move the dataset to ../data under the streamDM project directory; the commands below expect it at ../data/covtypeNorm.arff.

OUTPUT: Avg statistics + per class statistics + confusion matrix (full output)

./spark.sh "200 EvaluatePrequential -l (trees.HoeffdingTree -l 0 -t 0.05 -g 200 -o) -s (FileReader -f ../data/covtypeNorm.arff -k 5810 -d 10 -i 581012) -e (BasicClassificationEvaluator) -h" 1> result_COVT.txt 2> log_COVT.log

OUTPUT: Avg statistics + confusion matrix (no per class statistics)

./spark.sh "200 EvaluatePrequential -l (trees.HoeffdingTree -l 0 -t 0.05 -g 200 -o) -s (FileReader -f ../data/covtypeNorm.arff -k 5810 -d 10 -i 581012) -e (BasicClassificationEvaluator -c) -h" 1> result_COVT_noPerclass.txt 2> log_COVT_noPerclass.log

OUTPUT: Avg statistics + per class statistics (no confusion matrix)

./spark.sh "200 EvaluatePrequential -l (trees.HoeffdingTree -l 0 -t 0.05 -g 200 -o) -s (FileReader -f ../data/covtypeNorm.arff -k 5810 -d 10 -i 581012) -e (BasicClassificationEvaluator -m) -h" 1> result_COVT_noConfMat.txt 2> log_COVT_noConfMat.log

OUTPUT: Avg statistics only

./spark.sh "200 EvaluatePrequential -l (trees.HoeffdingTree -l 0 -t 0.05 -g 200 -o) -s (FileReader -f ../data/covtypeNorm.arff -k 5810 -d 10 -i 581012) -e (BasicClassificationEvaluator -c -m) -h" 1> result_COVT_onlyAvg.txt 2> log_COVT_onlyAvg.log
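
For readers checking the result files by hand, the sketch below shows how the three kinds of output named above (average statistics, per-class statistics, confusion matrix) relate to one another under the textbook definitions. It is not the code from this pull request: the object name MulticlassMetricsSketch and all of its functions are hypothetical, and the exact metrics, averaging scheme and output layout of BasicClassificationEvaluator may differ.

```scala
// Illustrative sketch only: every name here is hypothetical, not a streamDM API.
object MulticlassMetricsSketch {

  /** Confusion matrix from (trueClass, predictedClass) pairs.
    * Rows index the true class, columns index the predicted class. */
  def confusionMatrix(pairs: Seq[(Int, Int)], numClasses: Int): Array[Array[Long]] = {
    val m = Array.fill(numClasses, numClasses)(0L)
    for ((truth, predicted) <- pairs) m(truth)(predicted) += 1
    m
  }

  // Returns 0.0 instead of NaN when a class was never seen or never predicted.
  private def ratio(num: Long, den: Long): Double =
    if (den == 0) 0.0 else num.toDouble / den

  /** Fraction of instances on the diagonal. */
  def accuracy(m: Array[Array[Long]]): Double =
    ratio(m.indices.map(i => m(i)(i)).sum, m.map(_.sum).sum)

  /** Per-class recall: correct predictions of c over the row total of c. */
  def recall(m: Array[Array[Long]], c: Int): Double =
    ratio(m(c)(c), m(c).sum)

  /** Per-class precision: correct predictions of c over the column total of c. */
  def precision(m: Array[Array[Long]], c: Int): Double =
    ratio(m(c)(c), m.map(_(c)).sum)

  /** Unweighted (macro) average of a per-class metric. */
  def macroAverage(perClass: Seq[Double]): Double =
    if (perClass.isEmpty) 0.0 else perClass.sum / perClass.size

  def main(args: Array[String]): Unit = {
    // Toy data with three classes, not taken from the cover type dataset.
    val pairs = Seq((0, 0), (0, 1), (1, 1), (1, 1), (2, 0), (2, 2))
    val m = confusionMatrix(pairs, 3)

    println(f"accuracy = ${accuracy(m)}%.4f")
    for (c <- m.indices)
      println(f"class $c: precision = ${precision(m, c)}%.4f, recall = ${recall(m, c)}%.4f")
    println(f"macro precision = ${macroAverage(m.indices.map(precision(m, _)))}%.4f")
    println(f"macro recall = ${macroAverage(m.indices.map(recall(m, _)))}%.4f")
    m.foreach(row => println(row.mkString(" ")))
  }
}
```

In these terms, the -c variant above would correspond to skipping the per-class loop and the -m variant to skipping the final matrix dump, while the averaged figures are printed in all four runs.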