huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0

StreamDM-103: Random Forest for Classification #106

Closed hmgomes closed 6 years ago

hmgomes commented 6 years ago

This pull request addresses #105

Summary of the changes

This pull request includes a first version of the RandomForest implementation in StreamDM. It is based on the algorithm defined in this paper, without the drift detector and background learner concepts. The PR adds one new class, RandomForest.scala, and changes the existing classes Node.scala and HoeffdingTree.scala.
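
For reviewers less familiar with how a forest turns per-tree outputs into a single prediction, here is a minimal sketch of vote aggregation. It is not taken from RandomForest.scala: `BaseTree`, `votes`, and `predict` are hypothetical names used only for illustration, and the actual class may weight or combine votes differently.

```scala
// Illustrative only: these names are hypothetical and are not StreamDM classes.
object ForestVoteSketch {
  // A base learner that returns one vote weight per class label.
  trait BaseTree {
    def votes(features: Array[Double]): Array[Double]
  }

  // Sum the per-class votes of all trees and return the index of the winning class.
  def predict(trees: Seq[BaseTree], features: Array[Double], numClasses: Int): Int = {
    require(numClasses > 0, "numClasses must be positive")
    val total = Array.fill(numClasses)(0.0)
    for {
      tree <- trees
      (vote, classIndex) <- tree.votes(features).zipWithIndex
      if classIndex < numClasses
    } total(classIndex) += vote
    // On ties, indexOf returns the lowest class index.
    total.indexOf(total.max)
  }
}
```

The tie-breaking rule (lowest class index wins) is a choice made for this sketch, not something stated in the PR.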

Tests

All tests use the electNormNew.arff dataset (available in the project /data directory).

The expected output for every test: 100 rows of statistics in the result_*.txt file.
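
The figure of 100 rows is consistent with the FileReader settings used in the test commands, under assumptions this PR does not confirm: that `-k 453` is the chunk size, that `-i 45312` is the instance limit, and that one row of statistics is written per full chunk. A small sketch of that arithmetic (the object and value names are hypothetical):

```scala
// Assumptions (not confirmed by this PR): -k = chunk size, -i = instance limit,
// and one row of statistics is written per full chunk.
object ExpectedRows {
  val instanceLimit: Int = 45312 // value passed with -i
  val chunkSize: Int = 453       // value passed with -k

  // Number of chunks that are completely filled (integer division).
  val fullChunks: Int = instanceLimit / chunkSize
  // Instances remaining after the last full chunk.
  val leftover: Int = instanceLimit % chunkSize
}
```

Under these assumptions there are 100 full chunks, with 12 instances left over.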

  1. Hyper-parameter -s (number of trees)

number of trees = 10

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -s 10) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_s10.txt 2> log_elec_rf_s10.log

number of trees = 100

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -s 100) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_s100.txt 2> log_elec_rf_s100.log

  2. Setting the hyper-parameters of the base model, i.e., HoeffdingTree

Maximum depth = 5

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l (HoeffdingTree -h 5)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_rf_maxDepth5.txt 2> log_rf_maxDepth5.log

Node learner = majority vote (-l 0)

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l (HoeffdingTree -l 0)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_learningNode0.txt 2> log_elec_rf_learningNode0.log

  3. Hyper-parameter -m (number of randomly selected features to consider at every split). There are 4 options, selected with -o:
    • "Specified m" -> an absolute number, e.g. 2 features or 5 features.
    • "sqrt(M)+1" -> the square root of the total number of features M, plus one.
    • "M-(sqrt(M)+1)" -> the total number of features M minus (sqrt(M) + 1).
    • "Percentage" -> a percentage of M. If negative, it corresponds to M minus |m|% of M (e.g. -m -20 selects 80% of the features).
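
The four options can be sketched as a single function from the mode, M, and the -m value to a feature count. This is a hedged illustration rather than the code in RandomForest.scala: `FeatureSubsetSize`, `Mode`, and `compute` are hypothetical names, and the rounding (nearest integer) and the clamping of the result to the range 1..M are assumptions. Only the fallback to M when m exceeds M is stated in this PR.

```scala
// Hypothetical helper, not part of the StreamDM API: maps the four -o modes
// to a concrete number of features considered at each split.
object FeatureSubsetSize {
  sealed trait Mode
  case object SpecifiedM extends Mode        // "Specified m"
  case object SqrtMPlus1 extends Mode        // "sqrt(M)+1"
  case object MMinusSqrtMPlus1 extends Mode  // "M-(sqrt(M)+1)"
  case object Percentage extends Mode        // "Percentage"

  /** @param mode          one of the four modes above
    * @param totalFeatures M, the total number of features
    * @param m             the value passed with -m (ignored by the sqrt modes)
    */
  def compute(mode: Mode, totalFeatures: Int, m: Int = 0): Int = {
    // Assumption: sqrt(M) is rounded to the nearest integer before adding one.
    val sqrtPlus1 = math.round(math.sqrt(totalFeatures.toDouble)).toInt + 1
    val raw = mode match {
      case SpecifiedM       => m
      case SqrtMPlus1       => sqrtPlus1
      case MMinusSqrtMPlus1 => totalFeatures - sqrtPlus1
      case Percentage =>
        // A negative percentage means "M minus |m|% of M", e.g. -20 -> 80%.
        val pct = if (m < 0) 100 + m else m
        math.round(totalFeatures * pct / 100.0).toInt
    }
    // Values above M fall back to M (as the -m 60 test expects);
    // the lower bound of 1 is an assumption of this sketch.
    math.max(1, math.min(totalFeatures, raw))
  }
}
```

For example, with M = 8 (the value the "m = All" test passes as -m 8), `FeatureSubsetSize.compute(FeatureSubsetSize.SpecifiedM, 8, 60)` returns 8 under these assumptions.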

m = 2

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 2 -o (Specified m)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_m2.txt 2> log_elec_rf_m2.log

m = 30% (Percentage (M * (m / 100)))

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 30 -o (Percentage)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mPerc30.txt 2> log_elec_rf_mPerc30.log

m = -20%, so actually 80%

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m -20 -o (Percentage)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mPerc80.txt 2> log_elec_rf_mPerc80.log

m = all features (-m 8)

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 8 -o (Specified m)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mAll.txt 2> log_elec_rf_mAll.log

m = more than the number of available features (-m 60); should fall back to using all available features

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 60 -o (Specified m)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mOver.txt 2> log_elec_rf_mOver.log

m = sqrt(M) + 1; should use the square root of the total number of features, plus one.

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -o (sqrt(M)+1)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mSqrt.txt 2> log_elec_rf_mSqrt.log

hmgomes commented 6 years ago

Dear @zhangjiajin,

Please check whether the latest changes are sufficient.

Cheers, Heitor