StreamDM-103: Random Forest for Classification

This pull request addresses #105

Summary of the changes

This pull request includes the first version of the RandomForest implementation in StreamDM. It is based on the algorithm defined in this paper, without the drift detector and background learner concepts. The PR adds one new class, RandomForest.scala, and modifies two existing classes, Node.scala and HoeffdingTree.scala.

Tests

All tests use the electNormNew.arff dataset (available in the project /data directory).

The expected output for every test: 100 rows of statistics in the corresponding result_*.txt file.

Hyper-parameter -s (number of trees)

number of trees = 10

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -s 10) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_s10.txt 2> log_elec_rf_s10.log

number of trees = 100

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -s 100) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_s100.txt 2> log_elec_rf_s100.log

Setting the hyper-parameters of the base model, i.e., HoeffdingTree

Maximum depth = 5

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l (HoeffdingTree -h 5)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_rf_maxDepth5.txt 2> log_rf_maxDepth5.log

Node learner = majority vote (-l 0)

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l (HoeffdingTree -l 0)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_learningNode0.txt 2> log_elec_rf_learningNode0.log

Hyper-parameter -m (number of randomly selected features to consider at every split). The -o option selects how -m is interpreted. There are 4 options:
- "Specified m" -> an absolute number, e.g. 2 features or 5 features.
- "sqrt(M)+1" -> the square root of the total number of features M, plus one.
- "M-(sqrt(M)+1)" -> the total number of features M, minus (the square root of M plus one).
- "Percentage" -> -m is read as a percentage of M. If negative, it corresponds to M minus m% of M (e.g. -20 means 80%).
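
To make the four modes concrete, here is a minimal sketch of how the number of features per split could be resolved. It is written in Python for illustration only and is not the Scala code in this PR. The function name resolve_subspace_size, the rounding choices (truncating the square root, rounding the percentage) and the lower bound of 1 are assumptions. M = 8 is inferred from the "m = All" test, which passes -m 8.

```python
import math

def resolve_subspace_size(mode: str, m: int, total_features: int) -> int:
    """Hypothetical helper: number of features to consider at each split."""
    M = total_features
    if mode == "Specified m":
        k = m                                  # absolute number of features
    elif mode == "sqrt(M)+1":
        k = int(math.sqrt(M)) + 1              # assumption: sqrt is truncated
    elif mode == "M-(sqrt(M)+1)":
        k = M - (int(math.sqrt(M)) + 1)
    elif mode == "Percentage":
        pct = 100 + m if m < 0 else m          # e.g. -20 is treated as 80
        k = round(M * pct / 100)               # assumption: round to nearest
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Never use more features than exist (per the "-m 60" test).
    # The lower bound of 1 is an assumption.
    return max(1, min(k, M))

M = 8  # inferred from the "m = All" test (-m 8)
print(resolve_subspace_size("Specified m", 2, M))    # 2
print(resolve_subspace_size("Percentage", 30, M))    # 2
print(resolve_subspace_size("Percentage", -20, M))   # 6
print(resolve_subspace_size("Specified m", 60, M))   # 8
print(resolve_subspace_size("sqrt(M)+1", 0, M))      # 3
```

The exact values produced by the Scala implementation may differ if it rounds differently; the sketch is only meant to show how each -o option maps -m to a feature count.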

m = 2

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 2 -o (Specified m)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_m2.txt 2> log_elec_rf_m2.log

m = 30% (Percentage (M * (m / 100)))

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 30 -o (Percentage)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mPerc30.txt 2> log_elec_rf_mPerc30.log

m = -20%, so actually 80%

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m -20 -o (Percentage)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mPerc80.txt 2> log_elec_rf_mPerc80.log

m = All

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 8 -o (Specified m)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mAll.txt 2> log_elec_rf_mAll.log

m = more than the number of available features (-m 60); should fall back to using all available features

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -m 60 -o (Specified m)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mOver.txt 2> log_elec_rf_mOver.log

m = sqrt(M) + 1; should use the square root of the total number of features, plus 1.

./spark.sh "200 EvaluatePrequential -l (meta.RandomForest -l HoeffdingTree -o (sqrt(M)+1)) -s (FileReader -f ../data/electNormNew.arff -k 453 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec_rf_mSqrt.txt 2> log_elec_rf_mSqrt.log
