huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0

Using Hoeffding Tree as the base learner for Bagging #84

Closed: hmgomes closed this issue 6 years ago

hmgomes commented 6 years ago

Expected behavior

It should be possible to use the Hoeffding Tree classifier as the base learner for the Bagging method. However, when I ran Bagging with the Hoeffding Tree as the base learner, execution failed.

Observed behavior

No classification results were generated. An exception was raised during execution.

Exception:

17/12/09 23:17:06 ERROR Executor: Exception in task 1.0 in stage 10.0 (TID 21)
java.lang.ClassCastException: org.apache.spark.streamdm.classifiers.trees.HoeffdingTreeModel cannot be cast to org.apache.spark.streamdm.core.ClassificationModel
    at org.apache.spark.streamdm.classifiers.meta.Bagging$$anonfun$ensemblePredict$1.apply$mcVI$sp(Bagging.scala:114)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.streamdm.classifiers.meta.Bagging.ensemblePredict(Bagging.scala:113)
    at org.apache.spark.streamdm.classifiers.meta.Bagging$$anonfun$predict$1.apply(Bagging.scala:97)
    at org.apache.spark.streamdm.classifiers.meta.Bagging$$anonfun$predict$1.apply(Bagging.scala:97)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
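
The trace points at a downcast inside `Bagging.ensemblePredict` (Bagging.scala:114): a `HoeffdingTreeModel` instance is being cast to `ClassificationModel`, and the JVM rejects it. The following is a minimal Java sketch of that failure pattern, not the streamDM source. All types here (`Model`, `ClassificationModel`, `HoeffdingTreeModel`, `LinearModel`) and the helper `castSucceeds` are hypothetical stand-ins that only mimic the relationship implied by the exception message, under the assumption that the tree model does not implement the classification interface.

```java
// Hypothetical stand-ins for the types named in the stack trace.
// These are NOT streamDM's real definitions; they only reproduce
// the kind of type relationship that makes such a cast fail.
public class CastSketch {
    interface Model {}

    interface ClassificationModel extends Model {
        double predict(double[] features);
    }

    // Assumption: the tree model implements only the generic Model type.
    static class HoeffdingTreeModel implements Model {}

    // A model that does implement the classification interface.
    static class LinearModel implements ClassificationModel {
        public double predict(double[] features) { return 0.0; }
    }

    // Mimics an ensemble that downcasts each base model before predicting.
    // Returns false when the JVM throws ClassCastException.
    static boolean castSucceeds(Model m) {
        try {
            ClassificationModel cm = (ClassificationModel) m;
            return cm != null;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(castSucceeds(new LinearModel()));        // true
        System.out.println(castSucceeds(new HoeffdingTreeModel())); // false
    }
}
```

The cast compiles because the static type is the shared supertype and only fails at runtime, which would be consistent with the error surfacing during task execution rather than at build time.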

Steps to reproduce the issue

I used two different datasets (electricity and covertype).

Command line

./spark.sh "200 EvaluatePrequential -l (meta.Bagging -l trees.HoeffdingTree) -s (FileReader -f ../data/elecNormNew.arff -k 4532 -d 10 -i 45312) -e (BasicClassificationEvaluator -c -m) -h" 1> result_elec.txt 2> log_elec.log
./spark.sh "200 EvaluatePrequential -l (meta.Bagging -l trees.HoeffdingTree) -s (FileReader -f ../data/covtypeNorm.arff -k 5810 -d 10 -i 581012) -e (BasicClassificationEvaluator -c -m) -h" 1> result_covt.txt 2> log_covt.log

Data sources: elecNormNew.arff and covtypeNorm.arff

Infrastructure details

hmgomes commented 6 years ago

Issue addressed by #85