huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0
492 stars, 147 forks

Bug Report StreamKM Clusterer #97

Closed (ioanna-ki closed 6 years ago)

ioanna-ki commented 6 years ago

Bug Report StreamKM Clusterer

Expected behavior

StreamKM should maintain an up-to-date coreset tree while running k-means clustering, and assign each input element to its nearest center.
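For reference, the assignment step described above can be sketched in a few lines. This is only an illustration in plain Python, not streamDM's implementation: the function name `assign_to_nearest`, the use of Euclidean distance, and the toy points and centers are all assumptions made for the example.

```python
import math

def assign_to_nearest(points, centers):
    """Pair each point with the index of its closest center (Euclidean distance)."""
    pairs = []
    for p in points:
        # Pick the center index whose distance to p is smallest.
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        pairs.append((p, idx))
    return pairs

# Made-up 2-D values, for illustration only.
centers = [(5.0, 3.4), (5.9, 2.8), (6.6, 3.0)]
points = [(5.1, 3.5), (6.0, 2.7), (6.7, 3.1)]

pairs = assign_to_nearest(points, centers)
print([idx for _, idx in pairs])  # [0, 1, 2]
```

Note that if only one center is available, every point necessarily gets index 0.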

Observed behavior

All of the input is assigned to a single cluster. The instance counter, the updated bucketmanager, and the clusters variable keep their values only inside the foreachRDD action. As a result, when the assign function is called, there is no data to work with.
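One generic way this kind of symptom can arise is when updates are applied to a copy of an object rather than to the object that is read later. The sketch below imitates that in plain Python, with `copy.deepcopy` standing in for a closure being serialized and rebuilt elsewhere. It is an analogy under that assumption, not streamDM or Spark code, and `ToyClusterer` is a hypothetical class; whether this is the actual mechanism here has not been confirmed.

```python
import copy

class ToyClusterer:
    """Hypothetical stand-in for a clusterer holding mutable state."""
    def __init__(self):
        self.num_instances = 0
        self.centers = []

    def train(self, batch):
        # Update the counter and pretend the first two points become centers.
        self.num_instances += len(batch)
        self.centers = list(batch[:2])

original = ToyClusterer()

# The deep copy plays the role of the object as seen inside the action.
inside_action = copy.deepcopy(original)
inside_action.train([(5.1, 3.5), (6.0, 2.7), (6.7, 3.1)])

print(inside_action.num_instances, len(inside_action.centers))  # 3 2
print(original.num_instances, len(original.centers))            # 0 0
```

The copy sees the trained state, while the original, which is what a later call would read, is still empty.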

Steps to reproduce the issue

Used iris.arff. Command line:

./spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (SocketTextStreamReader)" 1> result_iris_streamKM.txt 2> log_irisstreamKM.log

There is no error message, but if you print the output of the assign function (clpairs), every element is assigned to cluster index 0.

Infrastructure details

hmgomes commented 6 years ago

Hi @ioanna-ki, I believe it would be easier for others to test this using the FileReader option, something like:

./spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (FileReader -f ../data/iris.arff -k 10 -d 10 -i 150)" 1> result_iris_streamKM.txt 2> log_iris_streamKM.log

Best Regards, Heitor

hmgomes commented 6 years ago

Issue addressed by #99

Thanks Ioanna 👍