ioanna-ki commented 6 years ago

Bug report: StreamKM clusterer

Expected behavior

StreamKM should maintain an up-to-date coreset tree while running k-means clustering and assigning each input element to its nearest center.
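For reference, the nearest-center assignment described above can be sketched as follows. This is a minimal standalone illustration, not streamDM's implementation: the function names nearest_center and assign are hypothetical, the distance is assumed to be Euclidean, and the centers and points are made-up values that merely resemble iris measurements.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    if not centers:
        # With no centers there is nothing meaningful to assign to.
        raise ValueError("no centers available")
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def assign(points, centers):
    """Pair every point with the index of its nearest center."""
    return [(p, nearest_center(p, centers)) for p in points]

# Illustrative values only (hypothetical centers, not taken from a real run).
centers = [(5.0, 3.4, 1.5, 0.2), (5.9, 2.8, 4.3, 1.3), (6.6, 3.0, 5.6, 2.0)]
points = [(5.1, 3.5, 1.4, 0.2), (6.0, 2.9, 4.5, 1.5), (6.9, 3.1, 5.4, 2.1)]

pairs = assign(points, centers)
for point, idx in pairs:
    print(point, "->", idx)
```

With well-separated centers such as these, the points should spread across several cluster indices rather than all landing on one.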

Observed behavior

All input is assigned to a single cluster. The instance counter, the updated bucketmanager and the clusters variable keep their values only inside the foreachRDD action, so when the assign function is called there is no data for it to work with.

Steps to reproduce the issue

Dataset: iris.arff
Command line:

./spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (SocketTextStreamReader)" 1> result_iris_streamKM.txt 2> log_irisstreamKM.log

No error message is produced, but printing the output of the assign function (clpairs) shows that every element is assigned to cluster index 0.
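A quick way to confirm the symptom is to count how many elements land on each cluster index. The sketch below assumes the printed output has been collected as (instance, cluster_index) pairs; that shape, and the helper names cluster_histogram and is_degenerate, are assumptions made for illustration rather than part of streamDM.

```python
from collections import Counter

def cluster_histogram(clpairs):
    """Count how many instances were assigned to each cluster index."""
    return Counter(idx for _, idx in clpairs)

def is_degenerate(clpairs):
    """True when every instance ended up in the same cluster (or there are none)."""
    return len(cluster_histogram(clpairs)) <= 1

# Hypothetical sample mimicking the reported symptom: everything maps to index 0.
observed = [((5.1, 3.5, 1.4, 0.2), 0),
            ((6.0, 2.9, 4.5, 1.5), 0),
            ((6.9, 3.1, 5.4, 2.1), 0)]

print(cluster_histogram(observed))
print("degenerate:", is_degenerate(observed))
```

A histogram with a single key, as in this sample, matches the behavior reported here.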

Infrastructure details

Java Version: 1.8.0_144
Scala Version: 2.11.6
Spark version: 2.2.0
OS version: Ubuntu 16.04
Spark Standalone cluster mode

hmgomes commented 6 years ago

Hi @ioanna-ki, I believe it would be easier for others to test this using the FileReader option, something like:

./spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (FileReader -f ../data/iris.arff -k 10 -d 10 -i 150)" 1> result_iris_streamKM.txt 2> log_iris_streamKM.log

Best Regards, Heitor

hmgomes commented 6 years ago

Issue addressed by #99

Thanks Ioanna 👍

huawei-noah / streamDM