Closed ioanna-ki closed 6 years ago
Hi @ioanna-ki,
numInstances is a var defined within the scope of the StreamKM class, specifically here, thus updates to it are stored. Perhaps you observed a different error?
Best Regards, Heitor
Well, I call ClusteringTrainEvaluate with StreamKM as a parameter and text files as input. The thing is that inside getClusters of StreamKM the parameter numInstances is always zero, so kmeans.cluster is never called. I didn't change the code at all. Am I calling the app wrong?
The input that I use is in this form: 1.0 1.0 1.0 2.0 2.0 1.0 etc...
Please show me the command line you used
Best Regards, Heitor
./scripts/spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (TextStreamReader)"
and the data input is randomtreesampledata.txt, e.g.:
0.0 4.0,2.0,2.0,2.0,1.0,0.8060225109830121,0.9043923203457621,0.8116905753399468,0.35912458818647686,0.2256564761176031 etc...
Hi @ioanna-ki
Apparently the issue is related to TextStreamReader and not StreamKM. For example, using FileReader with an arff file should work:
./spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (FileReader -f ../data/iris.arff -k 10 -d 10 -i 150)" 1> result_iris_streamKM.txt 2> log_iris_streamKM.log
Please check if this one works and let me know.
Best Regards, Heitor
I got an error message. I attached the log here -> log_iris_streamKM.log
Just to confirm one thing, please provide the following details:
Infrastructure details
OK, I ran the same on socketTextStreamReader. I am getting SSE=680.82440 as a result for iris.arff, but the kmeans cluster method is never called and numInstances is zero. Also, if I print clpairs (the output stream of type (Example, Double)), the Example has a value, but the Double (which I am guessing is the center of the cluster) is always zero.
UPDATE: Java version "1.8.0_144", Scala version 2.11.6, Spark version 2.2.0, OS: Ubuntu 16.4, Spark Standalone cluster mode
You are right, there is indeed something off with getClusters, thanks for pointing that out @ioanna-ki
Currently, for the given input it assigns examples to cluster 0, probably because numInstances is 0 in getClusters (as you stated originally) and the actual clustering never gets executed.
Looking at another clustering method implementation (i.e. Clustream), I believe it can be corrected by performing the clustering within the train method (i.e. the same map where numInstances is updated).
All in all, we have some options in here:
I suggested option 2 because I will not be able to work on that in the following days, thus it might take a while before a fix is available. If you feel like contributing a solution to this, here is your opportunity :)
Best Regards, Heitor
I'll try to fix it, thanks for your time Heitor.
No problem, and I am glad you provided all the details and helped uncover this. I will close this now and we can create another issue later following the template (similar to this one #84)
Best Regards, Heitor
I am new to this kind of programming, but I think StreamKM has an issue, since the train method has a numInstances counter inside a foreachRDD. The value of numInstances is not stored and as a result the getClusters method does not work properly. Any thoughts?
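To make the concern about the counter concrete, here is a minimal sketch in plain Scala (no Spark needed) of one way such a counter can appear to stay at zero. It assumes the learner object is serialized and the update runs on a copy, which is how Spark ships closures to executors. Whether this is the actual cause in StreamKM is not confirmed in this thread. Counter, shipToExecutor and ClosureCopyDemo are hypothetical names for illustration only and are not streamDM code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a learner that keeps a mutable counter (not streamDM code).
class Counter extends Serializable {
  var numInstances: Long = 0L

  // Count every example seen, similar in spirit to a train step.
  def train(examples: Seq[Double]): Unit =
    examples.foreach(_ => numInstances += 1)
}

object ClosureCopyDemo {
  // Round-trip an object through Java serialization, mimicking what happens
  // when a closure and the objects it captures are sent to an executor.
  def shipToExecutor[T <: java.io.Serializable](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverSide = new Counter
    val executorSide = shipToExecutor(driverSide) // independent copy

    executorSide.train(Seq(1.0, 2.0, 3.0))

    println(s"executor copy: ${executorSide.numInstances}") // prints "executor copy: 3"
    println(s"driver object: ${driverSide.numInstances}")   // prints "driver object: 0"
  }
}
```

The update lands on the deserialized copy, so the original object still reports zero. In Spark, counts that must be visible on the driver are usually collected with an accumulator or an action such as count(), rather than by mutating a field from inside the closure.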