huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0

streamKM counter problem #94

Closed: ioanna-ki closed this 6 years ago

ioanna-ki commented 6 years ago

I am new to this kind of programming, but I think StreamKM has an issue: the train method increments a numInstances counter inside a foreachRDD. The value of numInstances is not stored, and as a result the getClusters method does not work properly. Any thoughts?

/**
  * Maintain the BucketManager for coreset extraction, given an input DStream of Example.
  * @param input a stream of instances
  */
def train(input: DStream[Example]): Unit = {
  input.foreachRDD(rdd => {
    rdd.foreach(ex => {
      bucketmanager = bucketmanager.update(ex)
      numInstances += 1
    })
  })
}
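
A possible explanation, assuming the job runs on a cluster rather than in a single local JVM: Spark serializes the function passed to rdd.foreach and runs it on the executors, so bucketmanager and numInstances are updated on deserialized copies, and the driver-side fields read by getClusters never change. The sketch below imitates that effect with plain Java serialization and no Spark. Learner, ClosureCopyDemo and shipToExecutor are hypothetical names used for illustration only; they are not part of streamDM.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a learner with mutable state, like StreamKM's numInstances.
class Learner extends Serializable {
  var numInstances: Long = 0L
  def update(): Unit = numInstances += 1
}

object ClosureCopyDemo {
  // Round-trip through Java serialization: roughly what happens to state
  // captured by a task closure before it runs on an executor.
  def shipToExecutor[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverCopy = new Learner
    val executorCopy = shipToExecutor(driverCopy)

    // The "executor" processes 150 examples and updates only its own copy.
    (1 to 150).foreach(_ => executorCopy.update())

    println(s"executor copy: ${executorCopy.numInstances}") // 150
    println(s"driver copy:   ${driverCopy.numInstances}")   // 0
  }
}
```

The driver copy stays at zero, which matches the symptom reported here: getClusters runs on the driver and sees numInstances == 0.
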

hmgomes commented 6 years ago

Hi @ioanna-ki,

numInstances is a var defined in the scope of the StreamKM class (specifically here), so updates to it are stored. Perhaps you observed a different error?

Best Regards, Heitor

ioanna-ki commented 6 years ago

Well, I call ClusteringTrainEvaluate with StreamKM as a parameter and text files as input. The thing is that inside getClusters of StreamKM, numInstances is always zero, so kmeans.cluster is never called. I didn't change the code at all. Am I calling the app wrong?

The input that I use is in this form: 1.0 1.0 1.0 2.0 2.0 1.0 etc...

hmgomes commented 6 years ago

Please show me the command line you used.

Best Regards, Heitor

ioanna-ki commented 6 years ago

./scripts/spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (TextStreamReader)"

and the input data is randomtreesampledata.txt, e.g.:

0.0 4.0,2.0,2.0,2.0,1.0,0.8060225109830121,0.9043923203457621,0.8116905753399468,0.35912458818647686,0.2256564761176031 etc...

hmgomes commented 6 years ago

Hi @ioanna-ki

Apparently the issue is related to TextStreamReader and not StreamKM. For example, using FileReader with an ARFF file should work:

./spark.sh "ClusteringTrainEvaluate -c (StreamKM) -s (FileReader -f ../data/iris.arff -k 10 -d 10 -i 150)" 1> result_iris_streamKM.txt 2> log_iris_streamKM.log

Please check if this one works and let me know.

Best Regards, Heitor

ioanna-ki commented 6 years ago

I got an error message. I attached the log here -> log_iris_streamKM.log

hmgomes commented 6 years ago

Just to confirm one thing, please provide the following details:

Infrastructure details

ioanna-ki commented 6 years ago

OK, I ran the same with socketTextStreamReader. I get SSE=680.82440 as a result for iris.arff, but the kmeans cluster method is never called and numInstances is zero. Also, if I print clpairs (the output stream of type (Example, Double)), the Example has a value, but the Double (which I am guessing is the center of the cluster) is always zero.

UPDATE: Java version "1.8.0_144", Scala version 2.11.6, Spark version 2.2.0, OS: Ubuntu 16.4, Spark Standalone cluster mode.

hmgomes commented 6 years ago

You are right, there is indeed something off with getClusters. Thanks for pointing that out, @ioanna-ki. Currently, for the given input, it assigns examples to cluster 0, probably because numInstances is 0 in getClusters (as you stated originally), so the actual clustering never gets executed. Looking at another clustering method implementation (i.e., Clustream), I believe it can be corrected by performing the clustering within the train method (i.e., in the same map where numInstances is updated).

All in all, we have some options here:

  1. If it is doable, please use Clustream while StreamKM is being reviewed and fixed.
  2. If you believe you can provide a fix for StreamKM, please create a separate issue (filling out the whole template) and later open a pull request with the actual fix.

I suggested option 2 because I will not be able to work on that in the coming days, so it might take a while before a fix is available. If you feel like contributing a solution to this, here is your opportunity :)

Best Regards, Heitor
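
One way to read the suggestion above, sketched under assumptions: keep the mutable state on the driver and feed it each micro-batch there, instead of mutating it inside rdd.foreach. Point, Summary and DriverSideLearner are hypothetical stand-ins (for Example, the BucketManager and StreamKM respectively), not streamDM's classes, and this is not necessarily how the eventual fix was implemented.

```scala
// Hypothetical stand-in for streamDM's Example.
final case class Point(features: Vector[Double])

// Hypothetical stand-in for the BucketManager: an immutable summary
// that returns a new instance on every update.
final case class Summary(points: Vector[Point] = Vector.empty) {
  def update(p: Point): Summary = copy(points = points :+ p)
}

class DriverSideLearner {
  private var summary = Summary()
  private var numInstances = 0L

  // Meant to be called on the driver with the contents of one micro-batch,
  // so both fields are updated in the JVM that later reads them.
  def trainBatch(batch: Iterator[Point]): Unit =
    batch.foreach { p =>
      summary = summary.update(p)
      numInstances += 1
    }

  def seen: Long = numInstances
  def summarySize: Int = summary.points.size
}

// Assumed wiring in Spark Streaming (not executed here; needs a DStream[Point]):
//   input.foreachRDD(rdd => learner.trainBatch(rdd.toLocalIterator))
```

The body of foreachRDD runs on the driver, so state touched there persists between batches; the cost is that every example in the batch is pulled to the driver. If only the counter were needed, rdd.count() or a LongAccumulator obtained from SparkContext.longAccumulator would avoid that transfer.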

ioanna-ki commented 6 years ago

I'll try to fix it, thanks for your time Heitor.

hmgomes commented 6 years ago

No problem, and I am glad you provided all the details and helped uncover this. I will close this now and we can create another issue later following the template (similar to this one: #84).

Best Regards, Heitor