irvingc / dbscan-on-spark

An implementation of DBSCAN running on top of Apache Spark
Apache License 2.0

java.util.NoSuchElementException: key not found while iterating over model for the second time #9

Open zbytt opened 6 years ago

zbytt commented 6 years ago

Hello! I'm rather new to the RDD approach, but I think I've found some rather strange behavior in the code. It goes as follows:

  1. Train model with DBSCAN.train()
  2. Access model.labeledPoints with an 'action' method
  3. Access model.labeledPoints again; an exception is thrown

The problem is easily reproducible using your test suite:

test("dbscan") {

val data = sc.textFile(getFile(dataFile).toString())
val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val model = DBSCAN.train(parsedData, eps = 0.3F, minPoints = 10, maxPointsPerPartition = 250)
// first eager access
model.labeledPoints.foreach(println(_))
// second access
val secondAccess = model.labeledPoints
  .map(p => (p, p.cluster))
  .collectAsMap()
  .mapValues(x => corresponding(x))

}

For the code above, all the points are printed first, and then:

18/01/10 13:00:52 ERROR Executor: Exception in task 2.0 in stage 13.0 (TID 32)
java.util.NoSuchElementException: key not found: (2,3)
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.apache.spark.mllib.clustering.dbscan.DBSCAN$$anonfun$16.apply(DBSCAN.scala:196)
    at org.apache.spark.mllib.clustering.dbscan.DBSCAN$$anonfun$16.apply(DBSCAN.scala:192)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

18/01/10 13:00:52 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 30)
java.util.NoSuchElementException: key not found: (0,2)
    (identical stack trace as above)
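
In case it helps narrow things down, here is a workaround sketch. It assumes (my guess, not confirmed) that the failure comes from Spark recomputing the model.labeledPoints lineage on the second action, so persisting the RDD before the first action should make both accesses read the same materialized partitions:

// Workaround sketch, assuming the exception is triggered by lineage
// recomputation on the second action, so caching would avoid it.
val model = DBSCAN.train(parsedData, eps = 0.3F, minPoints = 10, maxPointsPerPartition = 250)
val labeled = model.labeledPoints.cache()

// First action: materializes the RDD and populates the cache.
labeled.foreach(println(_))

// Second action: served from the cache, no recomputation.
val byPoint = labeled
  .map(p => (p, p.cluster))
  .collectAsMap()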

Is that a feature or a bug? :) Thanks & have a good day!

zbytt commented 6 years ago

Any thoughts on this problem?

lccmpn commented 6 years ago

Same here: a "java.util.NoSuchElementException: key not found" is thrown when iterating over the model a second time, but only if maxPointsPerPartition is smaller than the total number of points to be clustered.
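
To make that concrete, a small sketch of what I mean (reusing parsedData and the parameters from the repro above; the expected outcomes in the comments describe the behavior I'm seeing, not a guarantee):

// Sketch of the observation: when maxPointsPerPartition covers the whole
// dataset, no space splitting happens and repeated actions seem fine;
// when it forces splitting, the second action throws.
val total = parsedData.count().toInt

val noSplit = DBSCAN.train(parsedData, eps = 0.3F, minPoints = 10,
  maxPointsPerPartition = total)
noSplit.labeledPoints.count() // first action
noSplit.labeledPoints.count() // second action: works

val split = DBSCAN.train(parsedData, eps = 0.3F, minPoints = 10,
  maxPointsPerPartition = total / 4)
split.labeledPoints.count()   // first action
split.labeledPoints.count()   // second action: NoSuchElementException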