Match error for some datasets

barrybecker4 commented 8 years ago

This is another strange bug. It may be related to #14. I have two different versions of the titanic dataset that both contain an integer-valued "parch" column with values like 0, 1, 2, or 5, but one dataset has many other columns removed. Binning all continuous columns works on the version with only a few columns, but the version with all the columns fails when binning the "parch" column. The error is:

ERROR Executor: Exception in task 1.0 in stage 339.0 (TID 518)
scala.MatchError: (0.0,(5,[0,2],[NaN,2.0])) (of class org.apache.spark.mllib.regression.LabeledPoint)
    at org.apache.spark.mllib.feature.MDLPDiscretizer$$anonfun$7.apply(MDLPDiscretizer.scala:144)
    at org.apache.spark.mllib.feature.MDLPDiscretizer$$anonfun$7.apply(MDLPDiscretizer.scala:144)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

I have a reproducible test case and will look at finding a fix. It's odd because it suggests that there is some interaction between features in the feature vector as they get discretized. Some state must be getting carried over from one feature to the next as processing progresses.

barrybecker4 commented 8 years ago

As I delve into this, it's looking like it may not be a bug in the MDLPDiscretizer code so much as a problem with the VectorAssembler. The data is read correctly, but when I apply the VectorAssembler to build the feature vector that is passed to the discretizer, an occasional anomalous vector appears among the correct ones, and it has the same form as the value in the error above. Below are a few rows from printing the table with the vector. Look closely at the final vector column and you will see one of the anomalous values:

| Yes| McCoy; Mr. Bernard| male| NaN| 367226| 23.25| null| 3.0| Q| 2.0| 0.0| Q| 2.0|[NaN,23.25,3.0,2....|
| No|Johnson; Mr. Will...| male|19.0| LINE| 0.0| null| 3.0| S| 0.0| 0.0| S| 0.0|(5,[0,2],[19.0,3.0])|
| Yes| Keane; Miss. Nora A|female| NaN| 226593| 12.35| E101| 2.0| Q| 0.0| 0.0| Q| 2.0|[NaN,12.35,2.0,0....|
| No|Williams; Mr. How...| male| NaN| A/5 2466| 8.05| null| 3.0| S| 0.0| 0.0| S| 0.0|[NaN,8.05,3.0,0.0...|
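For readers comparing the rows: the second row's vector, `(5,[0,2],[19.0,3.0])`, is printed in Spark's `(size, indices, values)` sparse notation, while the bracketed vectors in the other rows are dense. The sketch below is plain Python rather than Spark code, and `expand_sparse` is a hypothetical helper name used only for illustration. It expands such a triple into the equivalent dense list.

```python
def expand_sparse(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list.

    Positions not listed in `indices` are implicit zeros.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The anomalous vector from the second row of the table above.
print(expand_sparse(5, [0, 2], [19.0, 3.0]))
# prints [19.0, 0.0, 3.0, 0.0, 0.0]
```

Both forms describe the same five features. Only the in-memory representation differs.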

barrybecker4 commented 8 years ago

I figured out the bug after a couple of hours of stepping through in the debugger. The anomalous value above is a sparse vector! Some of the feature vectors are represented as sparse vectors instead of dense vectors because the VectorAssembler decided that it would use less memory to represent them that way. The MDLPDiscretizer code, however, decides whether the data is dense or sparse from the first row alone and assumes every other row uses the same representation - but that is not a valid assumption, unfortunately. Now that I understand the problem, a fix should be easy.
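To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark required) of how rows in one column can end up with mixed representations, and how checking the representation per row avoids the problem. It is not the fix applied in this repository. The rule `1.5 * (nnz + 1) < size` is an assumption based on how Spark's `Vector.compressed` is commonly described, so check it against your Spark version. The tuple encoding and the names `choose_representation` and `to_dense` are hypothetical, and the first sample row is made up for illustration.

```python
def choose_representation(values):
    """Model of a memory-based choice between dense and sparse storage.

    Assumed rule: store as sparse when 1.5 * (nnz + 1) < size.
    NaN != 0.0 is True, so NaN counts as a non-zero entry.
    """
    nnz = sum(1 for v in values if v != 0.0)
    if 1.5 * (nnz + 1) < len(values):
        indices = [i for i, v in enumerate(values) if v != 0.0]
        return ("sparse", len(values), indices, [values[i] for i in indices])
    return ("dense", list(values))


def to_dense(vec):
    """Normalise one row to a dense list, inspecting that row's own tag."""
    if vec[0] == "dense":
        return vec[1]
    _, size, indices, vals = vec
    out = [0.0] * size
    for i, v in zip(indices, vals):
        out[i] = v
    return out


rows = [
    [22.0, 7.25, 3.0, 1.0, 2.0],   # hypothetical fully populated row
    [19.0, 0.0, 3.0, 0.0, 0.0],    # dense form of (5,[0,2],[19.0,3.0])
]
assembled = [choose_representation(r) for r in rows]
kinds = [v[0] for v in assembled]
print(kinds)  # the first row alone does not predict the second row's kind

# Deciding per row, instead of once from the first row, handles both kinds.
normalised = [to_dense(v) for v in assembled]
```

Converting every row to one representation before discretizing, as `to_dense` does here, is one possible approach. Matching on both vector types wherever rows are unpacked is another.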

sramirez / spark-MDLP-discretization

Match error for some datasets #16