sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0
44 stars, 27 forks

Need to handle nulls as a separate bin #11

Closed: barrybecker4 closed this issue 8 years ago

barrybecker4 commented 8 years ago

What do we do if the column to be binned contains nulls (i.e. NaN)? The resulting binning currently says nothing about nulls, but I think it needs to. There should probably be a separate null bin, though I'm not sure exactly how to do this. From my experience with QuantileDiscretizer, it seems to add NaN splits to the end of the list of splits, but I'm not sure that is right either. Maybe the first (or last) split could be NaN if any value in the column is NaN. Or maybe there should always be a NaN split at the beginning (or end), because future data may contain NaN even if the training/fitting data did not.
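To make the "always reserve a NaN bin at the end" option concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how this library or QuantileDiscretizer works: the function name `assign_bin`, the bin numbering, and the choice of half-open intervals are all assumptions made for the example.

```python
import math
from bisect import bisect_right


def assign_bin(value, splits):
    """Map a value to a bin index given sorted interior cut points.

    Hypothetical helper for illustration only. Regular values fall into
    bins 0..len(splits); the extra index len(splits) + 1 is always
    reserved for missing values (None or NaN), even if the fitting data
    contained none.
    """
    if value is None or math.isnan(value):
        return len(splits) + 1
    # bisect_right puts a value equal to a cut point into the upper bin,
    # i.e. bins are half-open intervals [lower, upper).
    return bisect_right(splits, value)


splits = [1.0, 5.0]  # assumed cut points produced by some discretizer
values = [0.5, 1.0, 7.0, float("nan")]
print([assign_bin(v, splits) for v in values])  # [0, 1, 2, 3]
```

Keeping the missing-value bin outside the sorted split list avoids putting NaN into the splits themselves, where comparisons against it are always false and sorting or binary search would behave unpredictably.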

barrybecker4 commented 8 years ago

I opened a very similar bug against the QuantileDiscretizer in Spark. Whatever they decide to do there should probably be done here too. See https://issues.apache.org/jira/browse/SPARK-17219

sramirez commented 8 years ago

OK, I think I'll wait for the solution to that Spark issue and then imitate it here.