sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0
44 stars 27 forks source link

Null values in the label (target) cause NPE #8

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

I noticed that if I specified a label column that has null values, then a NPE is thrown. Possible options for resolving:

sramirez commented 8 years ago

Missing values can be solved by many alternative solutions. There are a lot of contributions about this topic ("missing values imputation"). Taking a predefined solution in advance could be incorrect. Then I think we can give an exception with a better message. I'm going to fix this problem with the following code:

if (!data.filter(_.label == null).isEmpty())
  logError("Some null values have been found in the output column."
      + " This problem must be fixed before continuing with discretization.")  
barrybecker4 commented 8 years ago

I think the algorithm should allow two options when it comes to null values - either ignore them (the default) or give an error. If the user wants nulls treated as a separate class, client code could replace nulls with a special null string value before applying an indexer. I prefer to ignore the null label values since they are legitimate values, and I do not necessarily want to try and impute them. There could be an optional nullLabelValuePolicy param to configure this. If ignoring is what I want, and an error is given, then I really do not have a workaround. I'd be ok if without the extra null policy param if ignoring was the only behavior.

The proposed error message says that nulls were in the output column, but I think you mean labelColumn.

barrybecker4 commented 8 years ago

I will make a PR to add a new nullLabelPolicy param which will have two possible values: IGNORE_NULLS or ERROR_ON_NULLS. I will have the default be ERROR_ON_NULLS so the behavior that Sergio recently added will me maintained, but at then I will have the option to IGNORE_NULLS in the label column - which is the behavior I need in my client code.

barrybecker4 commented 8 years ago

After doing some experimentation, I think I might be changing my opinion on this. Since the label must already be an integer index (and not string) it means that a StringIndexer has already been called on the label column (assuming it was originally a string). The StringIndexer will fail if the column it is applied to has nulls. This means that the label column must already have had any null values replaced with some special string indicating null. Given this, its probably OK for the MDLPDiscretizer code just to throw an error when null values are present. From a machine learning point of view, I'm not sure if the null values need to be treated as a special value. If they do, then we might need a param that allows specifying the integer index of the null value.