barrybecker4 closed 8 years ago
I opened a very similar bug against the QuantileDiscretizer in Spark. Whatever they decide to do there should probably be done here too. See https://issues.apache.org/jira/browse/SPARK-17219
OK, I think I'll wait for that issue to be resolved and then imitate its solution.
What do we do if the column to be binned contains nulls (i.e. NaN)? The resulting binning currently says nothing about nulls, but I think it needs to. There should probably be a separate null bin, though I'm not sure exactly how to do this. In my experience, QuantileDiscretizer seems to add NaN splits to the end of the list of splits, but I'm not sure that is right either. Maybe the first (or last) split could be NaN if any of the values in the column are NaN. Or maybe there should always be a NaN split at the beginning (or end), because future data may contain NaN even if the training/fitting data did not.
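To make the "always reserve a NaN bin at the end" option concrete, here is a minimal sketch. It is plain Python rather than Spark code, and the function name `assign_bin` and the bin layout are assumptions made up for illustration; it is not what QuantileDiscretizer or this project actually does.

```python
import bisect
import math


def assign_bin(value, splits):
    """Return a bin index for value (hypothetical helper, for illustration).

    splits is a sorted list of interior thresholds. For example, [1.0, 5.0]
    defines three regular bins: (-inf, 1.0), [1.0, 5.0), [5.0, +inf).
    One extra bin, with index len(splits) + 1, is always reserved for
    missing values, even if the fitting data contained no NaN.
    """
    # Missing values (None or NaN) go to the dedicated last bin.
    if value is None or math.isnan(value):
        return len(splits) + 1
    # bisect_right counts how many thresholds are <= value,
    # which is the index of the regular bin the value falls into.
    return bisect.bisect_right(splits, value)


splits = [1.0, 5.0]
bins = [assign_bin(v, splits) for v in [0.5, 1.0, 3.0, 7.0, float("nan"), None]]
print(bins)
```

Putting the null bin last keeps the indices of the regular bins stable whether or not NaN ever appears, which is the main argument for reserving it unconditionally.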