sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0
44 stars 27 forks

Why are the column(s) to bin required to be a vector? #10

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

Hi Sergio, I wanted to better understand the motivation for making the inputColumn a vector. Is it because it is much more efficient to process all columns to be binned in one pass rather than applying the Discretizer separately to each column?

A related question: could you get different column splits if you apply it to all columns at once compared to one at a time? In other words, is there interaction between the features being binned? The QuantileDiscretizer (equal-weight discretizer) only requires a single inputColumn, and I was hoping the two discretizers would operate similarly.

I thought about adding a fitToBucketizers method on MDLPDiscretizer so that I could easily get the list of Bucketizers corresponding to the input feature columns, but that did not work well because I did not have the names of the columns in the feature vector. I will do it outside, in client code, instead.
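For anyone hitting the same mapping problem in client code: once you know the per-feature split points and the order in which the columns were assembled into the vector, pairing them back up is simple bookkeeping. A minimal sketch (plain Python with hypothetical names and split values; in real Spark client code each pair would become a `Bucketizer` with `setInputCol`/`setSplits`):

```python
# Hypothetical output of a fitted discretizer: one list of split points
# per feature, in the same order the features were assembled into the
# input vector column.
feature_names = ["age", "income", "score"]
splits_per_feature = [
    [float("-inf"), 30.0, 50.0, float("inf")],
    [float("-inf"), 40000.0, float("inf")],
    [float("-inf"), 0.5, float("inf")],
]

# Pair each feature name with its splits; in Spark client code each
# pair would drive one Bucketizer on the corresponding raw column.
per_column = list(zip(feature_names, splits_per_feature))

def bin_index(splits, x):
    """Return the bucket index for x, as a Bucketizer would for one row."""
    for i in range(len(splits) - 1):
        if splits[i] <= x < splits[i + 1]:
            return i
    return len(splits) - 2  # clamp values equal to the upper bound
```

This only works if the client keeps the column ordering used when building the vector, which is exactly the information missing inside the discretizer itself.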

sramirez commented 8 years ago

As you said, it's much more efficient to process all features at the same time. My design tries to process one feature per partition (in cases where the number of distinct points allows this kind of processing). I know this is quite different from the design of QuantileDiscretizer, but I don't agree with that design: for instance, if we ever have to implement a multivariate discretizer that takes into account the relationships between features, a single-column design won't fit. What we can do is offer both alternatives. So, I'll check your code to merge this PR.
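The one-feature-per-partition idea described above can be sketched as follows: candidate points are keyed by feature index, so each feature's points land together and its cut candidates can be evaluated independently. This is a toy Python illustration with made-up data, where `groupby` stands in for Spark's partitioning and a simple label-change midpoint rule stands in for the MDLP criterion the real implementation applies:

```python
from itertools import groupby

# (feature_index, value, label) triples, as if flattened from rows of
# a feature vector -- hypothetical toy data.
points = [
    (0, 1.0, 0), (0, 2.0, 0), (0, 3.0, 1), (0, 4.0, 1),
    (1, 10.0, 1), (1, 20.0, 0),
]

def candidate_cuts(points):
    """Group points by feature index (a stand-in for one feature per
    partition) and return midpoints where the class label changes
    between adjacent sorted values. The real discretizer scores each
    candidate with the MDLP stopping criterion instead of keeping all."""
    cuts = {}
    for feat, group in groupby(sorted(points), key=lambda p: p[0]):
        vals = sorted((v, l) for _, v, l in group)
        cuts[feat] = [
            (v1 + v2) / 2.0
            for (v1, l1), (v2, l2) in zip(vals, vals[1:])
            if l1 != l2
        ]
    return cuts
```

Because each feature's candidates are evaluated in isolation, adding more features only adds partitions; a single-column API would instead force one fitting pass per column.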