sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0
44 stars 27 forks

Why are the column(s) to bin required to be a vector? #10

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

Hi Sergio, I wanted to better understand the motivation for making the inputColumn a vector. Is it because it is much more efficient to process all columns to be binned in one pass rather than applying the Discretizer separately to each column?

A related question: could you get different column splits if you apply it to all columns at once compared to one at a time? In other words, is there interaction between the features being binned? The QuantileDiscretizer (equal-weight discretizer) only requires a single inputColumn, and I was hoping the two discretizers would operate similarly.

I thought about adding a fitToBucketizers method on MDLPDiscretizer so that I could easily get the list of Bucketizers corresponding to the input feature columns, but that did not work well because I did not have the names of the columns in the feature vector. I will do it outside, in client code, instead.
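For anyone hitting the same mapping problem in client code: once you know the per-feature split points and the order in which the columns were assembled into the vector, pairing them back up is simple bookkeeping. A minimal sketch (plain Python with hypothetical names and split values; in real Spark client code each pair would become a `Bucketizer` with `setInputCol`/`setSplits`):

```python
# Hypothetical output of a fitted discretizer: one list of split points
# per feature, in the same order the features were assembled into the
# input vector column.
feature_names = ["age", "income", "score"]
splits_per_feature = [
    [float("-inf"), 30.0, 50.0, float("inf")],
    [float("-inf"), 40000.0, float("inf")],
    [float("-inf"), 0.5, float("inf")],
]

# Pair each feature name with its splits; in Spark client code each
# pair would drive one Bucketizer on the corresponding raw column.
per_column = list(zip(feature_names, splits_per_feature))

def bin_index(splits, x):
    """Return the bucket index for x, as a Bucketizer would for one row."""
    for i in range(len(splits) - 1):
        if splits[i] <= x < splits[i + 1]:
            return i
    return len(splits) - 2  # clamp values equal to the upper bound
```

This only works if the client keeps the column ordering used when building the vector, which is exactly the information missing inside the discretizer itself.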

sramirez commented 8 years ago

As you said, it's much more efficient to process all features at the same time. My design tries to process one feature per partition (in cases where the number of distinct points allows this kind of processing). I know this is quite different from the design of QuantileDiscretizer, but I don't agree with that design: for instance, if we ever have to implement a multivariate discretizer that takes into account the relationships between features, a single-column design won't fit. What we can do is offer both alternatives. So, I'll check your code to merge this PR.
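The one-feature-per-partition idea described above can be sketched as follows: candidate points are keyed by feature index, so each feature's points land together and its cut candidates can be evaluated independently. This is a toy Python illustration with made-up data, where `groupby` stands in for Spark's partitioning and a simple label-change midpoint rule stands in for the MDLP criterion the real implementation applies:

```python
from itertools import groupby

# (feature_index, value, label) triples, as if flattened from rows of
# a feature vector -- hypothetical toy data.
points = [
    (0, 1.0, 0), (0, 2.0, 0), (0, 3.0, 1), (0, 4.0, 1),
    (1, 10.0, 1), (1, 20.0, 0),
]

def candidate_cuts(points):
    """Group points by feature index (a stand-in for one feature per
    partition) and return midpoints where the class label changes
    between adjacent sorted values. The real discretizer scores each
    candidate with the MDLP stopping criterion instead of keeping all."""
    cuts = {}
    for feat, group in groupby(sorted(points), key=lambda p: p[0]):
        vals = sorted((v, l) for _, v, l in group)
        cuts[feat] = [
            (v1 + v2) / 2.0
            for (v1, l1), (v2, l2) in zip(vals, vals[1:])
            if l1 != l2
        ]
    return cuts
```

Because each feature's candidates are evaluated in isolation, adding more features only adds partitions; a single-column API would instead force one fitting pass per column.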