sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0
44 stars 27 forks source link

Need way to control split stopping criteria #12

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

Currently the stopping criteria appears to be quite conservative. For example, if I set maxBins to 1000, it will usually just generate 2 - 5 bins. Why doesn't it generate 8 or 10 instead? I think there should be another parameter that allows the user to specify the stopping criteria. I realize that there will be diminishing returns for generating more bins, but it seems that in many cases it would be better to go further than its currently doing. The combination of both maxBins and a parameter to control when to stop splitting could give the user a lot of control over the number of bins that get as a result.

barrybecker4 commented 8 years ago

Looking through the code, I discovered the constant that controls the stopping criterion. The same constant (-1-e-5) appears in two places in the 2 evalThresholds methods: var criterion = (gain - (log2(s - 1) + delta) / s) > -1e-5 When I changed this constant to something like -1e-3 or -1e-2, I started getting more splits (which is what I wanted). How was the -1e-5 determined? It seems rather arbitrary. I need to read the paper to understand its meaning.

Here is what I propose:

barrybecker4 commented 8 years ago

I read the paper and did some experimentation. The paper said that you should make the split if the MDL criterion value is greater than 0. It seems that in practice that may result in too few splits (I'm guessing) so -1e-5 was used instead of 0. You get even more splits if a value like -1e-3 is used. Using more splits should get more accuracy, but with diminishing returns. I don't see how it could ever get worse (but might overfit). I guess the value to use may vary with the use case, so I think it should be a parameter. The name could be something like stoppingCriterion.

sramirez commented 8 years ago

That's what I've seen in Weka version of MDLP:

gain > (Utils.log2(numCutPoints) + delta) / numInstances

So we should use 0 strictly if we want to follow the original designed created by Fayyad and Irani. About the reorganization of code, I agree with you on extracting the duplicated code.

barrybecker4 commented 8 years ago

We could have the default value of the stoppingCriterion be 0 and say in the documentation that that is the value suggested in the original paper. However, setting smaller values like -1e-4 or -1e-2 relaxes it so that more bins are likely to be produced.

barrybecker4 commented 8 years ago

I am working on this in https://github.com/barrybecker4/spark-MDLP-discretization/tree/ISSUE-12-stopping-criteria-bb I will create a PR later today or tomorrow.