Need way to control split stopping criteria

barrybecker4 commented 8 years ago

Currently the stopping criteria appears to be quite conservative. For example, if I set maxBins to 1000, it will usually just generate 2 - 5 bins. Why doesn't it generate 8 or 10 instead? I think there should be another parameter that allows the user to specify the stopping criteria. I realize that there will be diminishing returns for generating more bins, but it seems that in many cases it would be better to go further than its currently doing. The combination of both maxBins and a parameter to control when to stop splitting could give the user a lot of control over the number of bins that get as a result.

barrybecker4 commented 8 years ago

Looking through the code, I discovered the constant that controls the stopping criterion. The same constant (-1-e-5) appears in two places in the 2 evalThresholds methods: var criterion = (gain - (log2(s - 1) + delta) / s) > -1e-5 When I changed this constant to something like -1e-3 or -1e-2, I started getting more splits (which is what I wanted). How was the -1e-5 determined? It seems rather arbitrary. I need to read the paper to understand its meaning.

Here is what I propose:

lets add a new (optional) stoppingCriterion param for the discretizer. If not specified, it can default to something like -1e-5 or -1e-5
There is a lot of code duplicated between the two evalThresholds methods. At a minimum I think the last part of each method could be pulled out into a common method, but I was thinking that we might actually want to make ThresholdFinder classes for the two different ways of finding thresholds. There could be ManyValuesThresholdFinder and FewValuesThresholdFinder classes that both extend a base ThresholdFinder trait. One difference between the two implementations surprised me and I'm wondering if it might be the result of fixing something in one copy but not the other. The many values version of evalThresholds has val s1 = leftFreqs.sum while the few values impl has val s1 = if (k1 > 0) leftFreqs.sum else 0. Was that intentional?

barrybecker4 commented 8 years ago

I read the paper and did some experimentation. The paper said that you should make the split if the MDL criterion value is greater than 0. It seems that in practice that may result in too few splits (I'm guessing) so -1e-5 was used instead of 0. You get even more splits if a value like -1e-3 is used. Using more splits should get more accuracy, but with diminishing returns. I don't see how it could ever get worse (but might overfit). I guess the value to use may vary with the use case, so I think it should be a parameter. The name could be something like stoppingCriterion.

sramirez commented 8 years ago

That's what I've seen in Weka version of MDLP:

gain > (Utils.log2(numCutPoints) + delta) / numInstances

So we should use 0 strictly if we want to follow the original designed created by Fayyad and Irani. About the reorganization of code, I agree with you on extracting the duplicated code.

barrybecker4 commented 8 years ago

We could have the default value of the stoppingCriterion be 0 and say in the documentation that that is the value suggested in the original paper. However, setting smaller values like -1e-4 or -1e-2 relaxes it so that more bins are likely to be produced.

barrybecker4 commented 8 years ago

I am working on this in https://github.com/barrybecker4/spark-MDLP-discretization/tree/ISSUE-12-stopping-criteria-bb I will create a PR later today or tomorrow.

sramirez / spark-MDLP-discretization

Need way to control split stopping criteria #12