Closed barrybecker4 closed 8 years ago
Looking through the code, I discovered the constant that controls the stopping criterion.
The same constant (-1-e-5) appears in two places in the 2 evalThresholds methods:
var criterion = (gain - (log2(s - 1) + delta) / s) > -1e-5
When I changed this constant to something like -1e-3 or -1e-2, I started getting more splits (which is what I wanted). How was the -1e-5 determined? It seems rather arbitrary. I need to read the paper to understand its meaning.
Here is what I propose:
val s1 = leftFreqs.sum
while the few values impl has val s1 = if (k1 > 0) leftFreqs.sum else 0
. Was that intentional?I read the paper and did some experimentation. The paper said that you should make the split if the MDL criterion value is greater than 0. It seems that in practice that may result in too few splits (I'm guessing) so -1e-5 was used instead of 0. You get even more splits if a value like -1e-3 is used. Using more splits should get more accuracy, but with diminishing returns. I don't see how it could ever get worse (but might overfit). I guess the value to use may vary with the use case, so I think it should be a parameter. The name could be something like stoppingCriterion.
That's what I've seen in Weka version of MDLP:
gain > (Utils.log2(numCutPoints) + delta) / numInstances
So we should use 0 strictly if we want to follow the original designed created by Fayyad and Irani. About the reorganization of code, I agree with you on extracting the duplicated code.
We could have the default value of the stoppingCriterion be 0 and say in the documentation that that is the value suggested in the original paper. However, setting smaller values like -1e-4 or -1e-2 relaxes it so that more bins are likely to be produced.
I am working on this in https://github.com/barrybecker4/spark-MDLP-discretization/tree/ISSUE-12-stopping-criteria-bb I will create a PR later today or tomorrow.
Currently the stopping criteria appears to be quite conservative. For example, if I set maxBins to 1000, it will usually just generate 2 - 5 bins. Why doesn't it generate 8 or 10 instead? I think there should be another parameter that allows the user to specify the stopping criteria. I realize that there will be diminishing returns for generating more bins, but it seems that in many cases it would be better to go further than its currently doing. The combination of both maxBins and a parameter to control when to stop splitting could give the user a lot of control over the number of bins that get as a result.