sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Sometimes buckets with too few instances are generated #19

Closed: barrybecker4 closed this issue 8 years ago

barrybecker4 commented 8 years ago

I think we should consider adding a param that limits the minimum number of instances in a bucket. I have seen cases where one huge bucket holds most of the data, as shown in this spinogram (image: too_many_bins100). Here I have set maxBins to 100. The problem is not so much that there are too many bins as that many of the generated bins have just a few instances in them. A reasonable default for such a new "minBinPercentage" param might be something like 0.1% of the total number of records. In other words, we would never split bins that hold less than "minBinPercentage" of the instances in the whole dataset.
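To make the symptom concrete, here is a minimal sketch that flags the bins falling below a percentage threshold. It is only an illustration, not code from this repository: the function name `undersized_bins` and the example counts are hypothetical, and only the 0.1% figure and the 81,000-row total come from this thread.

```python
from typing import List, Sequence


def undersized_bins(bin_counts: Sequence[int], min_bin_percentage: float) -> List[int]:
    """Return the indices of bins holding less than min_bin_percentage
    percent of all records. Hypothetical helper, for illustration only."""
    total = sum(bin_counts)
    if total == 0:
        return []
    # Compare count * 100 against percentage * total to avoid a division.
    return [
        i for i, count in enumerate(bin_counts)
        if count * 100.0 < min_bin_percentage * total
    ]


# Made-up counts: one huge bin plus several small ones, 81,000 records in total.
counts = [80000, 500, 300, 150, 40, 10]
print(undersized_bins(counts, 0.1))
```

With a threshold of 0, no bin is ever flagged, which matches the behavior without any limit.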

barrybecker4 commented 8 years ago

Maybe instead of having a parameter, the limit could implicitly be (10/maxBins)%.

barrybecker4 commented 8 years ago

I made a change on the branch which prevents buckets from having less than (10/maxBins)% of the data in them. It may be hard to tell from this image, but the binning seems much better to me now (image: too_small_bins100_after). None of the bins now has fewer than 81 records (which is 0.1% of the 81,000-row dataset it was run on with maxBins = 100).

barrybecker4 commented 8 years ago

When I lower maxBins to 50, it becomes more readily apparent in the image that having a lower bound on the bin weight is helping (image: too_small_bins50_after).

The (10/maxBins)% formula means that no bin will ever be smaller than 1/10 the size of a bin in an equalWeight binning of the same data with the same number of bins specified.
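The arithmetic behind that claim can be checked with a short sketch. The function names below are hypothetical and this is not the code on the branch; it only restates the (10/maxBins)% formula and compares it with the 100/maxBins percent that each bin would hold in an equal-weight binning.

```python
def implicit_min_bin_percentage(max_bins: int) -> float:
    """Implicit lower bound discussed in this thread: (10 / maxBins) percent."""
    if max_bins <= 0:
        raise ValueError("max_bins must be positive")
    return 10.0 / max_bins


def equal_weight_bin_percentage(max_bins: int) -> float:
    """Share of the data per bin if all max_bins bins had equal weight."""
    if max_bins <= 0:
        raise ValueError("max_bins must be positive")
    return 100.0 / max_bins


def min_records(total_records: int, percentage: float) -> float:
    """Convert a percentage of the dataset into a record count."""
    return total_records * percentage / 100.0


for max_bins in (100, 50):
    pct = implicit_min_bin_percentage(max_bins)
    ratio = pct / equal_weight_bin_percentage(max_bins)
    print(max_bins, pct, min_records(81000, pct), ratio)
```

For the 81,000-row dataset this gives a floor of about 81 records at maxBins = 100 and about 162 records at maxBins = 50, and the ratio to an equal-weight bin is 1/10 for any maxBins.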

sramirez commented 8 years ago

I think this change could be omitted for two reasons:

Actually, there is no rule about minimum size in the original paper.

barrybecker4 commented 8 years ago

OK. From a practical and visual standpoint it seems useful, but I have not run tests to see whether it improves the accuracy of a naive Bayes classifier. Perhaps there could be a param to set the min threshold that would default to 0? That way the idea from the original authors could be maintained while allowing people to use it or experiment with it if they wanted.

sramirez commented 8 years ago

OK, but it seems that the default isn't set yet in the last PR.

barrybecker4 commented 8 years ago

Right. I have not added it yet. I will make the changes on that PR/branch to add a minBinPercentage param. The default value of the param will be 0 - meaning that even a bin with 1 record is OK. The client code could then adjust it to whatever min percentage they wanted. 0.1% may be reasonable, but it would depend somewhat on the maxBins that they use.
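As a rough sketch of how such a param could behave (an assumption for illustration, not the code in the PR): the names `DiscretizerSettings` and `split_allowed` are hypothetical, and the check shown here rejects a candidate split when either resulting bin would fall below the configured percentage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscretizerSettings:
    """Hypothetical settings holder; only min_bin_percentage mirrors the thread."""
    min_bin_percentage: float = 0.0  # default 0: no lower bound on bin size

    def __post_init__(self) -> None:
        if not 0.0 <= self.min_bin_percentage <= 100.0:
            raise ValueError("min_bin_percentage must be between 0 and 100")


def split_allowed(left_count: int, right_count: int, total_count: int,
                  settings: DiscretizerSettings) -> bool:
    """Allow a split only if both resulting bins are non-empty and each holds
    at least min_bin_percentage percent of the whole dataset."""
    min_count = settings.min_bin_percentage / 100.0 * total_count
    smaller = min(left_count, right_count)
    return smaller > 0 and smaller >= min_count


default_settings = DiscretizerSettings()
strict_settings = DiscretizerSettings(min_bin_percentage=0.1)

print(split_allowed(1, 80999, 81000, default_settings))  # a 1-record bin is OK by default
print(split_allowed(40, 500, 81000, strict_settings))    # 40 records is under 0.1% of 81,000
```

Keeping the default at 0 leaves the original behavior untouched, so the limit only applies when client code opts in.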

barrybecker4 commented 8 years ago

I updated the PR. I added an explicit minBinPercentage parameter (with a default of 0%) instead of calculating it implicitly.