Sometimes buckets with too few instances are generated

sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)

Apache License 2.0


Sometimes buckets with too few instances are generated #19

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

I think we should consider adding a param to limit the minimum number of instances in a bucket. I have seen cases where there is one huge bucket with most of the data, as shown in this spinogram (image: too_many_bins100). Here I have set maxBins to 100. The problem is not so much that there are too many bins as that many of the generated bins have just a few instances in them. A reasonable default for such a new "minBinPercentage" might be 0.1% of the total number of records. In other words, we would never split bins that hold less than "minBinPercentage" of the instances in the whole dataset.

barrybecker4 commented 8 years ago

Maybe instead of having a parameter, the limit could implicitly be (10/maxBins)%.

barrybecker4 commented 8 years ago

I made a change on the branch which prevents buckets from having fewer than (10/maxBins)% of the data in them. It may be hard to tell from this image, but the binning seems much better to me now (image: too_small_bins100_after). None of the bins now has fewer than 81 records (which is 0.1% of the 81,000-row dataset it was run on with maxBins = 100).

barrybecker4 commented 8 years ago

When I lower maxBins to 50, it becomes more readily apparent in the image that having a lower bound on the bin weight is helping (image: too_small_bins50_after).

The (10/maxBins)% formula means that no bin will ever be smaller than 1/10 the size of a bin in an equalWeight binning of the same data with the same number of bins specified.
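
To make the arithmetic concrete, here is a minimal sketch of the implicit threshold. The names `MinBinSize`, `implicitMinBinPercentage`, and `minBinCount` are hypothetical and are not taken from the branch, and rounding the record count up is an assumption about how a fractional threshold would be treated.

```scala
// Hypothetical helper, for illustration only; not code from the branch.
object MinBinSize {

  /** Implicit lower bound on bin weight as a percentage of all records: (10 / maxBins)%. */
  def implicitMinBinPercentage(maxBins: Int): Double = {
    require(maxBins > 0, "maxBins must be positive")
    10.0 / maxBins
  }

  /**
   * Smallest record count a bin may hold under the (10 / maxBins)% rule.
   * (10 / maxBins)% of n equals n / (10 * maxBins); integer ceiling division
   * avoids floating-point rounding. Rounding up is an assumption.
   */
  def minBinCount(numRecords: Long, maxBins: Int): Long = {
    require(numRecords >= 0, "numRecords must be non-negative")
    require(maxBins > 0, "maxBins must be positive")
    val divisor = 10L * maxBins
    (numRecords + divisor - 1) / divisor
  }
}
```

For the 81,000-row dataset, `MinBinSize.minBinCount(81000L, 100)` gives 81 and `MinBinSize.minBinCount(81000L, 50)` gives 162. An equalWeight binning with 100 bins would put 810 records in each bin, so 81 is exactly one tenth of that.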

sramirez commented 8 years ago

I think this change could be omitted for two reasons:

MDLP belongs to the non-parametric family of discretizers, unlike algorithms such as EqualFrequency or EqualWidth. Adding a new parameter could therefore break the original idea of the authors.
Introducing this parameter may affect the heuristic search included in MDLP (guided by entropy).

Actually, there is no rule about minimum size in the original paper.

barrybecker4 commented 8 years ago

OK. From a practical and visual standpoint it seems useful, but I have not run tests to see whether it improves the accuracy of a Naive Bayes classifier. Perhaps there could be a param to set the min threshold that would default to 0? That way the idea of the original authors could be maintained while allowing people to use it or experiment with it if they wanted.

sramirez commented 8 years ago

OK, but it seems that the default isn't set yet in the last PR.

barrybecker4 commented 8 years ago

Right. I have not added it yet. I will make the changes on that PR/branch to add a minBinPercentage param. The default value of the param will be 0 - meaning that even a bin with 1 record is OK. The client code could then adjust it to whatever min percentage they wanted. 0.1% may be reasonable, but it would depend somewhat on the maxBins that they use.
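
As a rough sketch of how such a parameter could gate a split, assuming the check is applied to the two bins a candidate split would produce: `SplitGuard` and `splitAllowed` are hypothetical names, and the actual PR may apply the threshold differently.

```scala
// Hypothetical guard, for illustration only; not code from the PR.
object SplitGuard {

  /**
   * Returns true when both bins produced by a candidate split hold at least
   * minBinPercentage percent of totalRecords. With the default of 0.0 every
   * split passes, so the original MDLP behaviour is unchanged.
   */
  def splitAllowed(leftCount: Long,
                   rightCount: Long,
                   totalRecords: Long,
                   minBinPercentage: Double = 0.0): Boolean = {
    require(minBinPercentage >= 0.0 && minBinPercentage <= 100.0,
      "minBinPercentage must be between 0 and 100")
    // Minimum number of records, possibly fractional, that each bin must reach.
    val minWeight = totalRecords * minBinPercentage / 100.0
    leftCount >= minWeight && rightCount >= minWeight
  }
}
```

With the default, `SplitGuard.splitAllowed(1L, 80999L, 81000L)` is true, so a bin with 1 record is still accepted. With `minBinPercentage = 0.1` on the same 81,000 records, a split that leaves 50 records on one side is rejected, while one that leaves 100 on one side is accepted.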

barrybecker4 commented 8 years ago

I updated the PR. I added an explicit minBinPercentage parameter (with a default of 0%) instead of calculating it implicitly.