Closed by barrybecker4 8 years ago
Maybe instead of having a parameter, the limit could implicitly be 10/maxBins %
I made a change on the branch which prevents buckets from having less than (10/maxBins)% of the data in them. It may be hard to tell from this image, but the binning seems much better to me now. None of the bins now has fewer than 81 records (which is 0.1% of the 81,000 row dataset it was run on with maxBins = 100).
When I lower maxBins to 50, it becomes more readily apparent in the image that having a lower bound on the bin weight is helping.
The (10/maxBins)% formula means that no bin will ever be smaller than 1/10 the size of a bin in an equalWeight binning of the same data with the same number of bins specified.
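To make the arithmetic concrete, here is a small Python sketch of the implicit bound. The function names are made up for illustration and are not part of the library; only the (10/maxBins)% formula and the 81,000-row / maxBins = 100 figures come from the comments above.

```python
def implicit_min_bin_percentage(max_bins: int) -> float:
    """Implicit lower bound on bin weight, as a percentage of all records."""
    return 10.0 / max_bins


def implicit_min_bin_records(n_records: int, max_bins: int) -> float:
    """The same bound expressed as a record count.

    (10 / max_bins) / 100 * n_records, rearranged into a single division
    so the result is not affected by intermediate rounding.
    """
    return n_records / (10 * max_bins)


def equal_weight_bin_records(n_records: int, max_bins: int) -> float:
    """Size of one bin if the records were spread evenly over max_bins bins."""
    return n_records / max_bins


# Figures from the thread: 81,000 rows, maxBins = 100.
min_records = implicit_min_bin_records(81_000, 100)        # 81.0
even_bin = equal_weight_bin_records(81_000, 100)           # 810.0
print(min_records, even_bin, min_records * 10 == even_bin)
```

With maxBins = 50 the same formula gives 0.2%, so the bound scales with the bin count rather than being a fixed fraction of the data.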
I think this change could be omitted for two reasons:
Actually, there is no rule about minimum size in the original paper.
OK. From a practical and visual standpoint it seems useful, but I have not done tests to see if it improves the accuracy of a naive Bayes classifier. Perhaps there could be a param to set the min threshold that would default to 0? That way the idea from the original authors could be maintained while allowing people to use it or experiment with it if they wanted.
OK, but it seems that the default isn't set yet in the last PR.
Right. I have not added it yet. I will make the changes on that PR/branch to add a minBinPercentage param. The default value of the param will be 0 - meaning that even a bin with 1 record is OK. The client code could then adjust it to whatever min percentage they wanted. 0.1% may be reasonable, but it would depend somewhat on the maxBins that they use.
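For anyone following along, a rough Python sketch of how an explicit threshold with a default of 0 could gate a split. This is only an illustration under my own assumptions: `split_allowed` and its arguments are hypothetical names, and the check on the PR branch may be structured differently.

```python
def split_allowed(left_count: int, right_count: int, total_count: int,
                  min_bin_percentage: float = 0.0) -> bool:
    """Return True if both bins produced by a candidate split are large enough.

    min_bin_percentage is a percentage of the whole dataset (0.1 means 0.1%).
    With the default of 0, no split is ever rejected on size grounds, so a
    bin holding a single record is still acceptable.
    """
    # Compare count * 100 against percentage * total to avoid a division.
    threshold = min_bin_percentage * total_count
    return left_count * 100 >= threshold and right_count * 100 >= threshold


# Default of 0%: a 1-record bin is fine.
print(split_allowed(1, 80_999, 81_000))             # True
# With a 0.1% minimum on 81,000 rows, a 1-record bin is rejected.
print(split_allowed(1, 80_999, 81_000, 0.1))        # False
# Both sides comfortably above the minimum.
print(split_allowed(100, 500, 81_000, 0.1))         # True
```

Keeping the default at 0 leaves the behaviour identical to the original algorithm unless the caller opts in.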
I updated the PR. I added an explicit minBinPercentage parameter (with default of 0%) instead of calculating it implicitly.
I think we should consider adding a param to limit the minimum number of instances in a bucket. I have seen cases where there is one huge bucket with most of the data, as shown in this spinogram. Here I have set maxBins to 100. The problem is not so much that there are too many bins as that many of the generated bins have just a few instances in them. A reasonable default for such a new "minBinPercentage" might be something like 0.1% of the total number of records. In other words, we would never split bins that have fewer than "minBinPercentage" of the instances of the whole dataset in them.
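A quick way to check whether a binning has this problem is to list the bins that fall below a given percentage of the data. The Python sketch below is illustrative only: `undersized_bins` is a hypothetical helper and the counts are made-up toy numbers, not the dataset behind the spinogram.

```python
from typing import List, Sequence


def undersized_bins(bin_counts: Sequence[int],
                    min_bin_percentage: float) -> List[int]:
    """Return the indices of bins holding less than min_bin_percentage
    percent of all records."""
    total = sum(bin_counts)
    threshold = min_bin_percentage * total  # in units of percent * records
    return [i for i, count in enumerate(bin_counts) if count * 100 < threshold]


# Toy example: one dominant bin and several tiny ones (1,000 records total).
counts = [3, 5, 900, 2, 90]
print(undersized_bins(counts, 1.0))   # [0, 1, 3]
print(undersized_bins(counts, 0.0))   # []
```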