Open xukai92 opened 2 years ago
Thanks for volunteering!
There are multiple ways this could be handled. The method in question discretizes such that each bin has roughly the same number of items within it. However, things get tricky when boundaries happen to fall on repeated samples.
Consider trying to discretize the following into two bins with uniform counts:
[0, 1, 1, 1, 1, 1, 2, 2]
In this case, we probably want bins [0,1] and (1,2] to get 6 and 2 items per bin.
However, for the following:
[0, 0.5, 1, 1, 1, 1, 1, 2]
we probably want [0,0.5] and [1,2] to get 2 and 6 items per bin.
There probably is a nice way to handle this where some initial binning is done and then a process iterates to a better binning by narrowing bins with many elements and widening bins with very few elements.
Thanks for the explanation! I understand the problem better now!
I took a look at ScikitLearn's implementation and it looks like that they simply removes those repeated edges; see https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L226.
Also, their main implementation is much easier: https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L204 (they refer to uniform count as "quantile" as it just doing uniform width but on the quantile)---have we considered take that implementation? If yes why did we choose the current one (what's the benefit)?
Another way that I just read to handle this while still achieving uniform count is by random binning (see https://bkamins.github.io/julialang/2020/12/11/binning.html) but it looks like not compatible with out current pipeline as it doesn't get binedges first.
You are referring to this?
# Remove bins whose width are too small (i.e., <= 1e-8)
if self.strategy in ("quantile", "kmeans"):
It looks like fit
for scikit-learn decided the number of bins for you. That code there is removing bins that are too small. The code for us receives the number of bins you want as an input - that is different.
If you want a method that decides on the number of bins for you, then maybe consider BayesianBlocks.
There are a lot of possible discretization methods - feel free to submit a PR for anything new if you want the functionality.
It looks like
fit
for scikit-learn decided the number of bins for you.
Yes and no.
The fit
function only changes the required number of bins if this "binedges non-unique" situation happens, otherwise it performs with the specified number of bins.
For reference their implementation of uniformcount
equivalence in Julia to compute binedges
with nbins
is:
qs = range(0, 1; length=nbins)
binedges = quantile.(Ref(x), qs)
This implementation looks neat for me and I wonder what the advantage of the one we have in our code base is as it seems doing something more complex.
PS: In my view, when the use the "post-processing" of "remove bins whose width are too small (i.e., <= 1e-8)", users' input is like the maximum number of bins, which is kind of a sensible thing to do as default.
Using the builtin quantile
sounds great. While what goes on there is more complex than what we do, the code in the codebase approximates that. If we can avoid binning problems, then I'd say it is a fine adjustment.
quantiles = range(0, 1, length=nbins+1)
binedges = Statistics.quantile!(data, quantiles)
If you want to include an option for trimming bins, that's fine. I wouldn't do it by default in order to preserve backwards compatibility, but I am not too opinionated there.
I can do a PR with the new implementaiton with an option for trimming.
Whether or not backwards compatibility is possible on DiscretizeUniformCount
is a question though---the new implementaiton actually generates quite different binedges
, e.g.
binedges(DiscretizeUniformCount(2), [1, 2])
was [1, 2, 2]
and would become [1.0, 1.5, 2.0]
.
Looks like it's better to add a new struct called DiscretizeQuantileCount
---how do you think?
Right - ideally a uniform count algorithm for 3 bins on the data [1,2,3,4,5,6,6,6,6,6,6,6,6,6] would give something like [1,3], [4,5], [6,6+eps()] and not [1,5] [6+eps] the way the quantile method would. A DiscretizeQuantile
struct sounds great. Thank you!
PR https://github.com/sisl/Discretizers.jl/pull/34 created---plz review.
I met the "binedges non-unique" error from https://github.com/sisl/Discretizers.jl/blob/master/src/disc_uniformcount.jl#L49. I saw there is a to-do note on making the algorithm properly handle that---can someone explain what needs to be done for that? I'm happy to give a try if someone gives me some guidence---thanks!