sisl / Discretizers.jl

A Julia package for data discretization and label maps

The "binedges non-unique" error #33

Open · xukai92 opened this issue 2 years ago

xukai92 commented 2 years ago

I ran into the "binedges non-unique" error from https://github.com/sisl/Discretizers.jl/blob/master/src/disc_uniformcount.jl#L49. I saw there is a to-do note about making the algorithm handle that case properly---can someone explain what needs to be done? I'm happy to give it a try if someone gives me some guidance---thanks!

tawheeler commented 2 years ago

Thanks for volunteering!

There are multiple ways this could be handled. The method in question discretizes such that each bin has roughly the same number of items within it. However, things get tricky when boundaries happen to fall on repeated samples.

Consider trying to discretize the following into two bins with uniform counts: [0, 1, 1, 1, 1, 1, 2, 2]. In this case, we probably want bins [0,1] and (1,2], which get 6 and 2 items per bin. However, for the following: [0, 0.5, 1, 1, 1, 1, 1, 2], we probably want [0,0.5] and [1,2], which get 2 and 6 items per bin.
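
To make the failure concrete, here is a quick way to reproduce non-unique edges, with Statistics.quantile standing in for the equal-count edge computation (a sketch, not the package's actual code):

using Statistics

x = [0, 1, 1, 1, 1, 1, 2, 2]
edges = quantile(x, range(0, 1, length=4))  # 3 bins -> 4 edges
# edges == [0.0, 1.0, 1.0, 2.0]: both interior edges land on the repeated
# value 1, so the binedges are non-unique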

There probably is a nice way to handle this where some initial binning is done and then a process iterates to a better binning by narrowing bins with many elements and widening bins with very few elements.

xukai92 commented 2 years ago

Thanks for the explanation! I understand the problem better now!

I took a look at ScikitLearn's implementation, and it looks like they simply remove those repeated edges; see https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L226.

Also, their main implementation is much simpler: https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L204 (they refer to uniform count as "quantile", as it is just doing uniform width but on the quantiles)---have we considered taking that implementation? If so, why did we choose the current one (what's the benefit)?
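
In Julia terms, their repeated-edge removal amounts to something like this (my paraphrase of their post-processing, not their exact code):

# drop any edge that is (nearly) a duplicate of the previous one, mirroring
# scikit-learn's removal of bins narrower than 1e-8
keep = [true; diff(edges) .> 1e-8]
edges = edges[keep]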

xukai92 commented 2 years ago

Another way to handle this while still achieving uniform counts, which I just read about, is random binning (see https://bkamins.github.io/julialang/2020/12/11/binning.html), but it doesn't look compatible with our current pipeline, as it doesn't compute binedges first.
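
For reference, a minimal sketch of that rank-based idea (random_bin is a hypothetical helper, not part of Discretizers.jl): ties are broken at random so every bin gets an (almost) equal count, but no explicit binedges ever exist.

using Random

function random_bin(x::AbstractVector, nbins::Int; rng::AbstractRNG=Random.default_rng())
    n = length(x)
    # sort indices by (value, random key) so that equal values end up in random order
    order = sortperm(collect(zip(x, rand(rng, n))))
    bins = Vector{Int}(undef, n)
    for (rank, idx) in enumerate(order)
        bins[idx] = cld(rank * nbins, n)  # split the ranks into nbins equal-count chunks
    end
    return bins
end

random_bin([0, 1, 1, 1, 1, 1, 2, 2], 2)  # the five 1s are split randomly between the two bins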

tawheeler commented 2 years ago

You are referring to this?

# Remove bins whose width are too small (i.e., <= 1e-8)
if self.strategy in ("quantile", "kmeans"):

It looks like fit for scikit-learn decides the number of bins for you. That code there is removing bins that are too small. Our code receives the number of bins you want as an input---that is different.

If you want a method that decides on the number of bins for you, then maybe consider BayesianBlocks.

There are a lot of possible discretization methods - feel free to submit a PR for anything new if you want the functionality.

xukai92 commented 2 years ago

> It looks like fit for scikit-learn decides the number of bins for you.

Yes and no. The fit function only changes the requested number of bins when this "binedges non-unique" situation happens; otherwise it uses the specified number of bins. For reference, the Julia equivalent of their uniform-count implementation, computing binedges from nbins, is:

using Statistics

qs = range(0, 1; length=nbins+1)  # nbins bins need nbins+1 edges
binedges = quantile.(Ref(x), qs)

This implementation looks neat to me, and I wonder what the advantage of the one in our code base is, as it seems to be doing something more complex.

PS: In my view, when we use the "post-processing" of removing bins whose width is too small (i.e., <= 1e-8), the user's input acts as a maximum number of bins, which seems like a sensible default.

tawheeler commented 2 years ago

Using the builtin quantile sounds great. While what quantile does internally is more complex than what we do, the code in our codebase approximates it. If this lets us avoid the binning problems, then I'd say it is a fine adjustment.

using Statistics

quantiles = range(0, 1, length=nbins+1)
binedges = Statistics.quantile!(data, quantiles)  # note: quantile! sorts `data` in place

If you want to include an option for trimming bins, that's fine. I wouldn't do it by default in order to preserve backwards compatibility, but I am not too opinionated there.

xukai92 commented 2 years ago

I can do a PR with the new implementation, including an option for trimming. Whether backwards compatibility is possible for DiscretizeUniformCount is a question, though---the new implementation actually generates quite different binedges, e.g. binedges(DiscretizeUniformCount(2), [1, 2]) was [1, 2, 2] and would become [1.0, 1.5, 2.0]. It looks like it's better to add a new struct called DiscretizeQuantileCount---what do you think?
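
To spell out that difference (assuming the quantile-based edges are computed as above):

using Discretizers, Statistics

x = [1, 2]
binedges(DiscretizeUniformCount(2), x)  # current algorithm: [1, 2, 2]
quantile(x, range(0, 1, length=3))      # quantile-based:    [1.0, 1.5, 2.0]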

tawheeler commented 2 years ago

Right - ideally a uniform count algorithm for 3 bins on the data [1,2,3,4,5,6,6,6,6,6,6,6,6,6] would give something like [1,3], [4,5], [6,6+eps()], and not [1,5], [6,6+eps()] the way the quantile method would. A DiscretizeQuantile struct sounds great. Thank you!
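
For reference, here is what the quantile edges actually look like on that data (a quick check with Statistics.quantile):

using Statistics

x = [1,2,3,4,5,6,6,6,6,6,6,6,6,6]
quantile(x, range(0, 1, length=4))
# -> approximately [1.0, 5.33, 6.0, 6.0]: the last two edges coincide on the
#    run of 6s, so after removing repeated edges only two bins survive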

xukai92 commented 2 years ago

PR https://github.com/sisl/Discretizers.jl/pull/34 created---please review.