Open alexander-beedie opened 3 weeks ago
Yup, will take a look.
Ok; I had earlier replicated pandas' binning when bins were supplied. Our default bin constructor has always been to use (end - start).round(), i.e. use unit bins, which IMO doesn't really make sense because it doesn't take into account the data.
@alexander-beedie do you think simply mimicking pandas' implementation for our hist function is the way to go here? I set up a hypothesis test to compare to pandas which should help me get this right.
Edit: never mind, pd.cut requires the bins, but hist shouldn't. Pandas uses matplotlib.pyplot.hist(), which uses numpy.histogram(), so we can use that as a template. I'll have to get to this later today after work.
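For reference, a rough pure-Python sketch of the default edge construction being used as a template here: a fixed number of equal-width bins spanning the data's min and max. The function name `equal_width_edges` and the handling of a zero-width range are assumptions for illustration, not the actual polars or numpy code.

```python
def equal_width_edges(values, bin_count=10):
    """Return bin_count + 1 evenly spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Assumption: widen a zero-width range so the bins have non-zero width.
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bin_count
    edges = [lo + i * width for i in range(bin_count)]
    edges.append(hi)  # pin the last edge to the max to avoid float drift
    return edges


edges = equal_width_edges([0.0, 1.0, 2.0, 3.0, 4.0], bin_count=4)
print(edges)
```

The point is that the edges depend only on the data range and the requested bin count, so small-valued data still gets the number of bins asked for.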
Indeed; automatic bin count should definitely account for the data being binned.
Sounds good ✌️
So I have a pretty good implementation here, but I kind of don't like pandas' approach of lowering the bottom bin edge by 0.1% to include the left-most value. It makes the bins non-uniform in width, which is ugly.
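To make the objection concrete, here is a small sketch of that adjustment as I understand it: build equal-width edges, then move the lowest edge down by 0.1% of the data range. `pandas_style_edges` is a made-up name, and this only approximates what pandas does for an integer bin count with right-closed intervals.

```python
def pandas_style_edges(values, bin_count):
    """Equal-width edges, with the lowest edge lowered by 0.1% of the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count
    edges = [lo + i * width for i in range(bin_count)] + [hi]
    # Lower the first edge so min(values) lands inside (edges[0], edges[1]].
    edges[0] -= 0.001 * (hi - lo)
    return edges


edges = pandas_style_edges([0, 1, 2, 3, 4], bin_count=4)
widths = [right - left for left, right in zip(edges, edges[1:])]
print(edges)
print(widths)  # the first bin is slightly wider than the others
```

The minimum value is now captured, but only because the first bin is a different width from the rest.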
@alexander-beedie how do you feel about adding an include_lower flag that says "the bottom edge is inclusive," essentially making the first interval fully closed, followed by a series of half-open intervals? In other words:
include_lower=False
interval    (0, 1]   (1, 2]   (2, 3]   (3, 4]
bins      0--------1--------2--------3--------4
                   ^        ^        ^
data      0        1        2        3

include_lower=True
interval    [0, 1]   (1, 2]   (2, 3]   (3, 4]
bins      0--------1--------2--------3--------4
          ^        ^        ^        ^
data      0        1        2        3
I would make this True by default, since calling data.hist(bin_count=x) should include all data.
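A minimal sketch of the proposed semantics, using the same data and edges as the diagram. `count_right_closed` and its `include_lower` parameter are hypothetical here; this is not an existing polars API.

```python
from bisect import bisect_left


def count_right_closed(values, edges, include_lower=True):
    """Count values per right-closed bin; optionally close the first bin on the left."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v == edges[0]:
            # A value exactly on the lowest edge only counts when include_lower is set.
            if include_lower:
                counts[0] += 1
            continue
        # bisect_left returns the first i with edges[i] >= v,
        # so v lies in (edges[i - 1], edges[i]].
        i = bisect_left(edges, v)
        if 1 <= i <= len(edges) - 1:
            counts[i - 1] += 1
    return counts


data = [0, 1, 2, 3]
edges = [0, 1, 2, 3, 4]
print(count_right_closed(data, edges, include_lower=False))  # [1, 1, 1, 0]
print(count_right_closed(data, edges, include_lower=True))   # [2, 1, 1, 0]
```

With the flag off, the value 0 is dropped; with it on, all four values are counted and the bins stay uniform.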
Another reason for this is that there is a fast implementation when bins are equally spaced. But if the user supplies equally-spaced bins, and the left-most item lies on the left-most edge, then the item will be excluded.
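To illustrate the edge case, here is a sketch of the kind of arithmetic shortcut that uniform bins allow: the bin index comes straight from a division instead of a search. `fast_bin_index` is an illustrative name, and this is an assumption about the general technique rather than the code in the PR.

```python
import math


def fast_bin_index(value, start, width, bin_count):
    """Index of the right-closed bin (start + i*width, start + (i+1)*width], or None."""
    i = math.ceil((value - start) / width) - 1
    return i if 0 <= i < bin_count else None


# Bins (0, 1], (1, 2], (2, 3], (3, 4]
print(fast_bin_index(0, start=0, width=1, bin_count=4))  # None: sits on the left-most edge
print(fast_bin_index(1, start=0, width=1, bin_count=4))  # 0
print(fast_bin_index(4, start=0, width=1, bin_count=4))  # 3
```

A value exactly on the left-most edge maps to index -1, so without special handling it falls out of the histogram.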
I've decided to hold off on any new parameters, but the PR is ready for review now.
Checks
@mcrumiller: think you most recently looked at / worked on hist; care to take a look? :)
Reproducible example
Log output
Issue description
Looks like #16942 inadvertently makes it quite easy to create zero bin counts against small numbers (any case where the start/end diff rounds down here):
https://github.com/pola-rs/polars/blob/1ee6a8211ffa63cef68bac08ec4cc0a6c47e8ac7/crates/polars-ops/src/chunked_array/hist.rs#L79
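A quick Python illustration of why rounding the range gives zero bins for small values. It mirrors the `(end - start).round()` rule described in the comments, not the Rust source at the linked line, and `unit_bin_count` is a made-up name.

```python
def unit_bin_count(start, end):
    """Default bin count under the unit-bin rule: the data range, rounded."""
    return round(end - start)


print(unit_bin_count(0.0, 10.0))  # 10
print(unit_bin_count(0.1, 0.4))   # 0: the range rounds down, so there are no bins
```

Any data whose range is well under 0.5 ends up with no bins at all.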
And, if we specify a reasonable number of bins explicitly, we now get some odd-looking boundaries that are scaled outside the actual data range (and the wrong number of bins):
The equivalent bins from pandas cut look like this; right number of bins, bracketing the appropriate data range:
Expected behavior
I think the intent of #16942 was to clear up a few bin creation issues and better match the pandas behaviour around boundary creation.
Installed versions