pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.54k stars 1.88k forks source link

`hist` panics after creating zero bins #18650

Open alexander-beedie opened 3 weeks ago

alexander-beedie commented 3 weeks ago

Checks

@mcrumiller: think you most recently looked at / worked on hist; care to take a look? :)

Reproducible example

import polars as pl

df = pl.DataFrame({"value": [0.1, 0.2, 0.3]})
df["value"].hist()

Log output

PanicException: index out of bounds: the len is 0 but the index is 0

Issue description

Looks like #16942 inadvertently makes it quite easy to create zero bin counts against small numbers (any case where the start/end diff rounds down here):

https://github.com/pola-rs/polars/blob/1ee6a8211ffa63cef68bac08ec4cc0a6c47e8ac7/crates/polars-ops/src/chunked_array/hist.rs#L79

And, if we specify a reasonable number of bins explicitly, we now get some odd-looking boundaries that are scaled outside the actual data range (and the wrong number of bins):

df["value"].hist(bins=3)
# shape: (2, 3)
# ┌────────────┬─────────────┬───────┐
# │ breakpoint ┆ category    ┆ count │
# │ ---        ┆ ---         ┆ ---   │
# │ f64        ┆ cat         ┆ u32   │
# ╞════════════╪═════════════╪═══════╡
# │ 3.0        ┆ (-inf, 3.0] ┆ 3     │
# │ inf        ┆ (3.0, inf]  ┆ 0     │
# └────────────┴─────────────┴───────┘

The equivalent bins from pandas cut look like this; right number of bins, bracketing the appropriate data range:

import pandas as pd
pd.cut(df["value"].to_pandas(), bins=3).to_frame()
#              value
# 0  (0.0998, 0.167]
# 1   (0.167, 0.233]
# 2     (0.233, 0.3]

Expected behavior

I think the intent of #16942 was to clear up a few bin creation issues and better match the pandas behaviour around boundary creation.

Installed versions

``` --------Version info--------- Polars: 1.6.0 Index type: UInt32 Platform: macOS-14.5-arm64-arm-64bit Python: 3.12.5 (main, Aug 6 2024, 19:08:49) [Clang 15.0.0 (clang-1500.3.9.4)] ----Optional dependencies---- adbc_driver_manager 1.2.0 altair 5.4.1 cloudpickle 3.0.0 connectorx 0.3.3 deltalake 0.19.2 fastexcel 0.11.6 fsspec 2024.9.0 gevent 24.2.1 great_tables 0.11.0 matplotlib 3.9.2 nest_asyncio 1.6.0 numpy 2.0.2 openpyxl 3.1.5 pandas 2.2.2 pyarrow 17.0.0 pydantic 2.9.0 pyiceberg sqlalchemy 2.0.34 torch xlsx2csv 0.8.3 xlsxwriter 3.2.0 ```
mcrumiller commented 3 weeks ago

Yup, will take a look.

mcrumiller commented 3 weeks ago

Ok; I had earlier replicated pandas' binning when bins were supplied. Our default bin constructor has always been to use (end - start).round(), i.e. use unit bins, which IMO doesn't really make sense because it doesn't take into account the data.

@alexander-beedie do you think simply mimicking pandas' implementation for our hist function is the way to go here? I set up a hypothesis test to compare to pandas which should help me get this right.

Edit: nevermind, pd.cut requires the bins, but hist shouldn't. Pandas uses matplotlib.pyplot.hist() which uses numpy.hist() so we can use that as a template. I'll have to get to this later today after work.

alexander-beedie commented 3 weeks ago

Ok; I had earlier replicated pandas' binning when bins were supplied. Our default bin constructor has always been to use (end - start).round(), i.e. use unit bins, which IMO doesn't really make sense because it doesn't take into account the data.

Indeed; automatic bin count should definitely account for the data being binned.

Edit: nevermind, pd.cut requires the bins, but hist shouldn't. Pandas uses matplotlib.pyplot.hist() which uses numpy.hist() so we can use that as a template. I'll have to get to this later today after work.

Sounds good ✌️

mcrumiller commented 2 weeks ago

So I have a pretty good implementation here, but I kind of don't like pandas' approach with the lowering the bottom bin by 0.1% to include the left-most value. It means making the bins non-uniformly distributed, which is ugly.

@alexander-beedie how do you feel about adding an include_lower flag that says "the bottom edge is inclusive," essentially making the first interval fully closed, followed by a series of half-open intervals? In other words:

include_lower=False

interval   (0, 1]   (1, 2]   (2, 3]   (3, 4]
bins     0--------1--------2--------3--------4
               ^       ^        ^
data        0  1       2        3

include_lower=True

interval   [0, 1]   (1, 2]   (2, 3]   (3, 4]
bins     0--------1--------2--------3--------4
            ^  ^       ^        ^
data        0  1       2        3

I would make this True by default, since calling data.hist(bin_count=x) should include all data.

mcrumiller commented 2 weeks ago

Another reason for this is that there is a fast implementation when bins are equally spaced. But if the user supplies equally-spaced bins, and the left-most item lies on the left-most edge, then the item will be excluded.

mcrumiller commented 2 weeks ago

I've decided to hold off on any new parameters, but the PR is ready for review now.