pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

qcut throws PanicException when all values are None or NaN #10876

Open wukan1986 opened 1 year ago

wukan1986 commented 1 year ago

Reproducible example

import numpy as np
import polars as pl

df = pl.DataFrame({
    'a': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'b': [np.nan, np.nan, np.nan, np.nan, np.nan, 5, 6, 7, 8, 9],
})
print(df)

def func(df: pl.DataFrame):
    df = df.with_columns([
        pl.col('b').qcut(10).to_physical().alias('c')
    ])
    return df

out = df.group_by('a', maintain_order=True).map_groups(func)
print(out)
thread '<unnamed>' panicked at D:\a\polars\polars\crates\polars-ops\src\series\ops\cut.rs:115:54:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
Traceback (most recent call last):
  File "D:\GitHub\my_quant\tests\tt2.py", line 12, in func
    df = df.with_columns([
         ^^^^^^^^^^^^^^^^^
  File "D:\Users\Kan\miniconda3\envs\py311_1\Lib\site-packages\polars\dataframe\frame.py", line 7844, in with_columns
    .collect(no_optimization=True)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\Kan\miniconda3\envs\py311_1\Lib\site-packages\polars\utils\deprecation.py", line 95, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\Kan\miniconda3\envs\py311_1\Lib\site-packages\polars\lazyframe\frame.py", line 1695, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
Traceback (most recent call last):
  File "D:\GitHub\my_quant\tests\tt2.py", line 18, in <module>
    out = df.group_by('a', maintain_order=True).map_groups(func)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\Kan\miniconda3\envs\py311_1\Lib\site-packages\polars\dataframe\group_by.py", line 323, in map_groups
    self.df._df.group_by_map_groups(by, function, self.maintain_order)
pyo3_runtime.PanicException: ('called `Option::unwrap()` on a `None` value',)

Issue description

qcut throws a PanicException when all values in the column are None or NaN.

Expected behavior

Keep None or NaN values as they are, with to_physical returning -1 for those rows.
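
To make the expected behavior concrete, here is a minimal pure-Python sketch of quantile binning that leaves missing values missing and reports them with the physical code -1. It is an illustration only, not the polars implementation: the function name `qcut_codes`, the linear interpolation of breakpoints, and the right-closed bins are all assumptions made for this example.

```python
import math
from bisect import bisect_left


def qcut_codes(values, n_bins):
    """Toy quantile binning: one bin index per value, -1 for None/NaN."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    valid = sorted(v for v in values if not is_missing(v))
    if not valid:
        # Nothing to compute quantiles from (all missing, or empty input):
        # every row simply stays missing.
        return [-1] * len(values)

    # Breakpoints at the k/n_bins quantiles, using linear interpolation.
    breaks = []
    for k in range(1, n_bins):
        pos = k / n_bins * (len(valid) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        breaks.append(valid[lo] + (valid[hi] - valid[lo]) * (pos - lo))

    # bisect_left yields right-closed bins: (-inf, b0], (b0, b1], ...
    return [-1 if is_missing(v) else bisect_left(breaks, v) for v in values]


print(qcut_codes([float("nan")] * 5, 10))  # [-1, -1, -1, -1, -1]
print(qcut_codes([5, 6, 7, 8, 9], 5))      # [0, 1, 2, 3, 4]
```

Under this behavior, the all-NaN group from the reproducible example would yield -1 for every row instead of panicking, and an empty input would yield an empty result.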

Installed versions

```
--------Version info---------
Polars:              0.19.0
Index type:          UInt32
Platform:            Windows-10-10.0.22621-SP0
Python:              3.11.3 | packaged by Anaconda, Inc. | (main, Apr 19 2023, 23:46:34) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
matplotlib:          3.7.1
numpy:               1.24.3
pandas:              2.0.1
pyarrow:             12.0.0
pydantic:            1.10.7
sqlalchemy:          2.0.13
xlsx2csv:
xlsxwriter:
```

orlp commented 1 year ago

I'm planning to tackle this in the rework of qcut into bin_quantiles (for more info, see https://github.com/pola-rs/polars/issues/10468) by relying on a total ordering of the floats (https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp). That said, I think it would be wasted work to still fix this in the soon-to-be-deprecated qcut.
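
For readers unfamiliar with the idea, a total ordering gives every float, NaN included, a well-defined position when sorting, so quantile code never has to deal with incomparable values. The sketch below is a Python rendering of the bit trick described in the Rust `f64::total_cmp` documentation. The helper name `total_order_key` is hypothetical, and this is not code from polars or from the planned rework.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a float to an int whose natural order matches IEEE 754 totalOrder."""
    # Reinterpret the 64-bit float as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # For floats with the sign bit set, flip the lower 63 bits so that
    # larger magnitudes sort lower (e.g. -2.0 before -1.0).
    if bits < 0:
        bits ^= 0x7FFF_FFFF_FFFF_FFFF
    return bits


values = [3.0, float("nan"), -1.0, float("inf"), 0.0, -0.0]
ordered = sorted(values, key=total_order_key)
print(ordered)
```

With this key, -0.0 sorts before 0.0 and a positive NaN sorts after positive infinity, so a sort never hits an undefined comparison.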

dovahcrow commented 9 months ago

It also panics when the df is empty.

Sage0614 commented 6 months ago

Now, on all-null columns, it raises an error instead of panicking:

df = pl.DataFrame({"test": [None]})
df.with_columns(pl.col("test").qcut(5, labels=["q1", "q2", "q3", "q4", "q5"]))
  File "/home/swang/.pyenv/versions/3.11.4/lib/python3.11/site-packages/polars/dataframe/frame.py", line 7872, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swang/.pyenv/versions/3.11.4/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1700, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
polars.exceptions.ShapeError: Provide nbreaks + 1 labels
yarimiz commented 1 month ago

Has anyone found a workaround for this issue? I'm facing the same thing with qcut().over() when all values are null: pl.col(f).qcut(quantiles=3, labels=["0", "2", "4"], allow_duplicates=True).over("date")

yarimiz commented 1 month ago

After some investigation, I found that the failure happens when you pass labels to the qcut function while all the data is null.

There's already a test for full null data, but it doesn't check with labels as input:

import polars as pl
from polars.testing import assert_series_equal

# this is the existing test
def test_qcut_full_null() -> None:
    s = pl.Series("a", [None, None, None, None])

    result = s.qcut([0.25, 0.50])

    expected = pl.Series("a", [None, None, None, None], dtype=pl.Categorical)
    assert_series_equal(result, expected, categorical_as_str=True)

# the new one - it fails
def test_qcut_full_null_with_labels() -> None:
    s = pl.Series("a", [None, None, None, None])

    result = s.qcut([0.25, 0.50], labels=["1", "2", "3"])

    expected = pl.Series("a", [None, None, None, None], dtype=pl.Categorical)
    assert_series_equal(result, expected, categorical_as_str=True)

The test_qcut_full_null_with_labels test fails with the same error mentioned in this issue:

FAILED tests/unit/operations/test_qcut.py::test_qcut_full_null_with_labels - polars.exceptions.ShapeError: provide len(quantiles) + 1 labels

The specific line of code that causes the error is crates/polars-ops/src/series/ops/cut.rs#116: polars_ensure!(l.len() == breaks.len() + 1, ShapeMismatch: "provide len(quantiles) + 1 labels");

I'll try to fix it myself, but I'm not really a Rust guy.
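
As a rough illustration of why that check can reject a correct label list, the toy Python model below contrasts validating labels against the computed breaks with validating them against the requested quantiles. Everything here is hypothetical: both function names are invented for this example, the real code is Rust, and the empty `breaks_for_all_null` list is an unverified assumption about what all-null data produces. It shows one possible direction, not necessarily the fix the maintainers would choose.

```python
def check_labels_against_breaks(labels, breaks):
    """Mirror of the quoted check: label count vs. data-derived breaks."""
    if len(labels) != len(breaks) + 1:
        raise ValueError("provide len(quantiles) + 1 labels")


def check_labels_against_quantiles(labels, quantiles):
    """Hypothetical alternative: label count vs. requested quantiles."""
    if len(labels) != len(quantiles) + 1:
        raise ValueError("provide len(quantiles) + 1 labels")


quantiles = [0.25, 0.50]
labels = ["1", "2", "3"]

# ASSUMPTION: all-null data yields no usable breakpoints.
breaks_for_all_null = []

try:
    check_labels_against_breaks(labels, breaks_for_all_null)
    data_dependent_check_passed = True
except ValueError:
    data_dependent_check_passed = False

# This check does not look at the data, so all-null input cannot trip it.
check_labels_against_quantiles(labels, quantiles)
print(data_dependent_check_passed)
```

Under the stated assumption, the data-dependent check rejects three labels for two quantiles, while the quantile-based check accepts them regardless of the column's contents.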