pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`LazyFrame.collect_schema()` cannot resolve the column type after application of a `numpy` 'ufunc' #17422

Open trendelkampschroer opened 2 months ago

trendelkampschroer commented 2 months ago

Reproducible example

import numpy as np
import polars as pl

frame = pl.from_dict({"A": [1.0, 2.0, 3.0]}).lazy().with_columns(np.expm1(pl.col("A")))
frame.collect_schema()
>>>Schema([('A', Unknown)])

frame = pl.from_dict({"A": [1.0, 2.0, 3.0]}).lazy().with_columns(pl.col("A").exp() - 1.0)
frame.collect_schema()
>>>Schema([('A', Float64)])

Log output

No response

Issue description

The unknown type leads to an exception in the following rolling + group_by aggregation:


frame = pl.from_dict({
        "date": ["2001-01-01", "2001-01-02", "2001-01-03"] * 2,
        "group": ["A"] * 3 + ["B"] * 3,
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    }).with_columns(pl.col("date").str.to_date())

result = frame.lazy().rolling(
        index_column="date",
        group_by="group",
        period="2d",
    ).agg([
        pl.when(pl.col("value").is_not_null().all()).then(np.expm1(pl.col("value").log1p().sum())).alias(f"{agg}")
        for agg in ["foo", "bar", "egg"]
    ])
result.collect_schema()
>>>thread '' panicked at py-polars/src/conversion/mod.rs:241:39:
called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("cannot parse input of type 'Unknown' into Polars data type: Unknown"), traceback: Some(<traceback object at 0x175f19840>) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File ".../frame.py", line 118, in <module>
    result.collect_schema()
  File ".../lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2146, in collect_schema
    return Schema(self._ldf.collect_schema())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'TypeError'>, value: TypeError("cannot parse input of type 'Unknown' into Polars data type: Unknown"), traceback: Some(<traceback object at 0x175f19840>) }

When I remove the conditional from the aggregation, the error goes away, i.e. the following works:

result = frame.lazy().rolling(
        index_column="date",
        group_by="group",
        period="2d",
).agg([
        np.expm1(pl.col("value").log1p().sum()).alias(f"{agg}")
        for agg in ["foo", "bar", "egg"]
])
result.collect_schema()

So the absence of type information in collect_schema() when using a numpy ufunc looks relatively benign at first glance, but in the above example, it leads to an exception.

I wasn't able to simplify the example further. I'd be glad if someone more knowledgeable could look into it.

Thanks a lot for the great library and congratulations on the 1.0.0 release.

Expected behavior

I'd have expected to get Float64 in both cases.

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:           3.7.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             1.10.16
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 2 months ago

@itamarst could you take a look if it is possible to get dtypes of numpy ufuncs?