pola-rs / polars


In `cast()`, the argument `wrap_numerical` works differently on floats and integers #18546

Open etiennebacher opened 1 week ago

etiennebacher commented 1 week ago

Reproducible example

import polars as pl

df = pl.DataFrame({"float": [100.0, 300], "int": [100, 300]})

df.with_columns(
    wrapped_float=pl.col("float").cast(pl.UInt8, wrap_numerical=True),
    wrapped_int=pl.col("int").cast(pl.UInt8, wrap_numerical=True),
)

Log output

shape: (2, 4)
┌───────┬─────┬───────────────┬─────────────┐
│ float ┆ int ┆ wrapped_float ┆ wrapped_int │
│ ---   ┆ --- ┆ ---           ┆ ---         │
│ f64   ┆ i64 ┆ u8            ┆ u8          │
╞═══════╪═════╪═══════════════╪═════════════╡
│ 100.0 ┆ 100 ┆ 100           ┆ 100         │
│ 300.0 ┆ 300 ┆ 255           ┆ 44          │
└───────┴─────┴───────────────┴─────────────┘

Issue description

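In `cast()`, the argument `wrap_numerical` gives a different output depending on whether the input column is float or int: with `wrap_numerical=True`, the float value 300.0 is clamped to 255 (the `UInt8` maximum), while the integer value 300 wraps around to 44 (300 mod 256).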
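
To make the two results in the table above concrete, here is a small plain-Python sketch of the two behaviours. The helper names `saturate_u8` and `wrap_u8` are made up for illustration and are not part of Polars; the sketch only reproduces the arithmetic that is consistent with the observed output, not Polars' internals.

```python
U8_MAX = 255  # largest value representable in an unsigned 8-bit integer


def saturate_u8(value: int) -> int:
    """Clamp into [0, 255]; consistent with the 255 seen for the float column."""
    return max(0, min(U8_MAX, value))


def wrap_u8(value: int) -> int:
    """Reduce modulo 256; consistent with the 44 seen for the int column."""
    return value % (U8_MAX + 1)


print(saturate_u8(300))  # 255
print(wrap_u8(300))      # 44
```

Values that already fit in `UInt8`, such as 100, come out unchanged under both helpers, which is why only the second row of the table differs.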

Expected behavior

I would expect the value to wrap in both cases, so that both columns give 44 for an input of 300.

Installed versions

```
--------Version info---------
Polars: 1.6.0
Index type: UInt32
Platform: Windows-10-10.0.19045-SP0
Python: 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec 2023.6.0
gevent
great_tables
matplotlib 3.7.1
nest_asyncio 1.5.6
numpy 1.24.3
openpyxl
pandas 2.0.3
pyarrow 12.0.1
pydantic 2.6.4
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

orlp commented 1 week ago

This is going to be a bit tricky: it involves `f64::trunc` and then, I think, manually shifting the mantissa, being careful to handle shifts larger than the width.

EDIT: actually, the simplest implementation, for now at least, will be to just use `x.trunc().rem_euclid(uT::MAX + 1) as uT`.
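
For readers following along in Python, the truncate-then-Euclidean-remainder idea from the EDIT can be sketched roughly as below. This is an illustration only, not the Polars implementation: `wrap_float_to_uint` is a hypothetical helper, the `bits` parameter stands in for the target type `uT`, and returning `None` for NaN and infinities is an assumption, since the thread does not say how those should be handled.

```python
import math
from typing import Optional


def wrap_float_to_uint(x: float, bits: int = 8) -> Optional[int]:
    """Truncate x toward zero, then wrap it into [0, 2**bits)."""
    if not math.isfinite(x):
        # Assumption: non-finite inputs have no wrapped value.
        return None
    modulus = 1 << bits  # plays the role of uT::MAX + 1
    # math.trunc drops the fractional part (rounds toward zero).
    # With a positive modulus, Python's % never returns a negative
    # result, so it behaves like a Euclidean remainder here.
    return math.trunc(x) % modulus


print(wrap_float_to_uint(300.0))  # 44
print(wrap_float_to_uint(-1.5))   # 255
```

Under this sketch the float column from the example would give 44 for 300.0, matching the integer column. Note that the sketch takes the remainder on an exact Python integer, whereas the Rust expression in the comment takes it on the float before casting.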