pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.21k stars 1.95k forks source link

uint overflow #19522

Closed eromoe closed 1 week ago

eromoe commented 1 week ago

Checks

Reproducible example

pl.select(pl.lit(48).cast(pl.UInt32)-pl.lit(850).cast(pl.UInt32))

Log output

literal
u32
4294966494

Issue description

I am checking agg result and find this problem,

    df_agg = df.group_by('datetime').agg(
        uc=pl.col('hit_upper').sum(),
        dc=pl.col('hit_lower').sum(),
    ).with_columns(
        ud_ratio=((pl.col('uc')+1)/(pl.col('dc')+1)).log(), # avoid divide zeroe
        ud_diff=(pl.col('uc')-pl.col('dc')),
    ).collect().to_pandas()

Expected behavior

Detect and avoid overflow, automatically convert pl.Uint to pl.Int

Installed versions

``` --------Version info--------- Polars: 1.9.0 Index type: UInt32 Platform: Windows-10-10.0.19041-SP0 Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:40:08) [MSC v.1938 64 bit (AMD64)] ----Optional dependencies---- adbc_driver_manager altair cloudpickle 3.0.0 connectorx deltalake fastexcel fsspec 2024.3.1 gevent 24.2.1 great_tables matplotlib 3.8.4 nest_asyncio 1.6.0 numpy 1.24.4 openpyxl 3.1.2 pandas 2.2.2 pyarrow 15.0.2 pydantic 2.6.4 pyiceberg sqlalchemy 2.0.29 torch xlsx2csv xlsxwriter ```
cmdlineluser commented 1 week ago

I think .len() was the last time this topic came up.

https://github.com/pola-rs/polars/issues/17722#issuecomment-2244468461

However, I do think that returning i64 for len is something that we should investigate for 2.0. It is a footgun.

(Although that was just for .len())

orlp commented 1 week ago

Not a bug, something we're thinking about perhaps changing for Polars 2.0, but current semantics for addition/subtraction is that they're wrapping.

eromoe commented 1 week ago

Not relate to len , my data loading by pandas would not contain uint type, only in polars. Yes, the the values do not contain negative, polars assume their dtypes be uint is fine. But subtruction should not follow this .