pl.UInt32 subtraction leading to weird results

miroslaavi commented 3 months ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.DataFrame(
    {
        "A": [21506, 1719, 1983, 5736],
        "B": [21546, 1711, 1918, 5861],
    },
    schema={"A": pl.UInt32, "B": pl.UInt32}
).with_columns(
    (pl.col("A") - pl.col("B")).alias("diff")
)

Out:

┌───────┬───────┬────────────┐
│ A     ┆ B     ┆ diff       │
│ ---   ┆ ---   ┆ ---        │
│ u32   ┆ u32   ┆ u32        │
╞═══════╪═══════╪════════════╡
│ 21506 ┆ 21546 ┆ 4294967256 │
│ 1719  ┆ 1711  ┆ 8          │
│ 1983  ┆ 1918  ┆ 65         │
│ 5736  ┆ 5861  ┆ 4294967171 │
└───────┴───────┴────────────┘

Log output

No response

Issue description

I have a LazyFrame workflow that casts some values to pl.UInt32 and noticed that the subtraction leading to strange results as shown in the above example.

Expected behavior

Subtraction to work normally

Installed versions

```

polars.config.Config

shape: (6, 12) Category | revenue | units | price | Category_LY | revenue_LY | units_LY | price_LY | volume impact | mix impact | price impact | total impact -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- str | f64 | u32 | f64 | str | f64 | u32 | f64 | f64 | f64 | f64 | f64 "Bracelets" | 332726.735364 | 5736 | 58.006753 | "Bracelets" | 344314.626605 | 5861 | 58.746737 | 1.7974e11 | 7.2579e10 | -4244.549093 | 2.5232e11 "Rings" | 238523.950732 | 4255 | 56.057333 | "Rings" | 235330.836392 | 4190 | 56.164877 | 2720.1274 | 930.589632 | -457.602693 | 3193.114339 "Other" | 24464.686408 | 2186 | 11.191531 | "Other" | 24799.855995 | 2157 | 11.497383 | 1213.595302 | -880.171183 | -668.593705 | -335.169587 "Charms" | 737751.979534 | 21506 | 34.304472 | "Charms" | 740521.83544 | 21546 | 34.369342 | 1.7974e11 | -3.2121e10 | -1395.08224 | 1.4762e11 "Necklaces & Pendants" | 119051.138037 | 1719 | 69.256043 | "Necklaces & Pendants" | 117704.075248 | 1711 | 68.792563 | 334.784911 | 215.555593 | 796.722285 | 1347.062789 "Earrings" | 104449.774471 | 1983 | 52.672604 | "Earrings" | 101736.810394 | 1918 | 53.043175 | 2720.1274 | 727.679 | -734.842324 | 2712.964077

shape: (4, 3)
┌───────┬───────┬────────────┐
│ A     ┆ B     ┆ diff       │
│ ---   ┆ ---   ┆ ---        │
│ u32   ┆ u32   ┆ u32        │
╞═══════╪═══════╪════════════╡
│ 21506 ┆ 21546 ┆ 4294967256 │
│ 1719  ┆ 1711  ┆ 8          │
│ 1983  ┆ 1918  ┆ 65         │
│ 5736  ┆ 5861  ┆ 4294967171 │
└───────┴───────┴────────────┘

--------Version info---------
Polars:               0.20.17
Index type:           UInt32
Platform:             Windows-10-10.0.22621-SP0
Python:               3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            0.8.0
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.6.0
numpy:                1.26.3
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0

```

ritchie46 commented 3 months ago

That's underflow. We don't check for over/underflow for performance reasons. This is similar to what numpy does:

np.array([1, 2, 3], dtype=np.uint32) - 10
array([4294967287, 4294967288, 4294967289], dtype=uint32)

So cast to signed integers when doing arithmetic and subtraction.

miroslaavi commented 3 months ago

Got it, thank you. Just general feedback from non-technical user that this might be confusing at times especially if not familiar with the differences of pl.Int32 and pl.UInt32.

For instance in my case a group_by("category").agg(pl.col("Item ID").count()) leads to the type pl.UInt32, so would it be better to be defaulting to pl.Int32 to avoid such cases?

reswqa commented 3 months ago

For instance in my case a group_by("category").agg(pl.col("Item ID").count()) leads to the type pl.UInt32, so would it be better to be defaulting to pl.Int32 to avoid such cases?

Count returning signed types is even more mind-boggling to me, and would make the range it can represent smaller.

pola-rs / polars