pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.09k stars 1.83k forks source link

pl.list,len() - pl.list,len() always returning u32 no matter the results #17722

Open atigbadr opened 1 month ago

atigbadr commented 1 month ago

Checks

Reproducible example

import polars as pl
df = pl.DataFrame({"foo": [["a", "b"], ["a", "b", "c"]], "bar": [["a", "b", "c"], ["a", "b"]]})
df.select((pl.col("foo").list.len() - pl.col("bar").list.len().alias("difference")))

#shape: (2, 1)
#┌────────────┐
#│ foo        │
#│ ---        │
#│ u32        │
#╞════════════╡
#│ 4294967295 │
#│ 1          │
#└────────────┘

df.select((pl.col("foo").list.len() - pl.col("bar").list.len().cast(pl.Float32).alias("difference")))
#shape: (2, 1)
#┌──────┐
#│ foo  │
#│ ---  │
#│ f64  │
#╞══════╡
#│ -1.0 │
#│ 1.0  │
#└──────┘

This can be problematic for users assuming it would infer the right type depending on the out come, i'm not sure if this is the right behaviour or not, but in either case it could cause a lot of issues/unexpected outcomes.

Log output

No response

Issue description

Subtracting the len of two lists always results in u32

Expected behavior

Subtracting the len of two lists should infer the type correctly, especially if the both lists contains lengths that are bigger than other and vice versa

Installed versions

``` --------Version info--------- Polars: 1.2.1 Index type: UInt32 Platform: Linux-6.9.7-100.fc39.x86_64-x86_64-with-glibc2.38 Python: 3.12.2 (main, Feb 12 2024, 23:51:02) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: great_tables: hvplot: matplotlib: nest_asyncio: numpy: openpyxl: pandas: pyarrow: pydantic: pyiceberg: sqlalchemy: torch: xlsx2csv: xlsxwriter: ```
Julian-J-S commented 1 month ago

This is expected!

A function should have a consistent output type and not change the type depending on the data.

If you require a different type you should be explicit abount it.

In your example you shoud cast to a signed intger like "Int8, Int32, ..." depending on your requirements 😉

ggggggggg commented 1 month ago

Agreed that the return type should be constant, but it should be a signed integer so that people don't shoot themselves in the foot when they subtract lengths.

deanm0000 commented 1 month ago

It should probably raise invalid operation for substraction on uint.

ritchie46 commented 1 month ago

Subtraction on unsigned integers is a valid operation.

ggggggggg commented 1 month ago

I said before len should return a signed integer. Here I will elaborate. The use of unsigned integers is a giant footgun waiting to surprise the unsuspecting because it always violates the rule that 'a-b=-(b-a). So code will work for a while as long as you start by subtracting the smaller thing, but then break when you subtract the bigger thing. The Python API at least should not use unsigned integers at all IMHO. Just always return a 64 bit signed int, for len in the python API, there is no plausible circumstance where it will overflow since the upper limit is well beyond the total number of bytes of storage all humans together have produced.

Returning a signed int is consistent with the python len function, which is what most python users will have experience with. Matching the behavior of the built in len function will provide the least surprise in the python API.

For rust it is at least plausible that it makes sense to use unsigned ints since maybe there are use cases that get a speed advantage from that factor of 2. But in the python API you're already in python, speed isn't the goal. Ease of use and predictability are the goals, and signed integers come out WAY ahead in those departments.

The eigen C++ library gives a good argument for the use of signed integers to avoid bugs:

At a first glance it is indeed very tempting to use an unsigned type for values that are expected to be positive or zero. However, unsigned types relies on modulo 2^k arithmetic, which is definitely not what we want here. Moreover, in many places we can have negative indices and offsets. Mixing signed and unsigned integers is costly, error prone, and highly correlated to bugs. As a result, it is much simpler and safer to use signed integers everywhere, and reserve the usage of unsigned integers for bitwise manipulation and modulo 2^k arithmetic.

Hm... but the STL use unsigned integers everywhere. Right, but this has been acknowledge as a mistake by several members of the C++ standardization committee and C++ experts (Bjarne Stroustrup, Herb Sutter, Chandler Carruth, Eric Niebler, etc.). Sources:

source

ritchie46 commented 1 month ago

But in the python API you're already in python, speed isn't the goal.

I can assure you, speed is a goal for this library.

I do agree this is problematic and I have been thinking about this for a while. My main problem is, is that we use unsigned integers for gather operations, which are common under the hood. I don't want to change that, as it has huge performance implications.

However, I do think that returning i64 for len is something that we should investigate for 2.0. It is a footgun.

deanm0000 commented 1 month ago

@ritchie46 what about the pyarrow approach of having checked and unchecked versions of functions? That way users that want the performance of unchecked can be responsible for overflows while giving the option to other users to have checked operations that raise with an overflow.

ritchie46 commented 1 month ago

That would lead to huge compiler bloat as you double all your arithmetic kernels. I am not convinced this is worth the extra binary size. For instance numpy also doesn't any checked arithmetic.