pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.63k stars 1.99k forks source link

Protection against integer overflows #8595

Open tzeitim opened 1 year ago

tzeitim commented 1 year ago

Problem description

It sometimes happens in large workflows that integers are downcasted and future computations suffer from integer overflows (non-polars issue I would say).

I am not sure about the ease of implementation or whether it would be costly computationally but it would be desirable to throw a warning when some integer overflow occurred.

In the next example I force one integer overflow.

In this example I insert numeric variable using .lit() and increase it's size by multiplying it by increasingly larger integer numbers until it collapses. What's interesting is that when this number has a larger capacity (like the 10000.0) this doesn't happen. I believe that the class of the first number is inherited downstream.

Would it be reasonable to have a warning for cases like this?

import polars as pl
df = pl.DataFrame({'b':500000})
a_number_that_came_from_another_place = 323566
(df
    .with_columns(pl.lit(a_number_that_came_from_another_place).alias('a'))
    .with_columns([

    (10 * pl.col('a')/pl.col('b')).alias('1'),
    (100 * pl.col('a')/pl.col('b')).alias('2'),
    (1000 * pl.col('a')/pl.col('b')).alias('3'),
    (10000 * pl.col('a')/pl.col('b')).alias('4'),
    (100000 * pl.col('a')/pl.col('b')).alias('5'),
    (100000.0 * pl.col('a')/pl.col('b')).alias('5_float'),
    ]

    )
)
shape: (1, 8)
┌────────┬────────┬─────────┬─────────┬─────────┬──────────────┬──────────────┬─────────┐
│ b      ┆ a      ┆ 1       ┆ 2       ┆ 3       ┆ 4            ┆ 5            ┆ 5_float │
│ ---    ┆ ---    ┆ ---     ┆ ---     ┆ ---     ┆ ---          ┆ ---          ┆ ---     │
│ i64    ┆ i32    ┆ f64     ┆ f64     ┆ f64     ┆ f64          ┆ f64          ┆ f64     │
╞════════╪════════╪═════════╪═════════╪═════════╪══════════════╪══════════════╪═════════╡
│ 500000 ┆ 323566 ┆ 6.47132 ┆ 64.7132 ┆ 647.132 ┆ -2118.614592 ┆ -4006.276736 ┆ 64713.2 │
└────────┴────────┴─────────┴─────────┴─────────┴──────────────┴──────────────┴─────────┘
ritchie46 commented 1 year ago

This would be very expensive to do by default in arithmetic. But we could expose checked variants. E.g. checked_add, checked_div etc.

alexander-beedie commented 1 year ago

we could expose checked variants.

Or a param on the existing methods? (Which would be tidier in terms of API surface, though also not as auto/tab-complete friendly... 🤔)

For example:

pl.col("x").add( pl.col("y"), checked=True )
pl.col("x").add_checked( pl.col("y") )
ritchie46 commented 1 year ago

In rust those methods already exist in std. In python we could use kwargs, but I believe the function would chain more nicely.

Have to think about this for a bit

avimallu commented 1 year ago

While this is being discussed, I want to bring up a nasty surprise which may not be obvious to users that need to find the difference between two "count" operations (which are assigned pl.UInt32) where integer overflow protection will help greatly:

>>> import polars as pl
>>> df1 = pl.DataFrame({
...     "A": [1, 2, 2, 3, 3, 3]})
>>> df2 = pl.DataFrame({
...     "B": [1, 1, 1, 2, 3, 3]})
>>> 
>>> (
...     df1
...     .groupby("A")
...     .count()
...     .join(df2
...           .groupby("B")
...           .count(), how="left", left_on="A", right_on="B")
...     .with_columns(
...         pl.col("count").sub(pl.col("count_right")).alias("difference"))
... )
shape: (3, 4)
┌─────┬───────┬─────────────┬────────────┐
│ A   ┆ count ┆ count_right ┆ difference │
│ --- ┆ ---   ┆ ---         ┆ ---        │
│ i64 ┆ u32   ┆ u32         ┆ u32        │
╞═════╪═══════╪═════════════╪════════════╡
│ 1   ┆ 1     ┆ 3           ┆ 4294967294 │
│ 2   ┆ 2     ┆ 1           ┆ 1          │
│ 3   ┆ 3     ┆ 2           ┆ 1          │
└─────┴───────┴─────────────┴────────────┘

I understand the rationale behind the output being an unsigned integer - a count obviously cannot be negative.

Here's why overflow protection is more important in this case: the average user doesn't really expect that the output of count will be an unsigned integer (fewer will even know what it is, and even more so since it's not documented). In a large dataframe, this overflow will happen so infrequently that a casual glance (a head or just display) will miss it.

theelderbeever commented 1 year ago

At the risk of asking a dumb question.... would compiling polars with overflow protection enabled be a possible solution? Since compiling in release doesn't do runtime checks for overflow by default. This would only apply to the python library most likely but that crowd might be the ones more likely to get surprised by this.

https://doc.rust-lang.org/cargo/reference/profiles.html#overflow-checks

aleewen commented 1 month ago

I ran into this issue recently. It would be great if either an exception was raised or casting to a signed Int was automatically applied. I would assume that raising an exception would be more appropriate. This can be really dangerous for allowing incorrect, unseen data to seep through the cracks.

import polars as pl

df = pl.DataFrame(data={"a": [1], "b": [2]}, schema={"a": pl.UInt32, "b": pl.UInt32})
df.select(pl.col("a") - pl.col("b"))

>>> shape: (1, 1)
┌────────────┐
│ a          │
│ ---        │
│ u32        │
╞════════════╡
│ 4294967295 │
└────────────┘