Open tzeitim opened 1 year ago
This would be very expensive to do by default in arithmetic. But we could expose checked
variants. E.g. checked_add
, checked_div
etc.
we could expose checked variants.
Or a param on the existing methods? (Which would be tidier in terms of API surface, though also not as auto/tab-complete friendly... 🤔)
For example:
pl.col("x").add( pl.col("y"), checked=True )
pl.col("x").add_checked( pl.col("y") )
In rust those methods already exist in std. In python we could use kwargs, but I believe the function would chain more nicely.
Have to think about this for a bit
While this is being discussed, I want to bring up a nasty surprise which may not be obvious to users that need to find the difference between two "count" operations (which are assigned pl.UInt32
) where integer overflow protection will help greatly:
>>> import polars as pl
>>> df1 = pl.DataFrame({
... "A": [1, 2, 2, 3, 3, 3]})
>>> df2 = pl.DataFrame({
... "B": [1, 1, 1, 2, 3, 3]})
>>>
>>> (
... df1
... .groupby("A")
... .count()
... .join(df2
... .groupby("B")
... .count(), how="left", left_on="A", right_on="B")
... .with_columns(
... pl.col("count").sub(pl.col("count_right")).alias("difference"))
... )
shape: (3, 4)
┌─────┬───────┬─────────────┬────────────┐
│ A ┆ count ┆ count_right ┆ difference │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═══════╪═════════════╪════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 4294967294 │
│ 2 ┆ 2 ┆ 1 ┆ 1 │
│ 3 ┆ 3 ┆ 2 ┆ 1 │
└─────┴───────┴─────────────┴────────────┘
I understand the rationale behind the output being an unsigned integer - a count obviously cannot be negative.
Here's why overflow protection is more important in this case: the average user doesn't really expect that the output of count
will be an unsigned integer (fewer will even know what it is, and even more so since it's not documented). In a large dataframe, this overflow will happen so infrequently that a casual glance (a head
or just display
) will miss it.
At the risk of asking a dumb question.... would compiling polars with overflow protection enabled be a possible solution? Since compiling in release doesn't do runtime checks for overflow by default. This would only apply to the python library most likely but that crowd might be the ones more likely to get surprised by this.
https://doc.rust-lang.org/cargo/reference/profiles.html#overflow-checks
I ran into this issue recently. It would be great if either an exception was raised or casting to a signed Int was automatically applied. I would assume that raising an exception would be more appropriate. This can be really dangerous for allowing incorrect, unseen data to seep through the cracks.
import polars as pl
df = pl.DataFrame(data={"a": [1], "b": [2]}, schema={"a": pl.UInt32, "b": pl.UInt32})
df.select(pl.col("a") - pl.col("b"))
>>> shape: (1, 1)
┌────────────┐
│ a │
│ --- │
│ u32 │
╞════════════╡
│ 4294967295 │
└────────────┘
Problem description
It sometimes happens in large workflows that integers are downcasted and future computations suffer from integer overflows (non-polars issue I would say).
I am not sure about the ease of implementation or whether it would be costly computationally but it would be desirable to throw a warning when some integer overflow occurred.
In the next example I force one integer overflow.
In this example I insert numeric variable using
.lit()
and increase it's size by multiplying it by increasingly larger integer numbers until it collapses. What's interesting is that when this number has a larger capacity (like the10000.0
) this doesn't happen. I believe that the class of the first number is inherited downstream.Would it be reasonable to have a warning for cases like this?