pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
28.67k stars 1.79k forks source link

Subtracting an unsigned integer and performing an upcast #11465

Open stevenlis opened 10 months ago

stevenlis commented 10 months ago

Description

I tried using value_counts(), which returns a u32 counts column, to compare two columns. However, I noticed that there is no upcasting when the subtraction results in a negative number. Is this intentional?

import polars as pl

pl.DataFrame(
    data={'a': [1, 1, 1], 'b': [2, 2, 2]},
    schema={'a': pl.UInt32, 'b': pl.UInt32}
).with_columns(
    pl.col('a').sub(pl.col('b')).alias('diff')
)
shape: (3, 3)
┌─────┬─────┬────────────┐
│ a   ┆ b   ┆ diff       │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ u32        │
╞═════╪═════╪════════════╡
│ 1   ┆ 2   ┆ 4294967295 │
│ 1   ┆ 2   ┆ 4294967295 │
│ 1   ┆ 2   ┆ 4294967295 │
└─────┴─────┴────────────┘
mcrumiller commented 10 months ago

Polars follows Rusts conventions that counts/indices are unsigned as they cannot in principle be negative. This does lead to situations where comparisons via difference can cause overflow. So yes, this is intentional--you must be wary of your dtypes when performing subtraction--but it's not (as far as I know) documented, and it possibly should be.

I think for functions like value_counts, rank, pl.count, row_count, etc. that return unsigned integers, we should probably warn the user in the documentation.

chandler20708 commented 10 months ago

@mcrumiller

I think for functions like value_counts, rank, pl.count, row_count, etc. that return unsigned integers, we should probably warn the user in the documentation.

If this is a documentation issue and is accepted, I would like to work on it as the first issue

mcrumiller commented 10 months ago

@chandler20708 go for it! Anyone is welcome to make a PR, our overlords will just ensure that it is properly reviewed prior to acceptance.

I probably did not provide an exhaustive list of all such functions, so feel free to do some more API snooping to find more.

stevenlis commented 10 months ago

@mcrumiller @chandler20708 Actually, I think we should consider adding a return_dtype parameter to the value_counts() function. If Polars ever introduces a normalized option in value_counts(), the data type of the "counts" column will no longer be UInt32, I believe.

mcrumiller commented 10 months ago

@stevenlis yes, but that does not require a return_dtype parameter. We have functions whose return dtype depends on the input parameters, as does pl.cut returning categorical data when include_breaks=True. This is one of those cases where I think it's up to the user to perform the cast to signed if they want to do subtraction.

cmdlineluser commented 9 months ago

An interesting place this popped up was having pl.count() in the wrong place inside .repeat_by()

# Don't run this
# pl.select(pl.lit(1).repeat_by(1 - (pl.count() + 2)))

Which ended up trying to repeat something 4 billion times.

pl.select(1 - (pl.count() + 2))
# shape: (1, 1)
# ┌────────────┐
# │ literal    │
# │ ---        │
# │ u32        │
# ╞════════════╡
# │ 4294967295 │
# └────────────┘