Open stevenlis opened 10 months ago
Polars follows Rusts conventions that counts/indices are unsigned as they cannot in principle be negative. This does lead to situations where comparisons via difference can cause overflow. So yes, this is intentional--you must be wary of your dtypes when performing subtraction--but it's not (as far as I know) documented, and it possibly should be.
I think for functions like value_counts
, rank
, pl.count
, row_count
, etc. that return unsigned integers, we should probably warn the user in the documentation.
@mcrumiller
I think for functions like
value_counts
,rank
,pl.count
,row_count
, etc. that return unsigned integers, we should probably warn the user in the documentation.
If this is a documentation issue and is accepted, I would like to work on it as the first issue
@chandler20708 go for it! Anyone is welcome to make a PR, our overlords will just ensure that it is properly reviewed prior to acceptance.
I probably did not provide an exhaustive list of all such functions, so feel free to do some more API snooping to find more.
@mcrumiller @chandler20708 Actually, I think we should consider adding a return_dtype
parameter to the value_counts()
function. If Polars ever introduces a normalized
option in value_counts()
, the data type of the "counts" column will no longer be UInt32
, I believe.
@stevenlis yes, but that does not require a return_dtype
parameter. We have functions whose return dtype depends on the input parameters, as does pl.cut returning categorical data when include_breaks=True
. This is one of those cases where I think it's up to the user to perform the cast to signed if they want to do subtraction.
An interesting place this popped up was having pl.count()
in the wrong place inside .repeat_by()
# Don't run this
# pl.select(pl.lit(1).repeat_by(1 - (pl.count() + 2)))
Which ended up trying to repeat something 4 billion times.
pl.select(1 - (pl.count() + 2))
# shape: (1, 1)
# ┌────────────┐
# │ literal │
# │ --- │
# │ u32 │
# ╞════════════╡
# │ 4294967295 │
# └────────────┘
Description
I tried using
value_counts()
, which returns au32
counts
column, to compare two columns. However, I noticed that there is no upcasting when the subtraction results in a negative number. Is this intentional?