Closed romiof closed 1 year ago
Perhaps a simpler representation of the problem is that this does not return 0:
(pl.Series([], dtype=pl.Utf8).to_frame()
.select(pl.all().count().over(True)))
# shape: (0, 1)
# ┌─────┐
# │ │
# │ --- │
# │ str │
# ╞═════╡
# └─────┘
@cmdlineluser that feels like correct behavior to me. This is not a groupby.count(), it's a count.over(), which should return a result for every existing row. There are no existing rows, so we get no results.
Thanks @mcrumiller. That's indeed correct. For every group in over we must return the count. No groups, no count.
This is correct behavior.
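To make the distinction concrete, here is a plain-Python sketch of the two semantics being discussed. It is only an illustration and not how Polars implements either operation; the helper names `aggregate_count` and `window_count` are made up for this example.

```python
from collections import Counter


def aggregate_count(values):
    """Aggregation: always yields exactly one value, even for empty input."""
    return [sum(v is not None for v in values)]


def window_count(values, keys):
    """Window: yields one value per input row, the non-null count of that row's group."""
    per_group = Counter(k for k, v in zip(keys, values) if v is not None)
    return [per_group[k] for k in keys]


empty_agg = aggregate_count([])        # one row holding the count 0
empty_win = window_count([], [])       # no rows in, no rows out
filled_win = window_count(["a", "b", "c"], ["x", "x", "y"])

print(empty_agg, empty_win, filled_win)
```

In both cases the values are integer counts, so an empty window result would still be expected to carry a count dtype rather than the input column's dtype.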
@mcrumiller Yeah, good point.
It perhaps seems "wrong" that a .count() operation is returning a str dtype though.
Without .over it returns 0:
(pl.Series([], dtype=pl.Utf8).to_frame()
.select(pl.all().count()))
# shape: (1, 1)
# ┌─────┐
# │ │
# │ --- │
# │ u32 │
# ╞═════╡
# │ 0 │
# └─────┘
Yeah, it should definitely return an empty u32 column. @romiof that's what's causing your issue. .count().over() on an empty df returns the original column's dtype, in this case str:
df.select(pl.col("ID").count().over(["ID", "DESC"]))
shape: (0, 1)
┌─────┐
│ ID │
│ --- │
│ str │
╞═════╡
└─────┘
This is definitely a bug. For now, you can get around it by casting the .count():
df_new = (
    df.filter(
        pl.col("ID").count().over(["ID", "DESC"]).cast(pl.UInt32) == 1  # note the cast
    )
    .filter(pl.col("dataset") == "source")
)
shape: (0, 3)
┌─────┬──────┬─────────┐
│ ID ┆ DESC ┆ dataset │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪══════╪═════════╡
└─────┴──────┴─────────┘
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
When I load a DataFrame and my source is empty, I receive an empty Polars DataFrame.
In the next step, I perform a COUNT with OVER on some columns.
In this case, if my column is of type Utf8 and is empty, I get a ComputeError crash:
ComputeError: cannot compare utf-8 with numeric data
If my empty column is, for instance, Int32/Float32, the COUNT OVER works fine.
I know that I could use an if df.is_empty() check, but I think this is a kind of bug.
Reproducible example
Traceback
Expected behavior
In this case, I expect an empty DataFrame to be returned.
This works:
And this works too:
Installed versions