pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`count().over()` on an empty DataFrame with Utf8 columns causes `ComputeError` #8822

Closed romiof closed 1 year ago

romiof commented 1 year ago

Issue description

When I load a DataFrame and the source is empty, I receive an empty Polars DataFrame.

In the next step, I perform a COUNT with OVER on some columns.

In this case, if the column is of type Utf8 and is empty, I get a ComputeError crash:

ComputeError: cannot compare utf-8 with numeric data

If my empty column is, for instance, Int32/Float32, the COUNT OVER works fine.

I know that I could guard against this with a check like `if df.is_empty()`, but I think this is a bug.

Reproducible example

df = pl.DataFrame(
    {
        "ID": [],
        "DESC": [],
        "dataset": []
    }, 
    schema={
        "ID": pl.Utf8,
        "DESC": pl.Utf8,
        "dataset": pl.Utf8
    }
)

df_new = df.filter(pl.col("ID").count().over(["ID", "DESC"]) == 1).filter(pl.col("dataset") == "source")

print(df_new)

Traceback

```py
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[60], line 14
      1 df = pl.DataFrame(
      2     {
      3         "ID": [],
   (...)
     11     }
     12 )
---> 14 df_new = df.filter(pl.col("ID").count().over(["ID", "DESC"]) == 1).filter(pl.col("dataset") == "source")
     15 print(df_new)

File ~/projects/venv/lib/python3.10/site-packages/polars/dataframe/frame.py:3546, in DataFrame.filter(self, predicate)
   3542 if _check_for_numpy(predicate) and isinstance(predicate, np.ndarray):
   3543     predicate = pl.Series(predicate)
   3545 return self._from_pydf(
-> 3546     self.lazy()
   3547     .filter(predicate)  # type: ignore[arg-type]
   3548     .collect(no_optimization=True)
   3549     ._df
   3550 )

File ~/projects/venv/lib/python3.10/site-packages/polars/lazyframe/frame.py:1602, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1591     common_subplan_elimination = False
   1593 ldf = self._ldf.optimization_toggle(
   1594     type_coercion,
   1595     predicate_pushdown,
   (...)
   1600     streaming,
   1601 )
-> 1602 return wrap_df(ldf.collect())

ComputeError: cannot compare utf-8 with numeric data
```

Expected behavior

In this case, I expect an empty DataFrame to be returned.

This works:

df = pl.DataFrame(
    {
        "ID": [],
        "DESC": [],
        "dataset": []
    }, 
    schema={
        "ID": pl.Int32,
        "DESC": pl.Int32,
        "dataset": pl.Utf8
    }
)

df_new = df.filter(pl.col("ID").count().over(["ID", "DESC"]) == 1).filter(pl.col("dataset") == "source")
print(df_new)

And this works too:

df = pl.DataFrame(
    {
        "ID": [None],
        "DESC": [None],
        "dataset": [None]
    }, 
    schema={
        "ID": pl.Utf8,
        "DESC": pl.Utf8,
        "dataset": pl.Utf8
    }
)

df_new = df.filter(pl.col("ID").count().over(["ID", "DESC"]) == 1).filter(pl.col("dataset") == "source")
print(df_new)

Installed versions

```
--------Version info---------
Polars: 0.17.13
Index type: UInt32
Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
----Optional dependencies----
numpy: 1.24.3
pandas: 2.0.1
pyarrow: 11.0.0
connectorx: 0.3.2a5
deltalake:
fsspec:
matplotlib:
xlsx2csv: 0.8.1
xlsxwriter:
```

cmdlineluser commented 1 year ago

Perhaps a simpler representation of the problem is that this does not return 0:

(pl.Series([], dtype=pl.Utf8).to_frame()
   .select(pl.all().count().over(True)))

# shape: (0, 1)
# ┌─────┐
# │     │
# │ --- │
# │ str │
# ╞═════╡
# └─────┘
mcrumiller commented 1 year ago

@cmdlineluser that feels like correct behavior to me. This is not a groupby.count(); it's a count().over(), which should return a result for every existing row. There are no existing rows, so we get no results.

ritchie46 commented 1 year ago

Thanks @mcrumiller. That's indeed correct. For every group in over we must return the count. No groups, no count.

This is correct behavior.

cmdlineluser commented 1 year ago

@mcrumiller Yeah, good point.

It perhaps seems "wrong" that a .count() operation is returning a str dtype though.

Without .over it returns 0:

(pl.Series([], dtype=pl.Utf8).to_frame()
   .select(pl.all().count()))

# shape: (1, 1)
# ┌─────┐
# │     │
# │ --- │
# │ u32 │
# ╞═════╡
# │ 0   │
# └─────┘
mcrumiller commented 1 year ago

Yeah, it should definitely return an empty u32 column. @romiof that's what's causing your issue. .count().over() on an empty df returns the original column's dtype, in this case str:

df.select(pl.col("ID").count().over(["ID", "DESC"]))
shape: (0, 1)
┌─────┐
│ ID  │
│ --- │
│ str │
╞═════╡
└─────┘

This is definitely a bug. For now, you can get around it by casting the .count():

df_new = (
    df.filter(
        pl.col("ID").count().over(["ID", "DESC"]).cast(pl.UInt32) == 1  # note the cast
    ) 
    .filter(pl.col("dataset") == "source")
)
shape: (0, 3)
┌─────┬──────┬─────────┐
│ ID  ┆ DESC ┆ dataset │
│ --- ┆ ---  ┆ ---     │
│ str ┆ str  ┆ str     │
╞═════╪══════╪═════════╡
└─────┴──────┴─────────┘