pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

DataFrame.sort() raises PanicException: '`arg_sort` operation not supported for dtype `list[i64]`' #10047

Open vinhloc30796 opened 1 year ago

vinhloc30796 commented 1 year ago

Problem description

Using df.sort() with a list[i64] column raises an error pointing to:

https://github.com/pola-rs/polars/blob/master/polars/polars-core/src/series/series_trait.rs#L390-L399

Also mentioned as a part of #7777

Thanks!

Reproducible example

>>> import polars as pl
>>> df = pl.from_records(
...     [
...         {"a": []},
...         {"a": [1, 2]},
...         {"a": [1]},
...         {"a": [2, 3, 5]},
...         {"a": [0]},
...         {"a": [1, 1]},
...         {"a": [1, 0]},
...     ],
...     schema={"a": pl.List(inner=int)},
... )
>>> df
shape: (7, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [1, 2]    │
│ [1]       │
│ [2, 3, 5] │
│ [0]       │
│ [1, 1]    │
│ [1, 0]    │
└───────────┘
>>> df.sort("a")
thread '<unnamed>' panicked at '`sort_with` operation not supported for dtype `list[i64]`', /home/runner/work/polars/polars/polars/polars-core/src/series/series_trait.rs:364:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vinhloc30796/.pyenv/versions/3.10.8/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3992, in sort
    self.lazy()
  File "/home/vinhloc30796/.pyenv/versions/3.10.8/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1508, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: `sort_with` operation not supported for dtype `list[i64]`

Expected behavior

shape: (7, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [0]       │
│ [1]       │
│ [1, 0]    │
│ [1, 1]    │
│ [1, 2]    │
│ [2, 3, 5] │
└───────────┘
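The expected order above is plain lexicographic ordering, where a list that is a prefix of another sorts first. For comparison only (this is Python's built-in list comparison, not Polars code), `sorted` produces the same order on the example data:

```python
# Rows from the reproducible example, as plain Python lists.
rows = [[], [1, 2], [1], [2, 3, 5], [0], [1, 1], [1, 0]]

# Python compares lists element by element; when one list is a prefix
# of the other, the shorter list sorts first.
expected = sorted(rows)
print(expected)
```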

sjt-motif commented 1 year ago

Just wanted to mention that fixing this would be very helpful for a use case that I have.

cmdlineluser commented 1 year ago

@sjt-motif You could convert it to a struct as a temporary workaround.

df.select(pl.col("a").sort_by(pl.col("a").list.to_struct("max_width")))

# shape: (7, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ []        │
# │ [0]       │
# │ [1]       │
# │ [1, 0]    │
# │ [1, 1]    │
# │ [1, 2]    │
# │ [2, 3, 5] │
# └───────────┘

sjt-motif commented 1 year ago

@cmdlineluser Thanks! It works great except the descending=True version exhibits some weird behavior:

>>> df.select(pl.col("a").sort_by(pl.col("a").list.to_struct("max_width"), descending=True))

shape: (7, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [2, 3, 5] │
│ [1]       │
│ [1, 2]    │
│ [1, 1]    │
│ [1, 0]    │
│ [0]       │
└───────────┘

That's quite easy for me to work around though. Thanks!

cmdlineluser commented 1 year ago

Hm yeah, not sure why [1] comes first there, there's also no nulls_last= for .sort_by

I guess this is closer to what you want:

(df.with_columns(sort_by = pl.col("a").list.to_struct("max_width"))
   .sort("sort_by", descending=True, nulls_last=True))

# shape: (7, 2)
# ┌───────────┬──────────────────┐
# │ a         ┆ sort_by          │
# │ ---       ┆ ---              │
# │ list[i64] ┆ struct[3]        │
# ╞═══════════╪══════════════════╡
# │ [2, 3, 5] ┆ {2,3,5}          │
# │ [1, 2]    ┆ {1,2,null}       │
# │ [1, 1]    ┆ {1,1,null}       │
# │ [1, 0]    ┆ {1,0,null}       │
# │ [1]       ┆ {1,null,null}    │
# │ [0]       ┆ {0,null,null}    │
# │ []        ┆ {null,null,null} │
# └───────────┴──────────────────┘
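One possible reading of the two descending results, offered as a guess rather than confirmed Polars behaviour: after `to_struct`, shorter lists are padded with nulls, and the outcome depends on whether nulls sort first or last within each field. The pure-Python sketch below models that assumption; `pad` and `descending_key` are hypothetical helpers, not Polars functions.

```python
rows = [[], [1, 2], [1], [2, 3, 5], [0], [1, 1], [1, 0]]
width = max(len(r) for r in rows)

def pad(row):
    # Pad to a fixed width with None, mirroring the struct fields shown above.
    return tuple(row) + (None,) * (width - len(row))

def descending_key(padded, nulls_last):
    # Key intended for sorted(..., reverse=True). Each field becomes
    # (rank, value); the rank decides whether None ends up before or
    # after real values once the order is reversed.
    null_rank = 0 if nulls_last else 1
    return tuple(
        (null_rank, 0) if v is None else (1 - null_rank, v)
        for v in padded
    )

# Assumption: nulls first in every field.
nulls_first = sorted(
    rows, key=lambda r: descending_key(pad(r), nulls_last=False), reverse=True
)
# Assumption: nulls last in every field.
nulls_last = sorted(
    rows, key=lambda r: descending_key(pad(r), nulls_last=True), reverse=True
)
print(nulls_first)
print(nulls_last)
```

Under these assumptions, the nulls-first variant reproduces the order from the `sort_by(..., descending=True)` call, and the nulls-last variant reproduces the `nulls_last=True` output.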

vinhloc30796 commented 1 year ago

Thanks @cmdlineluser, I basically did that, but dumber: manually find the max length with agg.max(), pad with lambda s: list(s) + [None] * max_length, then finally explode the columns and sort by all of them.

Your way is wayyyy shorter lol.
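The manual approach described above can be sketched in plain Python. This illustrates the steps only and is not the code that was actually used; it assumes padding by the remaining length (`max_length - len(r)`) so every row ends up the same width, and that None sorts before any real value.

```python
rows = [[], [1, 2], [1], [2, 3, 5], [0], [1, 1], [1, 0]]

# Step 1: find the longest list.
max_length = max((len(r) for r in rows), default=0)

# Step 2: pad every list with None up to that length.
padded = [list(r) + [None] * (max_length - len(r)) for r in rows]

# Step 3: sort by all positions, placing None before any real value.
def sort_key(padded_row):
    return [(v is not None, v if v is not None else 0) for v in padded_row]

order = sorted(range(len(rows)), key=lambda i: sort_key(padded[i]))
result = [rows[i] for i in order]
print(result)
```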

trinebrockhoff commented 1 year ago

If this one gets fixed, then assert_frame_equal will also work for pl.List cols:

import polars as pl
from polars.testing import assert_frame_equal

df1 = pl.DataFrame({"A":[1,2,3,4,5], "B":["H","E","L","L","O"], "C":[[0,0],[0,0],[0,0],[0,0],[0,0]]})
df2 = df1.sort("B")

assert_frame_equal(df1, df2, check_row_order=False)

# ComputeError: cannot sort column of dtype `list[i64]`
# InvalidAssert: cannot set 'check_row_order=False' on frame with unsortable columns

cmdlineluser commented 1 year ago

@trinebrockhoff Even though it's related, perhaps that deserves its own issue?

That particular use-case seems like something that warrants a higher priority.

I also didn't realize at the time of my previous comment that you can pass expressions directly to .sort() - so the .with_columns() wasn't actually needed.

df.sort(
    pl.col("a").list.to_struct("max_width"),
    descending=True,
    nulls_last=True,
)

# shape: (7, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ [2, 3, 5] │
# │ [1, 2]    │
# │ [1, 1]    │
# │ [1, 0]    │
# │ [1]       │
# │ [0]       │
# │ []        │
# └───────────┘

maxzw commented 7 months ago

It also doesn't seem to be supported for lists of structs:

schema = {"value": pl.List(pl.Struct({"a": pl.Int32}))}
data = {"value": [[{"a": 1}, {"a": 2}]]}
df = pl.DataFrame(data, schema=schema)
df.sort("value")

Resulting in

InvalidOperationError: `sort_with` operation not supported for dtype `list[struct[1]]`

It would be really nice if sort would support pl.List datatypes, especially since the assert_frame_equal function with check_row_order=False will fail now.

cmdlineluser commented 7 months ago

@maxzw Yeah, it seems that example also causes .group_by to panic.

df.group_by("value").all()
thread '<unnamed>' panicked at crates/polars-core/src/frame/group_by/into_groups.rs:296:52:
PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("cannot sort column of dtype `list[struct[1]]`"))

failable commented 4 months ago

Similar issue?

polars.exceptions.InvalidOperationError: `arg_sort_multiple` operation not supported for dtype `list[str]`

fbscarel commented 1 month ago

This seems similar:

polars.exceptions.InvalidOperationError: cannot sort column of dtype `list[struct[7]]`

Input dataframe is like this, from .glimpse():

$ lines              <list[struct[7]]> [{'description': 'Parts and Supplies', 'unitAmount': 88.33, 'quantity': 1, 'taxRateRef': {'id': 'NON'}, 'inventoryRef': {'id': '75', 'name': 'Sales-Products-Hardware'}, 'id': '1', 'amount': 88.33}], (...)