pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Error when the result of joining two `DataFrame`s is empty, and one of the `DataFrame`s contains an array column #15474

Open CyborgSquirrel opened 7 months ago

CyborgSquirrel commented 7 months ago

Reproducible example

import polars as pl

a = pl.DataFrame([
    pl.Series("x", [1, 2, 3]),
    pl.Series("y", [[1, 2, 3], [4, 5, 6], [7, 8, 9]]).list.to_array(3),
])

b = pl.DataFrame([
    pl.Series("x", [0]),
])

print(a.join(b, "x"))

Log output

join parallel: true
thread 'polars-1' panicked at /home/runner/work/polars/polars/crates/polars-arrow/src/array/static_array_collect.rs:929:18:
called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("concat requires input of at least one array"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/andrei/example.py", line 13, in <module>
    print(a.join(b, "x"))
  File "/home/andrei/.local/lib/python3.10/site-packages/polars/dataframe/frame.py", line 6392, in join
    self.lazy()
  File "/home/andrei/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("concat requires input of at least one array"))

Issue description

Tested on Ubuntu 22.04.4 LTS.

Same error happens if I swap a and b in the join.

import polars as pl

a = pl.DataFrame([
    pl.Series("x", [1, 2, 3]),
    pl.Series("y", [[1, 2, 3], [4, 5, 6], [7, 8, 9]]).list.to_array(3),
])

b = pl.DataFrame([
    pl.Series("x", [0]),
])

print(b.join(a, "x")) # swapped a with b

If I use a list instead of an array, it works as expected.

import polars as pl

a = pl.DataFrame([
    pl.Series("x", [1, 2, 3]),
    pl.Series("y", [[1, 2, 3], [4, 5, 6], [7, 8, 9]]), # removed cast to array
])

b = pl.DataFrame([
    pl.Series("x", [0]),
])

print(a.join(b, "x"))

If the result of joining a and b is not empty, then it also works as expected.

import polars as pl

a = pl.DataFrame([
    pl.Series("x", [1, 2, 3]),
    pl.Series("y", [[1, 2, 3], [4, 5, 6], [7, 8, 9]]).list.to_array(3),
])

b = pl.DataFrame([
    pl.Series("x", [0, 1, 2]), # added some values
])

print(a.join(b, "x"))

Expected behavior

The join should output an empty DataFrame. For the provided example, the output should look something like this:

┌─────┬───────────────┐
│ x   ┆ y             │
│ --- ┆ ---           │
│ i64 ┆ array[i64, 3] │
╞═════╪═══════════════╡
└─────┴───────────────┘

Installed versions

```
--------Version info---------
Polars: 0.20.18
Index type: UInt32
Platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fastexcel:
fsspec: 2023.12.2
gevent:
hvplot:
matplotlib: 3.5.1
nest_asyncio:
numpy: 1.21.5
openpyxl: 3.0.9
pandas: 1.3.5
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

chmp commented 6 months ago

The same bug seems to be triggered when performing a group-by aggregation on empty DataFrames with pl.Array() columns.

Example:

import polars as pl

# works
(
  pl.DataFrame({
    "id": pl.Series([], dtype=pl.Utf8()), 
    "val": pl.Series([], dtype=pl.List(pl.Float32)),
  })
  .group_by(pl.col("id"))
  .first()
)

# fails
(
  pl.DataFrame({
    "id": pl.Series([], dtype=pl.Utf8()),
    "val": pl.Series([], dtype=pl.Array(pl.Float32, 20)),
  })
  .group_by(pl.col("id"))
  .first()
)
# exception: `PanicException: called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("concat requires input of at least one array"))`.

Backtrace & Installed versions

```python
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[580], line 1
----> 1 pl.DataFrame({"id": pl.Series([], dtype=pl.Utf8()), "val": pl.Series([], dtype=pl.Array(pl.Float32, 20))}).group_by(pl.col("id")).first()

File ~\miniconda3\envs\py311\Lib\site-packages\polars\dataframe\group_by.py:546, in GroupBy.first(self)
    520 def first(self) -> DataFrame:
    521     """
    522     Aggregate the first values in the group.
    523
   (...)
    544     └────────┴─────┴──────┴───────┘
    545     """
--> 546     return self.agg(F.all().first())

File ~\miniconda3\envs\py311\Lib\site-packages\polars\dataframe\group_by.py:250, in GroupBy.agg(self, *aggs, **named_aggs)
    141 def agg(
    142     self,
    143     *aggs: IntoExpr | Iterable[IntoExpr],
    144     **named_aggs: IntoExpr,
    145 ) -> DataFrame:
    146     """
    147     Compute aggregations for each group of a group by operation.
    148
   (...)
    244     └─────┴───────┴────────────────┘
    245     """
    246     return (
    247         self.df.lazy()
    248         .group_by(*self.by, **self.named_by, maintain_order=self.maintain_order)
    249         .agg(*aggs, **named_aggs)
--> 250         .collect(no_optimization=True)
    251     )

File ~\miniconda3\envs\py311\Lib\site-packages\polars\lazyframe\frame.py:1810, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1807 if background:
   1808     return InProcessQuery(ldf.collect_concurrently())
-> 1810 return wrap_df(ldf.collect())

PanicException: called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("concat requires input of at least one array"))
```

```
--------Version info---------
Polars: 0.20.23
Index type: UInt32
Platform: Windows-10-10.0.22631-SP0
Python: 3.11.3 | packaged by Anaconda, Inc. | (main, Apr 19 2023, 23:46:34) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fastexcel:
fsspec: 2023.10.0
gevent:
hvplot:
matplotlib: 3.8.0
nest_asyncio: 1.5.6
numpy: 1.26.1
openpyxl:
pandas: 2.0.3
pyarrow: 12.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```