pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unimplemented join on list[i64] succeeds with empty dataframe in lazy mode #18120

Open CHDev93 opened 2 months ago

CHDev93 commented 2 months ago

Reproducible example


```python
import polars as pl

a_df = pl.DataFrame({"a": [0,1,2,3]})
b_df = pl.DataFrame({"b": [1,2,3,4]})

# a, b
idx_df = a_df.join(b_df, how="cross")

c_df = pl.DataFrame({"c":[-1,0,1]})

# a, b, c, ch0
data_df = (
    idx_df.join(c_df, how="cross")
    .with_columns(
        pl.lit(0).alias("ch0"),
    )
)

time_idx_df = idx_df.join(c_df, how="cross").select(pl.concat_list("a", "b", "c").alias("index"))

data_df = data_df.select(pl.concat_list("a", "b", "c").alias("index").set_sorted())
data_lf = data_df.lazy()

tmp_lf = time_idx_df.lazy().join(
    data_lf,
    how="left",
    on="index",
    coalesce=True,
)

# a, b, c, ch0
result_df = tmp_lf.drop("index").fill_null(0).collect() # EMPTY DATAFRAME

# ComputeError: not yet implemented: Hash Left Join between list[i64] and list[i64]
# time_idx_df.join(
#     data_df,
#     how="left",
#     on="index",
#     coalesce=True,
# )
```

Log output

```
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
LEFT join dataframes finished
Traceback (most recent call last):
  File "/persist/code/gr-goulash/polars_schema_error.py", line 45, in <module>
    time_idx_df.join(
  File "/persist/.virtualenvs/redcarpet/lib/python3.10/site-packages/polars/dataframe/frame.py", line 6870, in join
    self.lazy()
  File "/persist/.virtualenvs/redcarpet/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1942, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions
```

Issue description

My join on a three-column index key was using too much memory, so I tried combining the key columns into a single list-dtype column and joining on that instead.

Running in eager mode, the query fails because hash join is not implemented on the list dtype. Doing it with a lazy frame succeeds but silently returns an empty dataframe.

Expected behavior

Doing the operation with a lazy frame should also fail hard, like the commented-out code at the bottom of the example, indicating that the join has not been implemented.

Installed versions

```
--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.23.5
openpyxl:
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             2.8.2
pyiceberg:
sqlalchemy:
torch:                2.4.0+cu121
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 2 months ago

Is there a minimal repro for this one? There seems to be a lot of unneeded work and operations in the example.

CHDev93 commented 2 months ago

@ritchie46 I've simplified the example in the main issue. Basically, I think both the lazy and the eager path should fail with "not implemented", yet the lazy one succeeds with a 0-row dataframe.