pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Implement merge_sorted for struct-type columns #13485

Open Hoeze opened 9 months ago

Hoeze commented 9 months ago

Description

Merging sorted dataframes on a struct-type key is not implemented yet:

import polars as pl

test_df = pl.DataFrame(
    {
        "idx_1": [1, 2, 3, 1, 2, 3],
        "idx_2": [4, 4, 5, 5, 6, 6],
        "value": [1, 2, 3, 4, 5, 6],
    }
)
test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')

test_df.merge_sorted(test_df, key="key")

thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/merge_sorted.rs:144:13:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[5], line 10
      1 test_df = pl.DataFrame(
      2     {
      3         "idx_1": [1, 2, 3, 1, 2, 3],
   (...)
      6     }
      7 )
      8 test_df = test_df.with_columns(key=pl.struct('idx_1','idx_2')).sort('key')
---> 10 test_df.merge_sorted(test_df, key="key")

File /opt/anaconda/lib/python3.10/site-packages/polars/dataframe/frame.py:10221, in DataFrame.merge_sorted(self, other, key)
  10157 def merge_sorted(self, other: DataFrame, key: str) -> DataFrame:
  10158     """
  10159     Take two sorted DataFrames and merge them by the sorted key.
  10160 
   (...)
  10219     └────────┴─────┘
  10220     """
> 10221     return self.lazy().merge_sorted(other.lazy(), key).collect(_eager=True)

File /opt/anaconda/lib/python3.10/site-packages/polars/lazyframe/frame.py:1706, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
   1693     comm_subplan_elim = False
   1695 ldf = self._ldf.optimization_toggle(
   1696     type_coercion,
   1697     predicate_pushdown,
   (...)
   1704     _eager,
   1705 )
-> 1706 return wrap_df(ldf.collect())

PanicException: not implemented

See also https://github.com/pola-rs/polars/issues/10935#issuecomment-1879718671

I tried this with the latest Polars release, v0.20.2:

```
--------Version info---------
Polars:              0.20.2
Index type:          UInt32
Platform:            Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28
Python:              3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy:               1.26.2
openpyxl:
pandas:              2.1.4
pyarrow:             14.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

mmcdermott commented 4 months ago

This functionality would be very helpful, especially since merge_sorted only supports a single key column and structs are a natural way to sort by multiple indices at once.

Hoeze commented 3 months ago

This is still an issue with v1.0.