pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars LazyFrame invalid operation error #18989

Closed: pyjads closed this issue 4 weeks ago

pyjads commented 1 month ago


Reproducible example

```python
import polars as pl

df = pl.LazyFrame({
    "key1": ["a", "a", "b", "b", "c"],
    "key2": ["elephant", "elephant", "dog", "dog", "cat"],
    "values": [1, 2, 3, 4, 5],
})

# Mapping intentionally omits 'c'; rows with key1 == "c" keep their value.
mapping1 = {
    "a": "apple",
    "b": "banana",
    # "c": "carrot"
}

# Conditionally remap key1 via `replace`, leaving other keys untouched.
_df = df.with_columns(
    pl.when(pl.col("key1") == "a")
    .then(pl.col("key1").replace(mapping1))
    .otherwise(pl.col("key1"))
    .cast(pl.Utf8)
)

# Raises InvalidOperationError on Polars 1.8.2.
_df.sink_parquet("tmp.parquet")
```

Log output

```
Traceback (most recent call last):
  File "/Users/mysystem/vs-code/simple_scripts/anon/anon.py", line 41, in <module>
    _df.sink_parquet("tmp.parquet", )
  File "/Users/mysystem/vs-code/simple_scripts/.venv/lib/python3.11/site-packages/polars/_utils/unstable.py", line 58, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mysystem/vs-code/simple_scripts/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2388, in sink_parquet
    return lf.sink_parquet(
           ^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: sink_Parquet(ParquetWriteOptions { compression: Zstd(None), statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true }) not yet supported in standard engine. Use 'collect().write_parquet()'
```

Issue description

The code above raises an InvalidOperationError, although the same operation works correctly in version 0.20.3.

Expected behavior

The sink operation should have written the LazyFrame to the Parquet file.

Installed versions

```
--------Version info---------
Polars:               1.8.2
Index type:           UInt32
Platform:             macOS-13.4-arm64-arm-64bit
Python:               3.11.3 (v3.11.3:f3909b8bc8, Apr 4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake             0.20.0
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio          1.6.0
numpy                 2.1.1
openpyxl
pandas                2.2.2
pyarrow               17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter            None
```
ritchie46 commented 4 weeks ago

This isn't a bug. The error message prints a workaround. This query is not streaming.
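
One way to verify that is to inspect the plan (a sketch, assuming the `streaming` flag of `LazyFrame.explain` available in this Polars version): parts of the plan the streaming engine can run are printed inside a STREAMING section, and anything outside it falls back to the in-memory engine, which is why the streaming `sink_parquet` refuses the query.

```python
import polars as pl

df = pl.LazyFrame({
    "key1": ["a", "a", "b", "b", "c"],
    "values": [1, 2, 3, 4, 5],
})

# Print the physical plan with streaming sections marked. Per this
# thread, the when/then/otherwise + `replace` expression is what
# prevents the plan from running fully on the streaming engine.
print(
    df.with_columns(
        pl.when(pl.col("key1") == "a")
        .then(pl.col("key1").replace({"a": "apple", "b": "banana"}))
        .otherwise(pl.col("key1"))
    ).explain(streaming=True)
)
```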

pyjads commented 3 weeks ago

@ritchie46 But the code works with version 0.20.3. Were there any changes to the streaming implementation?

ritchie46 commented 3 weeks ago

It's due to `replace`, and this was changed in 1.0, where we allowed breaking changes. The new streaming engine will eventually be able to do this, but it will take a while. For now I'd recommend `collect().write_parquet()`.
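
Applied to the reproducible example above, that workaround is a one-line change (a sketch; note that `collect()` materializes the full result in memory before writing, so it needs enough RAM for the whole frame):

```python
import polars as pl

df = pl.LazyFrame({
    "key1": ["a", "a", "b", "b", "c"],
    "key2": ["elephant", "elephant", "dog", "dog", "cat"],
    "values": [1, 2, 3, 4, 5],
})

mapping1 = {"a": "apple", "b": "banana"}

_df = df.with_columns(
    pl.when(pl.col("key1") == "a")
    .then(pl.col("key1").replace(mapping1))
    .otherwise(pl.col("key1"))
    .cast(pl.Utf8)
)

# Run the query with the in-memory engine, then write the resulting
# DataFrame, instead of streaming it with `sink_parquet`.
_df.collect().write_parquet("tmp.parquet")
```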