pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

sink_parquet throws pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value #19273

Open tsoernes opened 1 week ago

tsoernes commented 1 week ago

Reproducible example

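    import polars as pl

    # Not defined in this snippet: out_path, columns, key_column, activity_embeddings_path.
    master = pl.scan_parquet(out_path, low_memory=True).select(columns=columns)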
    paths = [
        p for p in activity_embeddings_path.parent.listdir() if str(p)[-1].isnumeric()
    ]

    for p in paths:
        df = pl.scan_parquet(p, low_memory=True).select(columns=columns)
        master = pl.concat([master, df])
        master = master.unique(subset=key_column)
        master.sink_parquet("/tmp/xx.parquet")
        p.unlink()

Log output

No response

Issue description

POLARS PREFETCH_SIZE: 24
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-pipe/src/pipeline/convert.rs:409:88:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/torstein/code/nysno/nysno/embedding.py", line 531, in <module>
    merge_temp()
  File "/home/torstein/code/nysno/nysno/embedding.py", line 513, in merge_temp
    master.sink_parquet("/tmp/xx.parquet")
  File "/home/torstein/code/nysno/.venv/lib/python3.12/site-packages/polars/_utils/unstable.py", line 58, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/torstein/code/nysno/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2385, in sink_parquet
    return lf.sink_parquet(
           ^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Expected behavior

The Parquet files should be concatenated, deduplicated on `key_column`, and written to `/tmp/xx.parquet` without a panic.

Installed versions

```
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            Linux-6.11.3-200.fc40.x86_64-x86_64-with-glibc2.39
Python:              3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]

----Optional dependencies----
adbc_driver_manager  1.2.0
altair               5.4.1
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio
numpy                2.1.2
openpyxl
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg
sqlalchemy           2.0.36
torch
xlsx2csv
xlsxwriter
```

cmdlineluser commented 1 week ago

Do you have an example that can be run?