pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cross join LazyFrames with sink_parquet results in panic #12498

Closed: sebastienvercammen closed this issue 11 months ago

sebastienvercammen commented 11 months ago

Reproducible example

import polars as pl

q = pl.scan_parquet("0.parquet")
q = q.join(pl.scan_parquet("1.parquet"), how="cross")
q = q.join(pl.scan_parquet("2.parquet"), how="cross")
q.sink_parquet("out-polars.parquet")

Log output

thread '<unnamed>' panicked at C:\a\polars\polars\crates\polars-arrow\src\chunk.rs:20:31:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("Chunk require all its arrays to have an equal number of rows"))
stack backtrace:
   0:     0x7fff5dda4643 - ffi_select_with_compiled_path
   1:     0x7fff5b426a5d - BrotliDecoderVersion
   2:     0x7fff5dd88151 - ffi_select_with_compiled_path
   3:     0x7fff5dda655a - ffi_select_with_compiled_path
   4:     0x7fff5dda6179 - ffi_select_with_compiled_path
   5:     0x7fff5dda71f7 - ffi_select_with_compiled_path
   6:     0x7fff5dda6cf5 - ffi_select_with_compiled_path
   7:     0x7fff5dda6c39 - ffi_select_with_compiled_path
   8:     0x7fff5dda6c24 - ffi_select_with_compiled_path
   9:     0x7fff5dedae17 - ffi_select_with_compiled_path
  10:     0x7fff5dedb413 - ffi_select_with_compiled_path
  11:     0x7fff5d45665a - ffi_select_with_compiled_path
  12:     0x7fff5d4522d3 - ffi_select_with_compiled_path
  13:     0x7fff5d450ae7 - ffi_select_with_compiled_path
  14:     0x7fff5dda163c - ffi_select_with_compiled_path
  15:     0x7ff9278b7344 - BaseThreadInitThunk
  16:     0x7ff928a226b1 - RtlUserThreadStart
thread '<unnamed>' panicked at crates\polars-pipe\src\executors\sinks\file_sink.rs:298:14:
called `Result::unwrap()` on an `Err` value: Any { .. }
stack backtrace:
   0:     0x7fff5dda4643 - ffi_select_with_compiled_path
   1:     0x7fff5b426a5d - BrotliDecoderVersion
   2:     0x7fff5dd88151 - ffi_select_with_compiled_path
   3:     0x7fff5dda655a - ffi_select_with_compiled_path
   4:     0x7fff5dda6179 - ffi_select_with_compiled_path
   5:     0x7fff5dda71f7 - ffi_select_with_compiled_path
   6:     0x7fff5dda6cf5 - ffi_select_with_compiled_path
   7:     0x7fff5dda6c39 - ffi_select_with_compiled_path
   8:     0x7fff5dda6c24 - ffi_select_with_compiled_path
   9:     0x7fff5dedae17 - ffi_select_with_compiled_path
  10:     0x7fff5dedb413 - ffi_select_with_compiled_path
  11:     0x7fff5d478914 - ffi_select_with_compiled_path
  12:     0x7fff5d4f06bc - ffi_select_with_compiled_path
  13:     0x7fff5c9a652a - ffi_select_with_compiled_path
  14:     0x7fff5d89fbbe - ffi_select_with_compiled_path
  15:     0x7fff5c9b9a87 - ffi_select_with_compiled_path
  16:     0x7fff5b188eac - <unknown>
  17:     0x7fff5b1a080e - <unknown>
  18:     0x7fff5aa19a3c - <unknown>
  19:     0x7fff5b2d92d1 - PyInit_polars
  20:     0x7ff8b61f987f - PyComplex_AsCComplex
  21:     0x7ff8b61ee659 - PyBytes_Repeat
  22:     0x7ff8b61eebd1 - PyObject_Vectorcall
  23:     0x7ff8b62e6c5a - PyEval_EvalFrameDefault
  24:     0x7ff8b62eaa4e - PyEval_EvalFrameDefault
  25:     0x7ff8b62e2180 - PyEval_EvalCode
  26:     0x7ff8b6361e1e - PyRun_FileExFlags
  27:     0x7ff8b6361ef8 - PyRun_FileExFlags
  28:     0x7ff8b6361a28 - PyRun_StringFlags
  29:     0x7ff8b635f5f5 - PyRun_SimpleFileObject
  30:     0x7ff8b635e864 - PyRun_AnyFileObject
  31:     0x7ff8b6170abc - Py_gitidentifier
  32:     0x7ff8b6171493 - Py_gitidentifier
  33:     0x7ff8b6171830 - Py_Main
  34:     0x7ff6b7d11494 - OPENSSL_Applink
  35:     0x7ff9278b7344 - BaseThreadInitThunk
  36:     0x7ff928a226b1 - RtlUserThreadStart
Traceback (most recent call last):
  File "D:\repos\pwatch\duckdb vs parquet\start.py", line 66, in <module>
    with_polars()
  File "D:\repos\pwatch\duckdb vs parquet\start.py", line 50, in with_polars
    q.sink_parquet("out-polars.parquet")
  File "...\Lib\site-packages\polars\lazyframe\frame.py", line 2031, in sink_parquet
    return lf.sink_parquet(
           ^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Any { .. }

Issue description

Using lazy loading, streaming, and the scan_* methods, I was comparing polars against alternatives for larger-than-memory cross joins written to Parquet. I'm new to polars, so apologies if I'm misunderstanding anything, but a basic cross join across more than one table seems to fail.

For testing purposes, each Parquet file contains exactly 1 column and 10 rows of auto-generated unique strings, each 7 characters long.
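
Test inputs of that shape can be generated with a short stdlib script (a sketch; the reporter's actual generator is not shown, and `unique_strings` is a hypothetical helper name):

```python
import random
import string

def unique_strings(n: int, length: int = 7, seed: int = 0) -> list[str]:
    """Generate n distinct random lowercase strings of the given length."""
    rng = random.Random(seed)
    seen: set[str] = set()
    while len(seen) < n:
        seen.add("".join(rng.choices(string.ascii_lowercase, k=length)))
    return sorted(seen)

# Each input file would then be written with something like:
# pl.DataFrame({"col": unique_strings(10)}).write_parquet("0.parquet")
```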

Expected behavior

I expected polars to cross join AxBxC.
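
For reference, a cross join is a Cartesian product, so three 10-row inputs should produce 10 × 10 × 10 = 1000 output rows. The expected semantics can be sketched with the stdlib (hypothetical column values):

```python
from itertools import product

a = [f"a{i}" for i in range(10)]
b = [f"b{i}" for i in range(10)]
c = [f"c{i}" for i in range(10)]

# A cross join of A, B, C yields every combination of one row from each.
rows = list(product(a, b, c))
assert len(rows) == 10 ** 3  # 1000 combinations
```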

Installed versions

```
>>> pl.show_versions()
--------Version info---------
Polars:              0.19.13
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:              2023.10.0
gevent:
matplotlib:
numpy:               1.26.2
openpyxl:
pandas:              2.1.3
pyarrow:             14.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
cmdlineluser commented 11 months ago

Can reproduce:

import tempfile
import polars as pl

A = tempfile.NamedTemporaryFile()
B = tempfile.NamedTemporaryFile()

rows = pl.int_range(0, 2)

pl.select(A = rows).write_parquet(A.name)
pl.select(B = rows).write_parquet(B.name)

q = pl.scan_parquet(A.name)
q = q.join(pl.scan_parquet(B.name), how="cross")

q.sink_parquet("out.parquet")

# thread '<unnamed>' panicked at /Users/user/git/polars/crates/polars-arrow/src/chunk.rs:20:31:
# called `Result::unwrap()` on an `Err` value: 
# ComputeError(ErrString("Chunk require all its arrays to have an equal number of rows"))
# thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/file_sink.rs:298:14:
cmdlineluser commented 11 months ago

@ritchie46 The docs for .collect(streaming=) have a warning:

This functionality is currently in an alpha state.

Is the same true for .sink_parquet()?

I wasn't sure whether it was accurate for me to state that in my initial comment.