pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Processing parquet results in pyo3_runtime.PanicException: mid > len #18436

Open Baukebrenninkmeijer opened 2 months ago

Baukebrenninkmeijer commented 2 months ago

Reproducible example

```python
import polars as pl
from tqdm import tqdm

# `file_list`, `OLD_SCHEMA`, `date_directory_name`, `chunksize`, `chunk_idx`,
# `output_dir`, and `frames` are defined elsewhere in the surrounding script.
for idx, file in tqdm(enumerate(file_list), total=len(file_list)):
    csv_df = pl.scan_csv(
        file,
        has_header=False,
        truncate_ragged_lines=True,
        schema=OLD_SCHEMA if 'march' in date_directory_name.lower() else None,
        ignore_errors=True,
    )
    frames.append(csv_df)
    # Every `chunksize` files (and at the last file), concatenate the lazy frames,
    # convert epoch-millisecond columns to datetimes, and sink the chunk to parquet.
    if ((idx > 0) and (idx % chunksize == 0)) or (idx == (len(file_list) - 1)):
        combined_df = pl.concat(frames, how='vertical', parallel=True)
        columns = ['dt_col1', 'dt_col2', 'dt_col3']
        combined_df = combined_df.with_columns(
            pl.from_epoch(pl.col(col_name), time_unit='ms').alias(col_name)
            for col_name in columns
            if combined_df.collect_schema().get(col_name) == pl.Int64()
        )
        output_filename = output_dir / f'chunk_{chunk_idx}.parquet'
        combined_df.sink_parquet(output_filename)
```
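
Note that the traceback in the log below points at a later step (`raw_parquet_to_processed_parquet`, which calls `sink_parquet`), and the Rust frames sit in the parquet reader, so the panic appears to happen while re-reading parquet files that are being scanned for the sink. A minimal, hypothetical reconstruction of that step (paths and names are assumed, not taken from the script):

```python
# Hypothetical reconstruction of the failing step, inferred only from the
# traceback below; the actual raw_parquet_to_processed_parquet is not shown.
import polars as pl
from pathlib import Path

output_dir = Path("output")                            # placeholder directory
df = pl.scan_parquet(output_dir / "chunk_*.parquet")   # lazily scan the written chunks
df.sink_parquet(output_dir / "processed.parquet")      # raises pyo3_runtime.PanicException: mid > len
```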

Log output

```
thread 'polars-7' panicked at crates/polars-parquet/src/arrow/read/deserialize/binary/utils.rs:121:45:
mid > len
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: polars_parquet::arrow::read::deserialize::binary::decoders::deserialize_plain
   3: <polars_parquet::arrow::read::deserialize::binview::BinViewDecoder as polars_parquet::arrow::read::deserialize::utils::Decoder>::deserialize_dict
   4: polars_parquet::arrow::read::deserialize::simple::page_iter_to_array
   5: polars_io::parquet::read::read_impl::column_idx_to_series
   6: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   7: rayon::iter::plumbing::bridge_producer_consumer::helper
   8: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   9: rayon_core::registry::WorkerThread::wait_until_cold
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
 33%|█████████████████████████████████████                                                                          | 2/6 [21:30<43:00, 645.16s/it]
Traceback (most recent call last):
  File "/Developer/ING/PSS_hardware_monitoring/pss/etl.py", line 374, in <module>
    Fire(ETL)
  File "/.pyenv/versions/3.12.5/envs/pss/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.12.5/envs/pss/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.12.5/envs/pss/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Developer/PSS_hardware_monitoring/pss/etl.py", line 295, in raw_parquet_to_processed_parquet
    df.sink_parquet(out_file)
  File "/.pyenv/versions/3.12.5/envs/pss/lib/python3.12/site-packages/polars/_utils/unstable.py", line 58, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.12.5/envs/pss/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2351, in sink_parquet
    return lf.sink_parquet(
           ^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: mid > len
```

Issue description

It's unclear to me what this error means. I'm processing a total of 5 parquet files, and the pipeline fails on the third one. I can read every input file fine with a plain polars read, so there doesn't seem to be an obvious problem with the files themselves.
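
As a hedged illustration of that check (file names are placeholders, not from the report), eagerly reading each input individually succeeds:

```python
# Hypothetical sanity check: each input parquet file loads fine with an eager
# read, so corruption is not obvious from pl.read_parquet alone.
import polars as pl

for path in ["input_1.parquet", "input_2.parquet", "input_3.parquet"]:  # placeholder names
    df = pl.read_parquet(path)
    print(path, df.shape)
```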

Expected behavior

The data is processed successfully and written to the sink location.

Installed versions

```
--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             macOS-14.6.1-arm64-arm-64bit
Python:               3.12.5 (main, Aug 28 2024, 09:49:22) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:            0.18.2
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:               0.10.0
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                2.1.0
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:
pyiceberg:
sqlalchemy:
torch:                2.4.0
xlsx2csv:
xlsxwriter:           3.2.0
```
coastalwhite commented 2 months ago

Can you check on 1.6.0?
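
A quick way to confirm which version is actually loaded before re-running the repro (the upgrade command itself is just the usual pip route, nothing specific to this issue):

```python
# After `pip install -U "polars==1.6.0"`, confirm the interpreter picks it up.
import polars as pl

print(pl.__version__)   # should print 1.6.0
pl.show_versions()      # full environment report, same format as "Installed versions" above
```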

Baukebrenninkmeijer commented 2 months ago

It does not happen when I explicitly pass a schema, by the way, even on 1.5. I'll see if I can still reproduce it on 1.6.
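
A sketch of that workaround, with a made-up schema (the real `OLD_SCHEMA` is not shown in the issue): always pass an explicit schema to `scan_csv` instead of letting it be inferred for the non-March directories.

```python
# Hypothetical workaround sketch: supply a schema for every file rather than
# only when 'march' is in the directory name. Column names/types are made up.
import polars as pl

EXPLICIT_SCHEMA = {
    "dt_col1": pl.Int64,
    "dt_col2": pl.Int64,
    "dt_col3": pl.Int64,
    # ... remaining columns of the CSV files
}

csv_df = pl.scan_csv(
    "some_file.csv",              # placeholder path
    has_header=False,
    truncate_ragged_lines=True,
    schema=EXPLICIT_SCHEMA,       # explicit schema; avoids the panic per the comment above
    ignore_errors=True,
)
```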