pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

"Chunk requires all its arrays to have an equal number of rows" error when writing a newly loaded parquet after adding a column #12679

Open mmcdermott opened 7 months ago

mmcdermott commented 7 months ago

Checks

Reproducible example

I have a very large dataframe (200M rows) with columns (`timestamp` and `subject_id`, among others) which is stored in a parquet file. I can read in the parquet file fine via `df = pl.scan_parquet(df_filepath).collect()`.

I can then re-write the dataframe via `df.write_parquet(df_filepath.parent / 'rewrite_test.parquet', use_pyarrow=True)` and everything is fine.

But if I add a column via `filter_df = df.with_columns((pl.col('timestamp').is_not_null() & pl.col('subject_id').is_not_null()).alias('_filter'))` and then try to write that dataframe (`filter_df.write_parquet(df_filepath.parent / 'with_filter_col.parquet', use_pyarrow=True)`), I get a panic exception stating that the `Chunk` requires all its arrays to have an equal number of rows.

Log output

No response

Issue description

See above.

Expected behavior

I expect it to be able to write the dataframe.

Installed versions

```
Polars: 0.18.15
python: 3.11.4
```
mmcdermott commented 7 months ago

I've discovered that this error also occurs if you run `df.head(len(df)) == df`, even on the originally loaded df. It then panics with `assertion failed: (left == right)`, with `left` being the length of the dataframe and `right` being a much larger number.

mmcdermott commented 7 months ago

Ok, I think this is because the dataframe at one point (before a prior filtering step) may have had more rows than are expressible with a uint32? The much larger number in the prior comment is ~4.5T. Is something internal to Polars storing row counts as uint32s?
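For scale, a quick check (using the ~4.5T figure from the comment above, rounded) shows that value is far beyond what a uint32 index can represent, which is consistent with the overflow theory:

```python
# A UInt32 index can address at most 2**32 - 1 rows.
UINT32_MAX = 2**32 - 1  # 4_294_967_295 (~4.2 billion)
reported = 4_500_000_000_000  # the ~4.5T value mentioned above (approximate)

print(reported > UINT32_MAX)  # True: such a count overflows a uint32 index
print(reported % 2**32)       # what the count would wrap to if truncated
```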

cmdlineluser commented 7 months ago

https://github.com/pola-rs/polars#going-big

Do you expect more than 2^32 (~4.2 billion) rows? Compile Polars with the `bigidx` feature flag.

Or, for Python users: `pip install polars-u64-idx`

mmcdermott commented 7 months ago

That is helpful, @cmdlineluser, but at the point where this dataframe was being used in my script, it already had well below 4.2B rows, so I don't see why this should be required for my immediate use case.

ritchie46 commented 7 months ago

Can you add a repro? Then we can fix it.

mmcdermott commented 7 months ago

I will try to add an MWE after NeurIPS next week; unfortunately, I can't share the actual example I'm working with, as the data is proprietary, but I'll try to get something soon.
