mmcdermott opened 7 months ago
I've discovered that this error also occurs if you run `df.head(len(df)) == df`, even on the originally loaded `df`. It then panics with `assertion failed: (left == right)`, where `left` is the length of the dataframe and `right` is a much larger number.
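For reference, the exact check that panics looks like this (a sketch; the path is a placeholder, and `df` stands in for the originally loaded frame):

```python
import polars as pl

df = pl.read_parquet("data.parquet")  # placeholder path; any sufficiently large frame

# head(len(df)) should be a no-op, so the element-wise equality against the
# original frame should be all-true. On the affected frame, this comparison
# is what panics with `assertion failed: (left == right)`.
result = df.head(len(df)) == df
```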
Ok, I think this is because the dataframe at one point (before a prior filtering step) may have had more rows than are expressible in a uint32? The much larger number in the prior comment is 4.5T. Is something internal to polars storing row counts as uint32s?
https://github.com/pola-rs/polars#going-big

> Do you expect more than 2^32 (~4.2 billion) rows? Compile polars with the `bigidx` feature flag.
>
> Or, for Python users, install `pip install polars-u64-idx`.
That is helpful, @cmdlineluser, but in the case where this dataframe was being used in my script, it already had well below 4.2B rows, so I don't see why this should be required for my immediate use case.
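For reference, one way to sanity-check the row count against the default u32 index limit (a sketch; the path is a placeholder):

```python
import polars as pl

U32_MAX = 2**32 - 1  # default polars builds use a 32-bit unsigned row index

df = pl.read_parquet("data.parquet")  # placeholder path
print(df.height)  # well below 4.2B in my case

# Only frames beyond this limit should need the bigidx / polars-u64-idx build.
assert df.height <= U32_MAX
```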
Can you add a repro? Then we can fix it.
I will try to add an MWE after NeurIPS next week; unfortunately, I can't share the actual example I'm working with, as the data is proprietary, but I'll try to get something soon.
Checks
[X] I have checked that this issue has not already been reported.
[ ] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
I have a very large dataframe (200M rows) with columns (`timestamp` and `subject_id`, among others) which is stored in a parquet file. I can read in the parquet file fine via `df = pl.scan_parquet(df_filepath).collect()`. I can then re-write the dataframe via `df.write_parquet(df_filepath.parent / 'rewrite_test.parquet', use_pyarrow=True)` and everything is fine.

But if I add a column via `filter_df = df.with_columns((pl.col('timestamp').is_not_null() & pl.col('subject_id').is_not_null()).alias('_filter'))` and then try to rewrite that dataframe (`filter_df.write_parquet(df_filepath.parent / 'with_filter_col.parquet', use_pyarrow=True)`), I get a `PanicException` stating that the Chunk requires all arrays to have an equal number of rows.

Log output
No response
Issue description
See above.
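A condensed sketch of the failing sequence described above, assuming `df_filepath` is a `pathlib.Path` to the (proprietary, ~200M-row) parquet file:

```python
from pathlib import Path
import polars as pl

df_filepath = Path("data.parquet")  # placeholder; the real file has ~200M rows

# Reading and re-writing the frame as-is works fine.
df = pl.scan_parquet(df_filepath).collect()
df.write_parquet(df_filepath.parent / "rewrite_test.parquet", use_pyarrow=True)

# Adding a boolean column and then writing panics with
# "Chunk requires all its arrays to have an equal number of rows".
filter_df = df.with_columns(
    (pl.col("timestamp").is_not_null() & pl.col("subject_id").is_not_null()).alias("_filter")
)
filter_df.write_parquet(df_filepath.parent / "with_filter_col.parquet", use_pyarrow=True)
```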
Expected behavior
I expect polars to be able to write the dataframe.
Installed versions