Open · DeflateAwning opened 5 months ago
Would really appreciate a repro here. This doesn't ring a bell.
The situation seems to be that, depending on the compute environment's configuration (amount of RAM), as the computation runs and the system approaches out-of-memory, it either runs out of memory outright, or hits this bug and raises this error. I've been unable to create a reliable repro so far, but will keep trying.
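In case it helps with reproducing this, one approach (a sketch on my part, not something verified) is to cap the process's address space with the standard-library resource module before running the repro below, so the out-of-memory point becomes more deterministic. The 8 GiB figure is an arbitrary placeholder, and this only works on Linux:

import resource

# Cap the virtual address space (Linux-only). RLIMIT_AS covers mmap-backed
# allocations as well as the heap, so allocations start failing at a
# predictable point instead of depending on how much RAM the machine has.
limit_bytes = 8 * 1024**3  # placeholder value; tune per machine
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))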
Here's the repro I'm working with, but it doesn't fail reliably, and is highly dependent on available RAM:
from pathlib import Path
import polars as pl
import random

# generate 4000 parquets, each with 100k rows
file_count = 4000
(input_folder_path := Path('./temp_RERPO_16227')).mkdir(exist_ok=True, parents=True)
print(f"Writing to: {input_folder_path.absolute()}")

for file_num in range(file_count):
    row_count = 100_000
    df = pl.DataFrame(
        {
            "file_num": pl.Series([file_num] * row_count),
            "random_num": pl.Series([random.randint(1, 100_000) for _ in range(row_count)]),
            "random_num_1": pl.Series([random.randint(1, 100_000) for _ in range(row_count)]),
            "random_num_2": pl.Series([random.randint(1, 100_000) for _ in range(row_count)]),
            "random_num_3": pl.Series([random.randint(1, 100_000) for _ in range(row_count)]),
            "random_num_4": pl.Series([random.randint(1, 100_000) for _ in range(row_count)]),
            "random_num_5": pl.Series([random.randint(1, 100_000) for _ in range(row_count)]),
            'col1': pl.Series(["123", "abc", "xyz"]).sample(row_count, with_replacement=True),
            'col2': pl.Series(["123", "abc", "xyz"]).sample(row_count, with_replacement=True),
            'col3': pl.Series(["123", "abc", "xyz"]).sample(row_count, with_replacement=True),
            'col4': pl.Series(["123", "abc", "xyz"]).sample(row_count, with_replacement=True),
        }
    ).with_row_index("orig_row_number")
    df.write_parquet(input_folder_path / f"in_file_{file_num}.parquet")
    print(f"Made parquet {file_num + 1}/{file_count}")

print(f"Made {file_count:,} parquets. Total size: {sum(f.stat().st_size for f in input_folder_path.glob('*.parquet')):,} bytes")

# then concat them all into one big parquet
(output_folder_path := Path('./temp_RERPO_16227_output')).mkdir(exist_ok=True, parents=True)
output_path = output_folder_path / "out_file.parquet"

dfs = [
    pl.scan_parquet(f)
    for f in input_folder_path.glob("*.parquet")
]
pl.concat(dfs).sort("random_num").sink_parquet(output_path)

print(f"Concatenated {file_count:,} parquets into one big parquet. Total size: {output_path.stat().st_size:,} bytes")
Checks
Reproducible example
My apologies that I don't have a minimum reproducible example, as the failure seems somewhat random.
Log output
Issue description
Calling that function succeeds for 20+ chunks (hundreds of files), and then fails with an error that gives no indication of what went wrong.
The error is:
ComputeError: buffer's length is too small in mmap
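One way to get more context on where this happens (a sketch; the variables are the ones from the repro above) is to enable Polars' verbose engine logging via the POLARS_VERBOSE environment variable before the failing query runs:

import os

# Turn on Polars' verbose engine logging so the sort/sink pipeline prints
# extra diagnostics to stderr before the failure.
os.environ["POLARS_VERBOSE"] = "1"

pl.concat(dfs).sort("random_num").sink_parquet(output_path)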
Expected behavior
The function should run to completion without error.
Installed versions