DrMaphuse opened this issue 11 months ago
It seems like the parquet file sink uses a batched writer under the hood. If I'm understanding correctly, Polars' batched parquet writers don't currently have any row group size setting; instead they just write row groups sized by the DataFrames they receive, which in this case come from the streaming engine.
If the parquet `BatchedWriter` could take a `row_group_size` parameter, that would be very useful in general and would also help resolve this issue. I'm not sure what the best way to implement it would be, though. An intermediate buffer that accumulates rows until the target row group size is reached might work, but it could be imprecise and would also risk running out of memory for very large row group sizes.
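For illustration, here is a minimal user-level sketch of that buffering idea, written on top of pyarrow's `ParquetWriter` rather than Polars' internal Rust writer. The function name `sink_with_row_groups` and its signature are made up for this example, and it carries the same caveat: the buffer can hold up to `row_group_size` rows in memory.

```python
# Hypothetical sketch of the buffering approach: accumulate incoming frames
# until the target row count is reached, then emit exactly one row group.
from typing import Iterable

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq


def sink_with_row_groups(
    frames: Iterable[pl.DataFrame],
    path: str,
    schema: pa.Schema,  # assumed to match the Arrow schema of the frames
    row_group_size: int,
) -> None:
    writer = pq.ParquetWriter(path, schema)
    buffer: list[pl.DataFrame] = []
    buffered_rows = 0
    try:
        for frame in frames:
            buffer.append(frame)
            buffered_rows += frame.height
            # Flush full row groups whenever enough rows have accumulated.
            while buffered_rows >= row_group_size:
                combined = pl.concat(buffer, rechunk=True)
                head = combined.head(row_group_size)
                tail = combined.slice(row_group_size)
                writer.write_table(head.to_arrow(), row_group_size=row_group_size)
                buffer = [tail] if tail.height else []
                buffered_rows = tail.height
        if buffered_rows:
            # The remainder becomes one final, smaller row group.
            writer.write_table(
                pl.concat(buffer, rechunk=True).to_arrow(),
                row_group_size=row_group_size,
            )
    finally:
        writer.close()
```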
I hit this issue and dug a bit: the `arrow` crate works slightly differently, in that writing a `RecordBatch` does not imply creating a row group; instead, that is left to a `.flush()` method. It looks like it wouldn't be very hard to adapt the Polars code to behave the same, or at least split `write()` into a `partial_write()` and a `flush()` operation, so that `write()` preserves its current semantics.
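To make that concrete, a rough sketch of the proposed split is below. It is Python on top of pyarrow purely to illustrate the semantics (the real change would live in Polars' Rust batched writer), and the class and method names are hypothetical.

```python
# Hypothetical illustration of splitting write() into partial_write() + flush():
# partial_write() only buffers data, flush() is what actually closes a row group,
# and write() keeps its current one-call-one-row-group behaviour.
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq


class SplitBatchedWriter:
    def __init__(self, path: str, schema: pa.Schema) -> None:
        self._writer = pq.ParquetWriter(path, schema)
        self._pending: list[pa.Table] = []

    def partial_write(self, frame: pl.DataFrame) -> None:
        # Buffer only; no row group is created here.
        self._pending.append(frame.to_arrow())

    def flush(self) -> None:
        # Emit everything buffered so far as a single row group.
        if not self._pending:
            return
        table = pa.concat_tables(self._pending)
        self._pending = []
        if table.num_rows:
            self._writer.write_table(table, row_group_size=table.num_rows)

    def write(self, frame: pl.DataFrame) -> None:
        # Preserves the current semantics: one call, one row group.
        self.partial_write(frame)
        self.flush()

    def close(self) -> None:
        self.flush()
        self._writer.close()
```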
Still the same issue with Polars 0.20.31.
Having the same issue on Polars 1.9.0.
I am trying to convert a CSV to a Parquet file with only one chunk, for performance purposes, via streaming to limit the maximum memory used.
The same options work for `write_parquet`, so this is a bit surprising.
In [3]: pl.scan_csv(
   ...:     "~/Downloads/input.csv",
   ...:     schema=schema,
   ...:     infer_schema=False,
   ...: ).sink_parquet(
   ...:     "~/Downloads/output.parquet",
   ...:     compression="zstd",
   ...:     row_group_size=sys.maxsize,
   ...:     data_page_size=sys.maxsize,
   ...: )
   ...:
   ...: pl.read_parquet("~/Downloads/output.parquet").n_chunks()
Out[3]: 243
In [4]: pl.scan_csv(
   ...:     "~/Downloads/input.csv",
   ...:     schema=schema,
   ...:     infer_schema=False,
   ...: ).collect().write_parquet(
   ...:     "~/Downloads/output.parquet",
   ...:     compression="zstd",
   ...:     row_group_size=sys.maxsize,
   ...:     data_page_size=sys.maxsize,
   ...: )

In [5]: pl.read_parquet("~/Downloads/output.parquet").n_chunks()
Out[5]: 1
I actually had a go at trying to fix it. Making the API more like arrow-rs was not very hard, but I stopped at the point where I needed to rewrite the row-group metadata generation logic to amend an existing set of metadata rather than generate a whole new one. Otherwise, you end up with row-group metadata that reflects only the last chunk you added to it, which is of course completely wrong.
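One way to sanity-check that kind of change from the Python side is to inspect the row-group metadata of the written file directly; this assumes pyarrow is installed and uses a placeholder path:

```python
# Print the row-group metadata the writer produced; a correct fix should report
# every row group with its true row count, not just the last chunk written.
import pyarrow.parquet as pq

meta = pq.ParquetFile("output.parquet").metadata  # placeholder path
print("row groups:", meta.num_row_groups, "total rows:", meta.num_rows)
for i in range(meta.num_row_groups):
    print(f"row group {i}: {meta.row_group(i).num_rows} rows")
```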
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
No response
Issue description
The `row_group_size` parameter of the `sink_parquet()` function does not appear to have any effect.

Incidentally, the default row groups can also sometimes lead to an inflated footer size (I've seen up to 50GB), which causes issues with some Parquet readers that limit footer size to 16MB.
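For what it's worth, a self-contained sketch along the following lines should show the reported behaviour; the paths, sizes and exact row-group counts here are placeholders and will vary by version and machine.

```python
# Hypothetical repro sketch: write the same data via sink_parquet and
# write_parquet with the same row_group_size, then compare how many row
# groups each file actually contains.
import polars as pl
import pyarrow.parquet as pq

lf = pl.DataFrame({"x": list(range(1_000_000))}).lazy()

lf.sink_parquet("sink.parquet", row_group_size=1_000_000)
lf.collect().write_parquet("write.parquet", row_group_size=1_000_000)

print("sink_parquet:", pq.ParquetFile("sink.parquet").metadata.num_row_groups)
print("write_parquet:", pq.ParquetFile("write.parquet").metadata.num_row_groups)
# Expected: 1 and 1. Reported: the sink_parquet file ends up with many row groups.
```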
Expected behavior
Row groups should match the size passed in the argument.
Installed versions