Open · dhruvyy opened this issue 6 months ago
Just to add a bit more context from some further testing.
I'm running the Polars code from the above plan on a Mac with 96GB of RAM, with a 50M-row input set.
I watch macOS keep it in RAM up to a certain point, then flip it over to the swap partition, which is 90GB; when swap hits 90GB, macOS just kills the process.
I've also set the parquet scan to low_memory=True and cache=False to try to limit RAM usage on the input side. No obvious difference. I also dialed the streaming chunk size down to around 1000 rows and it still blows up.
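(A rough sketch of one way to cap the chunk size, assuming the POLARS_STREAMING_CHUNK_SIZE environment variable is the knob in play here; 1000 matches the value above, and it needs to be set before the query runs:)
import os

# Assumed env var for capping the streaming engine's chunk size; set before running the query.
os.environ["POLARS_STREAMING_CHUNK_SIZE"] = "1000"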
The inner join is a killer, but following https://github.com/pola-rs/polars/issues/14201 and casting the join key columns does stop it blowing up. Further down we do a few selects and then a pl.concat over two of those frames, and that also blows up, whereas sinking to disk instead of concatenating does not (see the sketch below).
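(A minimal sketch of the sink-instead-of-concat approach; the file names and the "amount" column are hypothetical stand-ins for the actual selects:)
import polars as pl

# Hypothetical stand-ins for the two selected frames mentioned above.
lf_a = pl.scan_parquet("output_data/part_a.parquet").select(["owning_account_id", "amount"])
lf_b = pl.scan_parquet("output_data/part_b.parquet").select(["owning_account_id", "amount"])

# Instead of pl.concat([lf_a, lf_b]) feeding further work in memory,
# sink each piece to the same directory...
lf_a.sink_parquet("output_data/combined/a.parquet")
lf_b.sink_parquet("output_data/combined/b.parquet")

# ...and re-scan that directory as a single lazy source afterwards.
combined = pl.scan_parquet("output_data/combined/*.parquet")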
Still on RAM usage: I'm curious because when it fails, it doesn't seem to do much spilling to disk; instead it mostly fills up swap until it hits 90GB and gets killed.
Yeah, so I doubled the input size again, and something like this:
df1 = pl.scan_parquet("output_data/tmpbase.parquet").cast({"owning_account_id": pl.Categorical})
df2 = pl.scan_parquet("output_data/tmpintermediate.parquet").cast({"transaction_id": pl.Categorical})
return df1.join(
    df2,
    on=["owning_account_id", "transaction_id"],
    how="inner",
)
....
result[table].sink_ipc(f"output_data/{table}")
goes boom. I'm not sure what's up, but I can't keep an inner join streaming within the confines of the RAM limits.
Flipping the inner join to a left join on the same dataset doesn't seem to use more than about 20% of available RAM, so I'm surprised an inner join fills 90GB of swap and blows up.
df1 = pl.scan_parquet("output_data/tmpbase.parquet", low_memory=True, cache=False).cast({"owning_account_id": pl.Categorical})
df2 = pl.scan_parquet("output_data/tmpintermediate.parquet", low_memory=True, cache=False).cast({"transaction_id": pl.Categorical})
df1.join(
    df2,
    on=["owning_account_id", "transaction_id"],
    how="left",
).filter(pl.col("transaction_id").is_not_null()).sink_parquet("output_data/table.parquet", maintain_order=False, row_group_size=1000)
This is where I'm at now: a left join followed by a filter to drop the unmatched rows, since it's supposed to be an inner join. It currently sits at about 40% of my RAM rather than OOMing.
@buggtb Thanks for this thread, it was helpful since I was dealing with the same problem: an inner join during streaming filling the whole memory and then being killed. A left join seems to have fixed the problem for now, but I'm also unclear as to why this was happening.
I've run into a very similar issue with streaming mode running out of memory, but on a much simpler query:
pl.scan_parquet("./data/*.parquet", cache=False, low_memory=True).filter(
pl.col("id").is_in(include_ids)
).sink_parquet(output_path)
It's scanning about 110GB, across ~2000 parquet files ranging in size from a few MB to 1GB, on a machine with 64GB RAM. It gets most of the way through before the memory spikes significantly and then crashes.
I didn't see any noticeable difference in behaviour with or without low_memory, but setting POLARS_MAX_THREADS=12 (down from the machine default of 24) allows it to run to completion. I'm relatively new to Polars, so I'm not sure how expected/well-known this behaviour is, but I'm guessing it's hitting a cluster of the larger files and decompressing too many in parallel?
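(For anyone else trying this, a minimal sketch of pinning the thread count from Python; the variable has to be set before polars is imported, and 12 is just the value that worked here:)
import os

# Must be set before polars is imported, otherwise the default thread pool is already created.
os.environ["POLARS_MAX_THREADS"] = "12"

import polars as pl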
@gtebbutt this seems to be lazy but not streaming, meaning that in the end this will still try to load all matching files into memory. Could you add '.collect(streaming=True)' before sink_parquet and try again?
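(A sketch of what that suggested change could look like; note that collect() yields an eager DataFrame, so the result is written with write_parquet rather than sink_parquet, and the id list and output path are placeholders:)
import polars as pl

include_ids = ["a", "b"]          # placeholder for the real id list
output_path = "filtered.parquet"  # placeholder output path

# Run the query through the streaming engine eagerly, then write the materialised result.
df = (
    pl.scan_parquet("./data/*.parquet", cache=False, low_memory=True)
    .filter(pl.col("id").is_in(include_ids))
    .collect(streaming=True)
)
df.write_parquet(output_path)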
I get the same issue. This may not be required since it runs in Rust (no GC), but I know that in Python a manual garbage-collection step is needed when streaming over Arrow datasets, otherwise memory overflows. Is something similar needed here when running the Python Polars library?
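(A hypothetical illustration of the Arrow-dataset pattern described above: pull one record batch at a time and force a collection between batches.)
import gc

import polars as pl
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
for batch in dataset.to_batches():
    df = pl.from_arrow(batch)
    # ... process df here ...
    del df
    gc.collect()  # explicitly release the batch before reading the next one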
I watch macOS keep it in RAM up to a certain point, then flip it over to the swap partition, which is 90GB; when swap hits 90GB, macOS just kills the process.
I observe the same behavior on my device when doing:
pl.scan_csv('source.csv').sink_parquet('dest.parquet')
I don't know why 90GB is so special.
Checks
Reproducible example
Not possible to provide.
Log output
No response
Issue description
I have a pipeline that is entirely streaming compliant, as confirmed by running the .explain(streaming=True) method. Large datasets are run against this pipeline and the process is simply killed due to OOM issues, which by definition should not happen.
Data points
Questions
Expected behavior
The pipeline should run end to end (albeit slowly) and not run into OOM issues, especially since the streaming plan confirms everything is streaming compliant.
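(For reference, a minimal sketch of the kind of check described above; the scan path and filter are hypothetical:)
import polars as pl

# If every node of the printed plan sits under the STREAMING marker,
# the whole query should be eligible for the streaming engine.
lf = pl.scan_parquet("output_data/tmpbase.parquet").filter(pl.col("transaction_id").is_not_null())
print(lf.explain(streaming=True))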
Installed versions