Dekermanjian opened 6 months ago
Can you show log output?
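(As a hedged aside, not from the thread: verbose logging can usually be enabled like this before running the query, so the engine reports what it is doing.)

```python
import polars as pl

# Turn on Polars' verbose logging; alternatively, set the
# POLARS_VERBOSE=1 environment variable before the process starts.
pl.Config.set_verbose(True)
```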
@ritchie46 thank you for your response. I am experiencing something weird where the log output in a bash shell only says:
Terminated
If I run it in a Jupyter notebook then it just says that the kernel crashed.
After trying to troubleshoot it yesterday, I think the streaming may actually be working but the data is just too large. The number of records in the table is 3,066,869,440. The problem is that this is a Delta table that I am trying to read using scan_parquet()
so that I can utilize streaming, but doing so makes the data astronomically large because it reads in all historical versions of the data.
Do you have any suggestions or advice on how I can go about streaming these large parquet files using Polars? Is this beyond the scope of Polars?
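(A hedged sketch of one possible workaround, not proposed in the thread: if the directory is a Delta table, pl.scan_delta reads only the current snapshot via the Delta transaction log, instead of every historical parquet file a glob would pick up. It requires the deltalake package; the path and filter column below are placeholders.)

```python
import polars as pl

# Placeholder path to the Delta table root; requires the `deltalake` package.
# scan_delta resolves the current table version from the transaction log,
# so superseded historical files are not read.
lf = pl.scan_delta("path/to/delta_table")

# Hypothetical filter column/value. Whether this runs through the streaming
# engine depends on the Polars version; streaming=True is best-effort.
df = lf.filter(pl.col("event_date") >= "2024-01-01").collect(streaming=True)
```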
Currently we are not there yet. We are working on a new engine to be able to cope with that. Please give us some time. ;)
Thank you @ritchie46, I am excited to see the new engine when it is ready and available!
Just wondering if there's been any update to the streaming functionality, as I saw several other scan_parquet issues merged in the past twelve months or so, and am hoping improvements to streaming might have made it in :)
It seems like streaming does not actually stream in a LazyFrame from a parquet source.
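(One hedged way to check this, assuming a Polars version that accepts the streaming flag on explain: the printed plan marks which sections will run under the streaming engine. The glob path and filter below are placeholders.)

```python
import polars as pl

# Placeholder glob and filter; explain(streaming=True) prints the query
# plan with a STREAMING block around the parts that will actually run
# under the streaming engine. Anything outside that block falls back
# to the in-memory engine.
lf = pl.scan_parquet("data/*.parquet").filter(pl.col("id") == 42)
print(lf.explain(streaming=True))
```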
Any update on this, is there an associated feature or ticket we can monitor?
Checks
Reproducible example
Unfortunately, it is not quite straightforward to provide a copy-pasteable example. A faithful simulation of the problem requires several very large parquet files.
Log output
No response
Issue description
I am using scan_parquet() to scan several parquet files that exist in a directory. The data is ~35M records. Before collecting the LazyFrame, I am trying to filter it down. The resulting lazyframe.explain() output suggests that the filtering should run in streaming fashion. The filter should produce a dataframe with ~17K records, yet I keep running out of memory on a 64GB RAM machine. I have tried setting pl.Config.set_streaming_chunk_size() to a low value of 100, but I am still running out of memory. I am really lost and not sure what is going on. I have tried several versions of Polars, including the latest as of today, but all of them ran out of memory. Any suggestions/help is appreciated.
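(A hypothetical reconstruction of the pipeline described above; the glob path, column name, and filter values are placeholders, not taken from the report.)

```python
import polars as pl

# The low chunk size tried in the report; reduces how many rows each
# streaming batch holds in memory at once.
pl.Config.set_streaming_chunk_size(100)

lf = (
    pl.scan_parquet("data/*.parquet")             # ~35M records across several files
    .filter(pl.col("user_id").is_in([1, 2, 3]))   # placeholder filter, expected ~17K rows
)

# Collecting with the streaming engine is the step that ran out of
# memory on a 64GB machine.
df = lf.collect(streaming=True)
```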
Expected behavior
I expect to filter the data in a streaming fashion, avoiding out-of-memory crashes.
Installed versions