[Closed] jmakov · closed 7 months ago
It seems we cannot finish the query because there isn't enough disk space; the data we sort must be stored somewhere.
Thanks for the quick response, makes sense. However, I'm wondering why, even when I don't select the sorted column in the resulting query, `/tmp/polars/sort` takes a lot of space, e.g. 25 GB. The sorted column itself takes only 570 MB: `source.select("timestamp").collect(streaming=True).estimated_size("mb")` returns 570. It looks to me like, when we sort by one column, the data of the whole DataFrame is written somewhere and then sorted, making it impossible to sort larger-than-memory data. Or am I missing something?
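(For context, not Polars' actual implementation: a generic external merge sort in plain Python illustrates why a sort spill can dwarf the key column. Sorted runs are spilled to temp files and streamed back through a k-way merge, and each run file stores the *whole* row, because the payload has to travel with the sort key. The function name and chunk limit below are invented for illustration.)

```python
import ast
import heapq
import os
import tempfile

def external_sort(rows, key, max_rows_in_memory):
    """Sort rows by `key`, spilling sorted runs to disk, then k-way merge.

    Each run file stores the entire row, not just the sort key; this is
    one reason a sort spill directory can be far larger than the key
    column alone.
    """
    run_files = []
    chunk = []

    def flush_run():
        # Sort the in-memory chunk and spill it to a temp file.
        if not chunk:
            return
        chunk.sort(key=key)
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        for row in chunk:
            f.write(repr(row) + "\n")
        f.seek(0)
        run_files.append(f)
        chunk.clear()

    for row in rows:
        chunk.append(row)
        if len(chunk) >= max_rows_in_memory:
            flush_run()
    flush_run()

    def read_run(f):
        # Stream rows back from a spilled run, one at a time.
        for line in f:
            yield ast.literal_eval(line)

    merged = list(heapq.merge(*(read_run(f) for f in run_files), key=key))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged
```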
If that's the case, i.e. the whole DataFrame needs to be held for the sort, how would one then approach sorting larger-than-memory DataFrames, especially when you do e.g. `sum_horizontal` and require only one resulting column (which fits into memory)? Is there a way to make sure `scan_parquet` reads partitions in sorted order, so that `sort("timestamp")` wouldn't need to be called?
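(On the last question: I'm not aware of a Polars flag that guarantees this, but if each partition is already sorted by timestamp, a streaming k-way merge avoids a global sort entirely. A minimal stdlib sketch, with invented partition data standing in for parquet files written in time order:)

```python
import heapq

# Three "partitions", each already sorted by timestamp (the first field).
part_a = [(1, "a"), (4, "d"), (7, "g")]
part_b = [(2, "b"), (5, "e")]
part_c = [(3, "c"), (6, "f")]

# heapq.merge consumes one row per partition at a time, so memory stays
# bounded by the number of partitions rather than the total row count,
# and no global sort pass is needed.
merged = list(heapq.merge(part_a, part_b, part_c, key=lambda row: row[0]))
```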
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
Traceback:
Issue description
The query executes fine for about 1 month of data. If more data is selected, the above panic happens (even with `low_memory=True`). Sometimes, even when it doesn't panic, I see a lot of data being written to `/tmp/polars/sort`.
Expected behavior
According to the docs, this should work.
Installed versions