hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

[query] Extremely large jobs often run out of memory on the driver #14584

Open patrick-schultz opened 5 months ago

patrick-schultz commented 5 months ago

Spark breaks down when a single job has too many partitions. We should modify the implementation of CollectDistributedArray on the Spark backend to automatically split any job whose partition count exceeds some threshold into a few sequential smaller jobs. This would have a large impact on groups like AoU, who run Hail on the biggest datasets and currently have to work around this issue by trial and error.
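
A minimal sketch of the chunking idea, assuming a Spark RDD-backed implementation; the helper name `runChunked` and the parameter `maxPartitionsPerJob` are hypothetical, not Hail's actual API:

```scala
// Sketch only (not Hail's implementation): split one large collect-style
// job into sequential smaller Spark jobs over subsets of partitions.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object ChunkedCollect {
  def runChunked[T, U: ClassTag](
    sc: SparkContext,
    rdd: RDD[T],
    f: Iterator[T] => U,
    maxPartitionsPerJob: Int  // assumed threshold; would be configurable
  ): Array[U] = {
    val nPartitions = rdd.getNumPartitions
    val results = new Array[U](nPartitions)
    // Submit one Spark job per chunk of partition indices, one after another,
    // so the scheduler only ever tracks one chunk's tasks at a time.
    (0 until nPartitions).grouped(maxPartitionsPerJob).foreach { chunk =>
      val chunkResults = sc.runJob(rdd, f, chunk)
      chunk.zip(chunkResults).foreach { case (i, r) => results(i) = r }
    }
    results
  }
}
```

Note that this sketch still accumulates every chunk's results into one array on the driver; it only bounds the number of partitions per Spark job. The external-storage idea discussed below would be needed to avoid materializing all results at once.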

chrisvittal commented 3 months ago

Part 1 is #14590; making this the default behavior in some form will be part 2.

chrisvittal commented 1 month ago

Some discussion from 10/7

Maybe use fast external storage to hold job results as they complete and query them afterwards, so that we never materialize all the results on the driver while the job is running.

The call caching framework may help here.
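
A rough sketch of that idea, under the assumption that each task writes its result to a per-partition path on fast storage visible to the driver and all executors, and returns only the path; the names `runToStorage` and `resultDir` are illustrative, and this is not the call caching framework itself:

```scala
// Hedged sketch: tasks persist their results to external storage and return
// only small paths, so the driver never holds the full result set in memory.
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import java.nio.file.{Files, Paths}

object ExternalizedCollect {
  def runToStorage[T](
    sc: SparkContext,
    rdd: RDD[T],
    serialize: Iterator[T] => Array[Byte],
    resultDir: String  // assumed to be fast shared storage
  ): IndexedSeq[String] = {
    val paths = sc.runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => {
      // Each task writes its partition's result and returns only its path.
      val path = s"$resultDir/part-${ctx.partitionId()}"
      Files.write(Paths.get(path), serialize(it))
      path
    })
    // The driver keeps just the list of paths; results are read back from
    // storage on demand when they are actually queried.
    paths.toIndexedSeq
  }
}
```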