Open · patrick-schultz opened 5 months ago
Part 1 is #14590; making this behavior some sort of default will be part 2.
Some discussion from 10/7:
One idea: write job results to fast external storage as they arrive and query them from there, so that we never materialize all of the results in memory while the job is running.
The call caching framework may help here.
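A minimal sketch of the spill-to-storage idea above. All names here (`SpilledResults`, the `part-N.json` layout) are illustrative assumptions, not Hail's actual storage scheme or call caching API; the point is only that each partition's result is written out on completion and read back on demand, so the full result set is never held in memory at once.

```python
import json
import os
import tempfile


class SpilledResults:
    """Hypothetical per-partition result store backed by external storage."""

    def __init__(self, root: str):
        self.root = root

    def write(self, partition_id: int, value) -> None:
        # Spill one partition's result as soon as it completes.
        path = os.path.join(self.root, f"part-{partition_id}.json")
        with open(path, "w") as f:
            json.dump(value, f)

    def read(self, partition_id: int):
        # Query a single partition's result on demand.
        path = os.path.join(self.root, f"part-{partition_id}.json")
        with open(path) as f:
            return json.load(f)


store = SpilledResults(tempfile.mkdtemp())
for pid in range(3):
    store.write(pid, {"partition": pid, "rows": pid * 100})

# Only one partition result is materialized in memory at a time.
print(store.read(1))
```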
Spark breaks down when a job has too many partitions. We should modify the implementation of CollectDistributedArray on the Spark backend to automatically break any job above some partition-count threshold into a few sequential smaller jobs. This would have a large impact on groups like AoU, who run Hail on the biggest datasets and currently have to work around this issue by trial and error.
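The chunking described above can be sketched as follows. This is a hedged illustration, not the actual CollectDistributedArray implementation: `run_partitions` stands in for submitting one Spark job over a set of partitions, and the threshold value is arbitrary.

```python
# Illustrative threshold; the real cutoff would be tuned empirically.
MAX_PARTITIONS_PER_JOB = 4


def run_partitions(partitions):
    # Stand-in for submitting a single Spark job over these partitions
    # and collecting its per-partition results.
    return [p * p for p in partitions]


def collect_distributed_array(partitions, max_per_job=MAX_PARTITIONS_PER_JOB):
    """Collect results, splitting oversized jobs into sequential smaller ones.

    Instead of one Spark job over all partitions, submit ceil(n / max_per_job)
    jobs back to back and concatenate their results in order.
    """
    results = []
    for start in range(0, len(partitions), max_per_job):
        chunk = partitions[start:start + max_per_job]
        results.extend(run_partitions(chunk))  # one Spark job per chunk
    return results


print(collect_distributed_array(list(range(10))))
```

Because the chunks run sequentially and results are concatenated in partition order, the caller sees exactly the same output as a single large job, just with bounded per-job partition counts.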