@ireneisdoomed you are right, we will be storing the shuffle partitions on the primary workers only in EFM mode, as far as I can tell from the documentation :)
The new strategy makes sense to me. I think it is interesting that we only see the issue in eCAVIAR since I consider COLOC to be more optimised (although it is more complex too).
This could have been non-deterministic. eCAVIAR took ~3h to complete, while COLOC, running at the same time, failed to finish even after 4 hours. The initial issue could have been that a fraction of the executors running eCAVIAR were decommissioned, in comparison to the executors running COLOC. It may be that one set of overlaps succeeded while the others failed. Either way, we cannot rely on the cluster in preemptible mode so easily, as any step that runs for a long time can be affected this way. From my experiments, preemption typically starts appearing after ~30 minutes, so any data that is not cached by that point gets lost and has to be recomputed.
Another option to investigate is whether we could identify the offending shuffle and cache or even checkpoint the data just before that stage. Then we would be doing what EFM tries to do, just manually from the code.
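For reference, a minimal PySpark sketch of that manual approach, assuming the DataFrame feeding the expensive shuffle is called `overlaps_df` (an illustrative name, not the actual variable in the step):

```python
from pyspark import StorageLevel
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Checkpoints live on durable storage (GCS here), so they survive the loss of
# preempted secondary workers, unlike shuffle files kept on local disks.
spark.sparkContext.setCheckpointDir("gs://<bucket>/checkpoints/coloc")


def materialise_before_shuffle(df: DataFrame, use_checkpoint: bool = True) -> DataFrame:
    """Persist or checkpoint a DataFrame right before an expensive shuffle stage.

    Checkpointing truncates the lineage, so a lost executor triggers a re-read
    of the checkpoint instead of recomputing all upstream stages.
    """
    if use_checkpoint:
        return df.checkpoint(eager=True)
    # Cheaper alternative: keep a spilled copy on the surviving workers.
    return df.persist(StorageLevel.MEMORY_AND_DISK)


# Hypothetical usage right before the colocalisation joins/aggregations:
# overlaps_df = materialise_before_shuffle(overlaps_df)
```

The trade-off is the extra write to GCS, which only pays off for stages that are expensive to recompute after a preemption.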
My only comment, correct me if I'm wrong, is that we are going to use the same cluster with the same EFM strategy for all ETL jobs, including those whose tasks don't involve heavy shuffling. Do you think removing graceful decommissioning from the other steps will have a significant impact?
This is open to discussion. I am not sure we will benefit at all on the other tasks, since they are small and fast in comparison to colocalisation, but otherwise we would have to maintain two clusters, as EFM can only be set during cluster creation and cannot be updated afterwards.
Graceful decommissioning is there for downscaling the number of workers. As you might guess, the number of primary workers is fixed to 10 in the otg-efm policy, so these workers will always be available, because we need more space for the shuffle partitions. This implies that we cannot downscale the primary workers anyway; otherwise we would still lose the shuffle partitions stored on them. Unfortunately, I am not aware of any way to apply graceful decommissioning to primary and secondary workers separately.
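For what it's worth, in the autoscaling policy API the graceful decommission timeout sits in the policy-wide `yarnConfig`, so it does not look like it can be set per instance group. Below is a hedged sketch of what an `otg-efm`-style policy could look like through the Python client; the project, region, cooldown period and scaling factors are illustrative values, not the actual policy:

```python
from datetime import timedelta

from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "europe-west1"  # placeholder

client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Primary workers are pinned (min == max) so the shuffle data they hold is
# never lost to downscaling; only the preemptible secondary workers scale,
# and graceful decommissioning is disabled (timeout 0), as EFM requires.
policy = {
    "id": "otg-efm",
    "worker_config": {"min_instances": 10, "max_instances": 10},
    "secondary_worker_config": {"min_instances": 0, "max_instances": 100},
    "basic_algorithm": {
        "cooldown_period": timedelta(minutes=2),
        "yarn_config": {
            "graceful_decommission_timeout": timedelta(seconds=0),
            "scale_up_factor": 1.0,
            "scale_down_factor": 1.0,
        },
    },
}

client.create_autoscaling_policy(
    parent=f"projects/{PROJECT}/regions/{REGION}", policy=policy
)
```

Pinning `min_instances == max_instances` for the primary group is what keeps the shuffle data safe from downscaling, while the secondary group keeps all the elasticity.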
Context
Introduction of the new SuSiE credible sets from gwas_catalog (`gs://gwas_catalog_sumstats_susie/credible_set_clean`) resulted in `ColocStep` failures. The job performed on the `otg-etl` cluster took ~4h and did not finish in that time - see job. Most of the error logs trace back to executors getting lost during job execution.
The `otg-etl` cluster uses an autoscaling policy which specifies the ratio of `secondaryWorkers` to `primaryWorkers` to be at most 100:2. To accommodate the lost shuffle partitions, the EFM mode can be utilised.
EFM mode makes the cluster save the shuffle partitions only on primary workers. This means we have to adjust the disk size of the primary workers and effectively change the autoscaling policy, as EFM does not support graceful decommissioning of workers.
Changes
The following tweaks were added to the existing Dataproc cluster setup to accommodate the shuffling operations in the Coloc step (a hedged configuration sketch is included at the end of this description):
- [x] Enable the EFM mode with the `dataproc:efm.spark.shuffle=primary-workers` property
- [x] Increase the number of primary workers in EFM mode from 2 to 10
- [x] Increase the SSD disk size on primary workers to 1TB when running in EFM mode
- [x] Create a new autoscaling policy without graceful decommissioning - `otg-efm`
- [x] Allow for more than the default number of concurrent threads for saving the shuffle data to the primary workers (the default of 16 cores * 2 was adjusted to 50)
- [x] Allow for more peer connections (from the default of 1 to 5)
- [x] Tuning of shuffling with
All of the above comes from reading the documentation on EFM.
- [x] Allow for 3 master nodes to run in parallel to decrease the risk of decommissioning the master node while the job tries to shuffle data to the primary workers (see High Availability mode)
- [x] Adjustments to the `create_cluster` function to accommodate all parameters from `ClusterGenerator`
Additionally, the `prerequisites` in the node configuration were adjusted for the case where tasks are commented out.
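For completeness, here is a rough sketch of how those settings map onto the Airflow `ClusterGenerator` used by `create_cluster`; the project id, machine types and the exact property values are assumptions for illustration, not the final configuration:

```python
from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator

# Illustrative EFM cluster config; project and machine values are placeholders.
efm_cluster_config = ClusterGenerator(
    project_id="my-project",  # placeholder
    autoscaling_policy="projects/my-project/regions/europe-west1/autoscalingPolicies/otg-efm",
    num_masters=3,              # High Availability mode: 3 master nodes
    num_workers=10,             # primary workers hold the shuffle data under EFM
    num_preemptible_workers=0,  # secondary workers are added by the autoscaler only
    master_machine_type="n1-highmem-16",
    worker_machine_type="n1-highmem-16",
    worker_disk_type="pd-ssd",
    worker_disk_size=1000,      # ~1 TB per primary worker for the shuffle data
    properties={
        # Route shuffle data to primary workers (Enhanced Flexibility Mode).
        "dataproc:efm.spark.shuffle": "primary-workers",
        # More server threads for serving shuffle data on primary workers
        # (default is 2 x cores, i.e. 32 on 16-core machines; raised to 50).
        "spark:spark.shuffle.io.serverThreads": "50",
        # More connections per peer when fetching shuffle blocks (default 1).
        "spark:spark.shuffle.io.numConnectionsPerPeer": "5",
    },
).make()
```

Since EFM can only be set at cluster creation, a config like this would define a separate cluster profile rather than an update to `otg-etl`.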