The latest change I've made, and what I am testing at the moment, is to repartition the data into 10,000 parts and persist the dataframe between the 2 groupings. This time I've sent the whole script, to make sure that the suboptimal way in which I've written the unaggregated dataset is not interfering with the latest improvements.
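For context, this is roughly what that change looks like in PySpark. It is only a minimal sketch: the input path, the storage level, and the two example aggregations are assumptions; the partitioning columns are the ones mentioned later in this issue.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: the unaggregated LD dataset
ld = spark.read.parquet("gs://some-bucket/ld_unaggregated")

# Repartition on the grouping keys into 10,000 parts and persist, so the
# shuffled data is reused by both groupings instead of being recomputed
ld = ld.repartition(10_000, "variantId", "tagVariantId", "chromosome").persist(
    StorageLevel.MEMORY_AND_DISK
)

# First grouping (illustrative aggregation only)
by_variant = ld.groupBy("variantId", "chromosome").agg(f.count("*").alias("n_tags"))

# Second grouping reuses the persisted dataframe
by_tag = ld.groupBy("tagVariantId", "chromosome").agg(f.count("*").alias("n_variants"))
```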
After 4 hours, the logs show that the job has recently died. I'll leave the cluster on for 30 minutes in case you want to look at it @mbdebian
[Stage 48:=====================================> (2337 + 88) / 3200]
[Stage 48:=====================================> (2337 + 0) / 3200]
[Stage 48:=====================================> (2337 + 0) / 3200]
Describe the bug
The job that generates an LDIndex fails because the executors silently die. A list of the failing jobs (marked as cancelled) and their logs is available here.
The code for:
Observed behaviour
In all my runs, the job did not complete because the executors died at some point, so the process would sit idle, showing a log similar to this:
I've seen 2 types of errors when accessing the executor logs:

1. Executor disk usage
1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/1/hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ] ;
My understanding is that this basically means the disk used to store intermediate files has filled up. I think I've fixed this (more below).
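For reference, the 90% figure in that message is YARN's NodeManager disk health-checker threshold. A sketch of the corresponding property, assuming it is passed as a Dataproc cluster property at creation time (the 98% value is the one mentioned further down in this issue):

```python
# YARN marks a local dir as unusable once its utilisation exceeds this
# percentage (90% by default), which takes the executor down with it.
# Relaxing it buys headroom for large shuffles at the cost of safety margin.
DATAPROC_PROPERTIES = {
    "yarn:yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "98.0",
}
```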
2. Another one, where I actually don't know what is going on

Some screenshots of the nodes' health status while the cluster was alive:
And the latest error logged on the executor:
What my script is intended to do
What I've done so far
The whole step used to take ~8 hours. To make debugging faster, I split the script into 2 steps based on the tasks described above:
1. I set 2 important parameters following the recommendations in Hail's documentation, which showed improved performance (https://github.com/opentargets/genetics_etl_python/pull/110/commits/cf22242759fdff7bbdda24141d15a104b0bce15d): openCostInBytes (warns the optimiser that the cost of opening a file is high) and maxPartitionBytes (defines the maximum size of a partition when reading data), both set to 50gb. A sketch of these settings is included at the end of this comment.

2. Given the large size of the data, I attempted to repartition the data before the aggregation to optimise performance, partitioning by variantId, tagVariantId, and chromosome. I am pretty sure the executors are failing in this stage. When I set it to redistribute the data into ~4000 partitions, the log said something like [Stage 136:> (0 + 88) / 4000]. My latest test repartitions the data based on those columns into 10,000 partitions.

3. When I was seeing errors in the disk usage of the node, I tried optimising the cluster and the Spark session based on my needs:
   - Because of the local-dirs usable space error, I changed that disk usage threshold to 98%. The script was getting further but still failing. (https://github.com/opentargets/genetics_etl_python/pull/110/commits/34694342dd40630a589620b8bea84fbc968e22ac)
   - The cluster's disks were pd-standard; I explicitly told the cluster to use the SSD. (https://github.com/opentargets/genetics_etl_python/pull/110/commits/df70a0d295846b6da3158d3cb70e4ed19b81d050)

This is a summary of the things I've tried that showed some benefit, even though the job still fails. I would highly appreciate:
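As referenced in point 1 above, here is a minimal sketch of how those two parameters could be set on the session. The builder pattern, the input path, and the exact values are assumptions for illustration; the config keys are the standard Spark SQL ones for the parameters named above (spark.sql.files.openCostInBytes and spark.sql.files.maxPartitionBytes).

```python
from pyspark.sql import SparkSession

FIFTY_GB = 50 * 1024**3  # both parameters set to ~50gb, as described above

spark = (
    SparkSession.builder
    .config("spark.sql.files.openCostInBytes", FIFTY_GB)
    .config("spark.sql.files.maxPartitionBytes", FIFTY_GB)
    .getOrCreate()
)

# Hypothetical input path; repartition on the keys used by the aggregations
ld = spark.read.parquet("gs://some-bucket/ld_unaggregated")
ld = ld.repartition(10_000, "variantId", "tagVariantId", "chromosome")
```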