
LDIndex step fails silently #3037

Closed ireneisdoomed closed 1 year ago

ireneisdoomed commented 1 year ago

Describe the bug

The job that generates an LDIndex fails because its executors die silently. A list of the failing jobs (marked as cancelled) and their logs is available here.

The code for:

Observed behaviour

In all my runs, the job never completed: at some point the executors were dead, so the process sat idle showing a log similar to this:

[Stage 4:>                                                       (0 + 88) / 200]
[Stage 4:>                                                       (0 + 88) / 200]

I've seen 2 types of errors when accessing the executor logs:

1. Executor disk usage: 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/1/hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ]

My understanding is that this basically means the disk used to store intermediate files has filled up. I think I've fixed this (more below).

2. Another one, where I actually don't know what is going on. Some screenshots of the nodes' health status while the cluster was alive: logs

And the latest error logged on the executor: log_exec

What my script is intended to do

One important thing to note is that this step is meant to run just once per data release (every few years). We want something that runs in a reasonable amount of time, but I am not aiming for the most optimal approach.

What I've done so far

  1. The whole step used to take ~8 hours. To make debugging faster, I split the script into 2 steps based on the tasks described above:

    • The part that processes the inputs and writes the unaggregated dataframe. This took ~4h. I was worried about the piece of logic that unions all the dataframes together, but after optimising it, this was not the bottleneck (job).
    • From then on, I've only tried to run the logic that aggregates the data, and this is where I saw the nodes failing.
  2. I set 2 important parameters following Hail's documentation recommendations, which improved performance (https://github.com/opentargets/genetics_etl_python/pull/110/commits/cf22242759fdff7bbdda24141d15a104b0bce15d):

    • openCostInBytes (warns the optimiser that the cost of opening a file is high) and maxPartitionBytes (defines the maximum size of a partition when reading data), both set to 50GB (see the sketch after this list).
  3. Given the large size of the data, I attempted to repartition it before the aggregation to optimise performance. I partitioned by variantId, tagVariantId, and chromosome, and I am pretty sure the executors are failing at this stage. When I redistributed the data into ~4,000 partitions, the log showed something like [Stage 136:> (0 + 88) / 4000]. My latest test repartitions the data on those columns into 10_000 partitions (also shown in the sketch after this list).

  4. When I was seeing errors in the disk usage of the node, I tried tuning the cluster and the Spark session to my needs:
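As referenced in points 2 and 3 above, here is a minimal PySpark sketch of what those two changes could look like. The two session options, the 50GB value and the grouping columns come from the description above; the app name, the input path and the `unaggregated_df` variable are purely illustrative and not the actual code in genetics_etl_python.

```python
from pyspark.sql import SparkSession

# Point 2: tell the optimiser that opening a file is expensive and allow very
# large read partitions (both set to 50GB, per Hail's documentation advice).
spark = (
    SparkSession.builder.appName("ld_index_debug")  # illustrative app name
    .config("spark.sql.files.openCostInBytes", "50g")
    .config("spark.sql.files.maxPartitionBytes", "50g")
    .getOrCreate()
)

# Point 3: redistribute the unaggregated LD dataframe on the grouping keys
# before the heavy aggregation (10_000 partitions in the latest test).
unaggregated_df = spark.read.parquet("gs://some-bucket/ld_unaggregated")  # illustrative path
repartitioned_df = unaggregated_df.repartition(
    10_000, "variantId", "tagVariantId", "chromosome"
)
```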

This is a summary of the things I've tried that showed some benefit, even though the step still fails. I would highly appreciate:

ireneisdoomed commented 1 year ago

The latest change I've made, and what I am testing at the moment, is to repartition the data into 10,000 parts and persist the dataframe between the 2 groupings. This time I've submitted the whole script, to make sure that the suboptimal way I've written the unaggregated dataset is not interfering with the latest improvements.
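For context, this is roughly what that change looks like. The grouping columns come from the earlier comment, but the aggregation functions below are placeholders rather than the real LDIndex logic, and `unaggregated_df` is assumed to be the dataframe written by the first half of the script.

```python
from pyspark import StorageLevel
from pyspark.sql import functions as f

# Repartition on the grouping keys before the heavy shuffle.
ld_df = unaggregated_df.repartition(10_000, "variantId", "tagVariantId", "chromosome")

# First grouping (placeholder aggregation, not the real LDIndex logic).
per_tag = ld_df.groupBy("variantId", "chromosome", "tagVariantId").agg(
    f.max("r").alias("r")
)

# Persist the intermediate dataframe between the two groupings so the second
# aggregation does not recompute the whole lineage.
per_tag = per_tag.persist(StorageLevel.DISK_ONLY)

# Second grouping (placeholder aggregation).
per_variant = per_tag.groupBy("variantId", "chromosome").agg(
    f.collect_list(f.struct("tagVariantId", "r")).alias("ldSet")
)
```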

After 4 hours, the logs show that it has recently died (job). I'll leave the cluster on for 30 minutes in case you want to look at it, @mbdebian.

[Stage 48:=====================================>             (2337 + 88) / 3200]

[Stage 48:=====================================>              (2337 + 0) / 3200]

[Stage 48:=====================================>              (2337 + 0) / 3200]

ireneisdoomed commented 1 year ago

Success!! @d0choa suggested throwing more disk at the job, and it ran in 5h (job) with a 5TB hard drive.

@mbdebian Could you still have a look at the changes I've made to make sure nothing is blatantly stinky? Changes are summarised in the PR and in the comment above.