
LDIndex step fails silently #3037

Closed ireneisdoomed closed 1 year ago

ireneisdoomed commented 1 year ago

Describe the bug

The job that generates an LDIndex fails because its executors die silently. A list of the failing jobs (marked as cancelled) and their logs is available here.

The code for:

Observed behaviour

In all my runs, the job never completed: at some point the executors were dead, so the process sat idle showing a log similar to this:

[Stage 4:>                                                       (0 + 88) / 200]
[Stage 4:>                                                       (0 + 88) / 200]

I've seen 2 types of errors when accessing the executor logs:

1. Executor disk usage: 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/1/hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ]

My understanding is that this basically means the disk used to store intermediate files has filled up. I think I've fixed this (more below).

2. Another one, where I actually don't know what is going on. Some screenshots of the nodes' health status while the cluster was alive: logs

And the latest error logged on the executor: log_exec

What my script is intended to do

One important thing to note is that this step is meant to run just once per data release (every few years). We want something that runs in a reasonable amount of time, but I am not aiming for the most optimal approach.

What I've done so far

  1. The whole step used to take ~8 hours. To make debugging faster, I split the script into 2 steps based on the tasks described above:

    • The part that processes the inputs and writes the unaggregated dataframe. This took ~4h. I was worried about the piece of logic that unions all the dataframes together, but after optimising it, this was not the bottleneck (job).
    • From then on, I've only tried to run the logic that aggregates the data, and this is where I saw the nodes failing.
  2. I set 2 important parameters following Hail's documentation recommendations, which improved performance (https://github.com/opentargets/genetics_etl_python/pull/110/commits/cf22242759fdff7bbdda24141d15a104b0bce15d):

    • openCostInBytes (warns the optimiser that the cost of opening a file is high) and maxPartitionBytes (defines the maximum size of a partition when reading data), both set to 50GB (see the sketch after this list).
  3. Given the large size of the data, I attempted to repartition it before the aggregation to optimise performance. I partitioned by variantId, tagVariantId, and chromosome, and I am pretty sure the executors are failing at this stage. When I redistributed the data into ~4,000 partitions, the log showed something like [Stage 136:> (0 + 88) / 4000]. My latest test repartitions the data on those columns into 10_000 partitions (also shown in the sketch after this list).

  4. When I was seeing errors in the disk usage of the node, I tried tuning the cluster and the Spark session to my needs:
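As referenced in points 2 and 3 above, here is a minimal PySpark sketch of what those two changes could look like. The two session options, the 50GB value and the grouping columns come from the description above; the app name, the input path and the `unaggregated_df` variable are purely illustrative and not the actual code in genetics_etl_python.

```python
from pyspark.sql import SparkSession

# Point 2: tell the optimiser that opening a file is expensive and allow very
# large read partitions (both set to 50GB, per Hail's documentation advice).
spark = (
    SparkSession.builder.appName("ld_index_debug")  # illustrative app name
    .config("spark.sql.files.openCostInBytes", "50g")
    .config("spark.sql.files.maxPartitionBytes", "50g")
    .getOrCreate()
)

# Point 3: redistribute the unaggregated LD dataframe on the grouping keys
# before the heavy aggregation (10_000 partitions in the latest test).
unaggregated_df = spark.read.parquet("gs://some-bucket/ld_unaggregated")  # illustrative path
repartitioned_df = unaggregated_df.repartition(
    10_000, "variantId", "tagVariantId", "chromosome"
)
```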

This is a summary of the things I've tried that showed some benefit, even though the step still fails. I would highly appreciate:

ireneisdoomed commented 1 year ago

The latest change I've made, and what I am testing at the moment, is to repartition the data into 10,000 parts and persist the dataframe between the 2 groupings. This time I've submitted the whole script, to make sure that the suboptimal way I've written the unaggregated dataset is not interfering with the latest improvements.
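For context, this is roughly what that change looks like. The grouping columns come from the earlier comment, but the aggregation functions below are placeholders rather than the real LDIndex logic, and `unaggregated_df` is assumed to be the dataframe written by the first half of the script.

```python
from pyspark import StorageLevel
from pyspark.sql import functions as f

# Repartition on the grouping keys before the heavy shuffle.
ld_df = unaggregated_df.repartition(10_000, "variantId", "tagVariantId", "chromosome")

# First grouping (placeholder aggregation, not the real LDIndex logic).
per_tag = ld_df.groupBy("variantId", "chromosome", "tagVariantId").agg(
    f.max("r").alias("r")
)

# Persist the intermediate dataframe between the two groupings so the second
# aggregation does not recompute the whole lineage.
per_tag = per_tag.persist(StorageLevel.DISK_ONLY)

# Second grouping (placeholder aggregation).
per_variant = per_tag.groupBy("variantId", "chromosome").agg(
    f.collect_list(f.struct("tagVariantId", "r")).alias("ldSet")
)
```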

After 4 hours, the logs show that it has recently died (job). I'll leave the cluster on for 30 minutes in case you want to look at it, @mbdebian.

[Stage 48:=====================================>             (2337 + 88) / 3200]

[Stage 48:=====================================>              (2337 + 0) / 3200]

[Stage 48:=====================================>              (2337 + 0) / 3200]

ireneisdoomed commented 1 year ago

Success!! @d0choa suggested throwing more disk at the job, and it ran in 5h (job) with a 5TB hard drive.

@mbdebian Could you still have a look at the changes I've made to make sure nothing is blatantly stinky? Changes are summarised in the PR and in the comment above.