opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org

Cluster dies when running big workloads #3154

Closed · tskir closed 7 months ago

tskir commented 7 months ago

As both @d0choa and I experienced, a Dataproc cluster tends to die when running really big workloads. This needs to be fixed.

tskir commented 7 months ago

I have performed a test ingestion run for FinnGen, and there are three significant improvements I can recommend based on that. I'll post them as separate comments for readability.

tskir commented 7 months ago

1. Use a higher-RAM master machine

The type we had previously, n2-standard-4, only has 16 GB of RAM, which is not enough for the stable operation of the main Spark process. I have verified that using a more powerful master machine, n2-highmem-8, allows for very stable execution of a job:

[screenshot: Dataproc cluster monitoring plots for the FinnGen ingestion run]

The red line occasionally spiking in the "YARN NodeManagers" plot is Google Compute Engine removing spot instances. We can see that, despite those perturbations, the job ran successfully to completion.
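
For reference, a minimal sketch of what the cluster creation could look like with the google-cloud-dataproc client; the project ID, region, cluster name and worker settings are placeholders, not our actual configuration:

    from google.cloud import dataproc_v1

    region = "europe-west1"  # placeholder region

    # The client must point at the regional Dataproc endpoint.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",  # placeholder
        "cluster_name": "finngen-ingestion",  # placeholder
        "config": {
            # The key change: a higher-RAM master (64 GB on n2-highmem-8,
            # versus 16 GB on n2-standard-4).
            "master_config": {"num_instances": 1, "machine_type_uri": "n2-highmem-8"},
            # Worker settings below are illustrative only.
            "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-16"},
            "secondary_worker_config": {
                "num_instances": 10,
                "preemptibility": "SPOT",  # spot instances, as in the run above
            },
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready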

tskir commented 7 months ago

2. Split data into a higher number of partitions

In the screenshot we can see that YARN memory and CPU efficiency taper off towards the last third of the job execution; however, the nodes aren't being removed by autoscaling.

This happens because a node isn't removed by autoscaling while some partitions are still being processed on it. And for FinnGen, by default 1 partition = 1 input file, which is quite large and takes around 1.5 hours to process.

We can improve this by manually repartitioning the data into a higher number of partitions, so that each partition takes less time to process and automatic downscaling of the cluster can work more effectively.
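
A minimal sketch of what this could look like (the input path and the target partition count are placeholders, not values we actually use):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder path and partition count.
    input_path = "gs://my-bucket/finngen/summary_stats/*.gz"
    n_partitions = 2000

    # By default each input file becomes one partition, which takes ~1.5 h to
    # process; repartitioning into many smaller partitions lets autoscaling
    # release nodes sooner once their partitions are done.
    df = spark.read.csv(input_path, sep="\t", header=True)
    df = df.repartition(n_partitions)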

tskir commented 7 months ago

3. Do not use `spark_mode: overwrite` even in a development setting

It so happened that shortly before the FinnGen ingestion job was successfully completed, Shelford experienced a power blip and my computer shut down.

This, of course, did not affect the job itself, as it was running on Dataproc.

However, as soon as I turned my computer back on (which was after the job had already completed successfully), Airflow started in the background. For some reason, it didn't recognise that the job it had submitted earlier was successfully completed, and promptly went on to resubmit it to the cluster.

Of course, as soon as the job started, it deleted everything under the designated output folder that had been generated by the previous run :man_facepalming:

This mirrors another incident where @ireneisdoomed and I were running the same part of the code and so the job outputs were overwriting each other.

This leads me to believe that even in a development setting we shouldn't be using `spark_mode: overwrite`. It's better to manually delete the data and then re-run the generation, which prevents incidents like this from occurring.
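
For illustration, a sketch of the safer write (the dataframe and output path are placeholders): leaving Spark's default `errorifexists` mode makes a re-submitted job fail fast instead of wiping the previous run's output.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("example",)], ["value"])  # placeholder dataframe
    output_path = "gs://my-bucket/outputs/finngen"  # placeholder path

    # "errorifexists" is Spark's default save mode: if the output folder already
    # exists, the job fails instead of silently deleting and regenerating it.
    df.write.mode("errorifexists").parquet(output_path)

    # mode("overwrite"), in contrast, deletes everything under output_path
    # before writing -- which is what caused the incident above.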

d0choa commented 7 months ago

The single optimisation in finngen that would significantly improve performance is in the business logic. By providing the schema to the `read.csv` function, you would ensure that Spark doesn't need to go through all the files twice.

    from pyspark.sql.types import StringType, StructField, StructType

    raw_schema: StructType = StructType(
        [
            StructField("#chrom", StringType(), True),
            StructField("pos", StringType(), True),
            StructField("ref", StringType(), True),
            StructField("alt", StringType(), True),
            StructField("rsids", StringType(), True),
            StructField("nearest_genes", StringType(), True),
            StructField("pval", StringType(), True),
            StructField("mlogp", StringType(), True),
            StructField("beta", StringType(), True),
            StructField("sebeta", StringType(), True),
            StructField("af_alt", StringType(), True),
            StructField("af_alt_cases", StringType(), True),
            StructField("af_alt_controls", StringType(), True),
        ]
    )

This can be even more critical for other (bigger) datasets that come as .csv / .tsv.
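
For illustration, a sketch of how the schema above could be applied on read (the input path and the tab separator are assumptions about the FinnGen files):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With an explicit schema, Spark reads the data once instead of first
    # scanning every file to infer the column types.
    summary_stats = spark.read.csv(
        "gs://my-bucket/finngen/summary_stats/*.gz",  # placeholder path
        sep="\t",
        header=True,
        schema=raw_schema,
    )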

DSuveges commented 7 months ago

> The type we had previously, n2-standard-4, only has 16 GB of RAM, which is not enough for the stable operation of the main Spark process. I have verified that using a more powerful master machine, n2-highmem-8, allows for very stable execution of a job:

I have a slight concern with this approach: we might unexpectedly find that the n2-highmem-8 master fails as well, which would imply we need an even bigger machine for that particular job. So the overall process is not robust.

d0choa commented 7 months ago

As far as I understand it, the master's memory is mostly needed to handle the communication with the workers, not so much the workload itself. Although nothing is black and white.

I think we all agree that infrastructure settings that address infrastructure problems are good. Infrastructure changes that are fragile to changes in the workload are not as good.