Closed by tskir 7 months ago
I have performed a test ingestion run for FinnGen, and there are three significant improvements I can recommend based on that. I'll post them as separate comments for readability.
The type we had previously, `n2-standard-4`, only has 16 GB of RAM, which is not enough for the stable operation of the main Spark process. I have verified that using a more powerful master machine, `n2-highmem-8`, allows for a very stable execution of the job:
The red line occasionally spiking in the "YARN NodeManagers" plot is Google Compute Engine reclaiming spot instances. We can see that, despite those perturbations, the job ran successfully to completion.
In the screenshot we can see that YARN memory and CPU efficiency taper off towards the last third of the job execution; however, the nodes aren't being removed by autoscaling.
This happens because a node isn't removed while partitions are still being processed on it. For FinnGen, by default 1 partition = 1 input file, which is quite large and takes around 1.5 hours to process.
We can improve this by manually repartitioning the data into a larger number of partitions, so that each partition takes less time to process and automatic downscaling of the cluster can be more effective.
It so happened that shortly before the FinnGen ingestion job was successfully completed, Shelford experienced a power blip and my computer shut down.
This did not of course affect the job, as it was running on Dataproc.
However, as soon as I turned my computer back on (after the job had already successfully completed), Airflow started in the background. For some reason, it didn't recognise that the job it had submitted earlier had completed successfully, and promptly went on to resubmit it to the cluster.
Of course, as soon as the job started, it deleted everything under the designated output folder which had been generated by the previous run :man_facepalming:
This mirrors another incident where @ireneisdoomed and I were running the same part of the code and so the job outputs were overwriting each other.
This leads me to believe that even in a development setting we shouldn't be using `spark_mode: overwrite`. It's better to manually delete the data and then re-run the generation, which prevents incidents like this from occurring.
The single optimisation in FinnGen that would significantly improve performance is in the business logic. By providing the schema to the `read.csv` function, you ensure that Spark doesn't need to go through all the files twice.
```python
from pyspark.sql import types as t

raw_schema: t.StructType = t.StructType(
    [
        t.StructField("#chrom", t.StringType(), True),
        t.StructField("pos", t.StringType(), True),
        t.StructField("ref", t.StringType(), True),
        t.StructField("alt", t.StringType(), True),
        t.StructField("rsids", t.StringType(), True),
        t.StructField("nearest_genes", t.StringType(), True),
        t.StructField("pval", t.StringType(), True),
        t.StructField("mlogp", t.StringType(), True),
        t.StructField("beta", t.StringType(), True),
        t.StructField("sebeta", t.StringType(), True),
        t.StructField("af_alt", t.StringType(), True),
        t.StructField("af_alt_cases", t.StringType(), True),
        t.StructField("af_alt_controls", t.StringType(), True),
    ]
)
```
This can be even more critical for other (bigger) datasets that come as `.csv` / `.tsv`.
> The type we had previously, `n2-standard-4`, only has 16 GB of RAM, which is not enough for the stable operation of the main Spark process. I have verified that using a more powerful master machine, `n2-highmem-8`, allows for a very stable execution of a job.
I have a slight concern with this approach: we might find that the `n2-highmem-8` master also fails, which would imply we need an even bigger machine for that particular job. So the overall process is not robust.
As far as I understand it, the master's memory is mostly needed to handle communication with the workers, not so much for the workload itself. Although nothing is black or white.
I think we all agree that infrastructure settings that address infrastructure problems are good. Infrastructure changes that are fragile to changes in the workload are not.
As both @d0choa and I experienced, a Dataproc cluster tends to die when running really big workloads. This needs to be fixed.