projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Regression on large datasets using spot instances/preemptible VMs #405

Closed: skhalid7 closed this 5 months ago

skhalid7 commented 3 years ago

Hi

This is more of a Spark question than a Glow question. I'm running Glow 0.6 on Google Cloud Platform. When running regressions on large datasets, each Spark job takes 45+ minutes, which makes transient preemptible workers infeasible. One workaround I've tried is repartitioning the dataset to lower the number of records per partition. However, after writing the DataFrame out as Parquet into smaller (non-empty) partitions and then reading it back in, I end up with fewer partitions than I physically wrote out to storage, defeating the purpose of the repartition step.
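For reference, a minimal PySpark sketch of the round trip I'm describing (paths, the partition count, and the VCF input are hypothetical; the VCF reader assumes Glow is on the classpath). On read, Spark packs small Parquet files together into input splits up to spark.sql.files.maxPartitionBytes, which I suspect is why fewer partitions come back:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: a large multi-sample VCF read via Glow's data source.
df = spark.read.format("vcf").load("gs://bucket/in.vcf.gz")

# Write out many small, non-empty partitions.
df.repartition(20000).write.mode("overwrite").parquet("gs://bucket/out")

# Lowering maxPartitionBytes (default 128 MB) makes Spark build more,
# smaller read partitions instead of coalescing the files back together.
spark.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)  # 16 MB
df2 = spark.read.parquet("gs://bucket/out")
print(df2.rdd.getNumPartitions())
```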

What Spark configurations can help me work around this? Are there any specific recommendations for partition size that you guys have? I'm running a VCF with ~40k samples and ~8M variants. I use n1-standard-4 VMs with 600 GB storage and spark.executor.memory=4g; all other configs are defaults.

Thank you!

williambrandler commented 2 years ago

Hey @skhalid7,

For spot market preemptions, Spark is fault tolerant, so even if you lose VMs the job will still complete. But 45 minutes per partition does seem like a very long time. If repartitioning isn't working, this may require some Spark configuration tuning.

"spark_conf": { "spark.sql.parquet.columnarReaderBatchSize": 100 "spark.sql.execution.arrow.maxRecordsPerBatch": 100 }

These configurations limit the number of rows per batch to 100: columnarReaderBatchSize caps the vectorized Parquet reader's batches, and maxRecordsPerBatch caps the Arrow batches handed to Python UDFs. By default Spark does not expect so much data per row (here each row carries a large genotypes array).
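A sketch of setting these when building the session in PySpark (the values are the ones suggested above, not tuned recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Cap rows per vectorized Parquet reader batch.
    .config("spark.sql.parquet.columnarReaderBatchSize", 100)
    # Cap rows per Arrow batch passed to Python UDFs.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", 100)
    .getOrCreate()
)
```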

We are working on examples of how to run Glow in a streaming fashion using features in Delta Lake. This is the longer-term solution for running large-scale genome-wide analytics: it will enable processing at the level of a partition or an individual variant (record), so spot market losses become less of a concern. A minimal sketch of the general pattern is below.
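This is not the official Glow example (still in progress at the time of this thread), just a minimal sketch of incremental Delta Lake processing with hypothetical table paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Process the variant table incrementally instead of as one long batch job,
# so a preemption only costs the in-flight micro-batch, which is retried
# from the checkpoint rather than restarting the whole analysis.
query = (
    spark.readStream.format("delta")
    .load("/delta/variants")                 # hypothetical input table
    .writeStream.format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/regression")
    .trigger(once=True)                      # drain available data, then stop
    .start("/delta/regression_results")      # hypothetical output table
)
query.awaitTermination()
```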

Also, please upgrade to Spark 3.1 and Glow v1.1.0 so that we can compare results!