projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Set explicit guidelines when running large jobs #459

Open williambrandler opened 2 years ago

williambrandler commented 2 years ago

Spark jobs with many partitions can crash when the driver is undersized, for example with the error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 731259 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

Set explicit cluster-setup guidelines for each type of job.
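As a starting point, one mitigation is to raise the driver's result-size limit and give the driver more memory. A minimal sketch of the relevant Spark configuration is below; the values are hypothetical and would need tuning per job, and reducing the task count (e.g. by repartitioning before a collect) may be preferable to raising the limit:

```
# spark-defaults.conf (or --conf flags) — example values only
spark.driver.maxResultSize   8g      # default is 1g; the failing job hit a 4g limit
spark.driver.memory          16g     # driver must hold serialized task results
spark.sql.shuffle.partitions 4096    # fewer, larger tasks -> less per-task result overhead
```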

These guidelines will come once we have the continuous integration pipeline running on multitask jobs, with a different cluster setup for each use case (ingest vs. ETL vs. regressions, etc.).