populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License
2 stars 0 forks source link

Joint genotyping fixes #773

Closed EddieLF closed 1 month ago

EddieLF commented 1 month ago

This PR adds the --merge-input-intervals flag back in for the GenomicsDB jobs, and leaves it commented out for the GenotypeGVCFs jobs.


Recently we made some changes to this pipeline to allow the GenotypeGVCFs jobs to skip over centromeres and telomeres. This included removing the --merge-input-intervals flag from the commands used by the GenomicsDB and GenotypeGVCFs jobs.

The reason this needed to be removed was because our GVCFs were created by genotyping all intervals, including centromeres and telomeres. So when the jobs "merged input intervals", the final interval list included those regions.

This has inadvertendly caused the GenomicsDB jobs to take an extremely long time to run. See the recent batch: https://batch.hail.populationgenomics.org.au/batches/454669 - the jobs are not even close to completing after several hours. There are also warnings at the top of the logs:

03:18:00.835 WARN  GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
03:18:01.080 WARN  GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.

Contrast this with the previous run, where the GenomicsDB jobs took only a few minutes to run: https://batch.hail.populationgenomics.org.au/batches/444407


I'm not really sure if this change will be enough to prevent the telomeres / centromeres flowing into the GenotypeGVCFs jobs that come after the GenomicsDB ones. One other possible change is to use the --exclude-intervals / -XL flag in the GenotypeGVCFs jobs. This way, we can pass our list of excluded regions (hg38.telomeresAndMergedCentromeres.interval_list) to each GenotypeGVCFs job, hard excluding the regions for every job. Maybe this part is not necessary, and all that's needed is the change to the GenomicsDB command. Throwing it out there for discussion. Link to GATK docs, cmd + f for "--exclude-intervals"

MattWellie commented 1 month ago

Yikes, I guess other than regenerating the gVCFs without the telo/centromeres we'll always be stuck trying to balance input intervals & new intervals?

EddieLF commented 1 month ago

Tentatively saying this has probably fixed most of the issue for exomes. The batch with the flag --merge-input-intervals for GenomicsDB but NOT for GenotypeGVCFs finished in a comparable time with comparable cost to the previous batch which had that flag set for both types of jobs.

It did seem to get a little bit more expensive, not sure if this is because of the additional 21 exomes or something else. I did see in the GenotypeGVCFs job logs, there are a lot of lines that look like this:

GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),5.4504E-5,Cpu time(s),5.394E-5
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0013494870000000001,Cpu time(s),0.0012472829999999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),5.8396E-5,Cpu time(s),5.7501E-5

These lines were not present in the same jobs from the last batch. Not sure where they come from or what they mean.