Closed EddieLF closed 1 month ago
Yikes, I guess other than regenerating the gVCFs without the telo/centromeres we'll always be stuck trying to balance input intervals & new intervals?
Tentatively saying this has probably fixed most of the issue for exomes. The batch with the flag --merge-input-intervals
for GenomicsDB
but NOT for GenotypeGVCFs
finished in a comparable time with comparable cost to the previous batch which had that flag set for both types of jobs.
It did seem to get a little bit more expensive, not sure if this is because of the additional 21 exomes or something else. I did see in the GenotypeGVCFs
job logs, there are a lot of lines that look like this:
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),5.4504E-5,Cpu time(s),5.394E-5
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.0013494870000000001,Cpu time(s),0.0012472829999999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),5.8396E-5,Cpu time(s),5.7501E-5
These lines were not present in the same jobs from the last batch. Not sure where they come from or what they mean.
This PR adds the
--merge-input-intervals
flag back in for theGenomicsDB
jobs, and leaves it commented out for theGenotypeGVCFs
jobs.Recently we made some changes to this pipeline to allow the GenotypeGVCFs jobs to skip over centromeres and telomeres. This included removing the
--merge-input-intervals
flag from the commands used by theGenomicsDB
andGenotypeGVCFs
jobs.The reason this needed to be removed was because our GVCFs were created by genotyping all intervals, including centromeres and telomeres. So when the jobs "merged input intervals", the final interval list included those regions.
This has inadvertendly caused the
GenomicsDB
jobs to take an extremely long time to run. See the recent batch: https://batch.hail.populationgenomics.org.au/batches/454669 - the jobs are not even close to completing after several hours. There are also warnings at the top of the logs:Contrast this with the previous run, where the
GenomicsDB
jobs took only a few minutes to run: https://batch.hail.populationgenomics.org.au/batches/444407I'm not really sure if this change will be enough to prevent the telomeres / centromeres flowing into the
GenotypeGVCFs
jobs that come after theGenomicsDB
ones. One other possible change is to use the--exclude-intervals / -XL
flag in theGenotypeGVCFs
jobs. This way, we can pass our list of excluded regions (hg38.telomeresAndMergedCentromeres.interval_list
) to eachGenotypeGVCFs
job, hard excluding the regions for every job. Maybe this part is not necessary, and all that's needed is the change to the GenomicsDB command. Throwing it out there for discussion. Link to GATK docs, cmd + f for "--exclude-intervals"