populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Add --exclude-intervals h38 telomeres centromeres to genotypegvcfs jobs #780

Closed EddieLF closed 3 weeks ago

EddieLF commented 4 weeks ago

Localises the hg38telomeresandcentromeres.interval_list file for every GenotypeGVCFs job and passes it to the option --exclude-intervals.

In theory, this should ensure that even though we use --merge-input-intervals for the GenomicsDB jobs, we do not traverse the telomeres and centromeres during the GenotypeGVCFs jobs. This should give us the best of both worlds: each GenomicsDB fragment will be created quickly because it merges intervals, and each GenotypeGVCFs fragment will run quickly because it skips over the problematic regions of the genome.
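For illustration, a minimal sketch of how the exclusion list could be threaded into a GenotypeGVCFs command line. This is not the code in this branch: the function name `genotype_gvcfs_cmd` and all file paths are hypothetical, and only the GATK flags (`-R`, `-V`, `-L`, `-O`, `--exclude-intervals`) are real options.

```python
import shlex


def genotype_gvcfs_cmd(
    reference,
    genomicsdb_path,
    output_vcf,
    interval_list,
    exclude_interval_list=None,
):
    """Build a GATK GenotypeGVCFs command string (hypothetical helper).

    When exclude_interval_list is given, it is passed to --exclude-intervals
    so that those regions are skipped within the fragment's own intervals.
    """
    args = [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{genomicsdb_path}",
        "-L", interval_list,
        "-O", output_vcf,
    ]
    if exclude_interval_list is not None:
        # Assumption: the interval list has already been localised to the job.
        args += ["--exclude-intervals", exclude_interval_list]
    return shlex.join(args)


if __name__ == "__main__":
    print(
        genotype_gvcfs_cmd(
            "ref.fasta",
            "fragment_0001",
            "fragment_0001.vcf.gz",
            "fragment_0001.interval_list",
            "hg38telomeresandcentromeres.interval_list",
        )
    )
```

Keeping the exclusion optional means the same helper would cover both the main-branch behaviour and the behaviour proposed here.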


It's unclear to me if this is actually necessary. I did two trial batches:

Using this branch: https://batch.hail.populationgenomics.org.au/batches/455152

Using the main branch: https://batch.hail.populationgenomics.org.au/batches/455159

I went through each GenotypeGVCFs job in both batches, and neither batch ever traverses the telomeres/centromeres. Perhaps this is because the number of genomes is too low, and there is no coverage in those regions when n=3. However, it might also indicate that we're fine with things the way they are, and that the telomeres/centromeres will not be traversed with the current code.

EddieLF commented 4 weeks ago

@cassimons @MattWellie tagging you in for visibility - I don't think we need to merge this yet. I think we should give it a go with the current code and monitor if the GenotypeGVCFs jobs are getting stuck on the centromeres, at which point we could give this code change a try.

cassimons commented 4 weeks ago

In your trial jobs, were the gvcfs called with the black list? If so, we expect this to run fine either way. Our problem is that a heap of our existing gvcfs were called without the black list, so we will hit the same problem we always used to in the centromeres.

EddieLF commented 4 weeks ago

@cassimons good point. Those 3 genomes in the batches from my previous comment had been called with the blacklisted regions.

I reran the test with 5 other genomes, whose gvcf files were created in April 2023, before we were blacklisting any regions.

Using this branch: https://batch.hail.populationgenomics.org.au/batches/455175

Using the main branch: https://batch.hail.populationgenomics.org.au/batches/455178

I didn't thoroughly check every chromosome, but the centromeres don't appear to be traversed.
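A more systematic check could be scripted along these lines. This is only a sketch under assumptions: it expects the excluded regions in Picard-style interval_list text (header lines starting with `@`, then tab-separated contig, 1-based inclusive start and end), and a list of `contig:position` strings collected from the job logs by some other means. The function names and the coordinates in the example are made up for illustration.

```python
from collections import defaultdict


def parse_interval_list(text):
    """Parse Picard-style interval_list text into {contig: [(start, end), ...]}."""
    intervals = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines and the SAM-style header (@HD, @SQ, ...).
        if not line.strip() or line.startswith("@"):
            continue
        contig, start, end = line.split("\t")[:3]
        intervals[contig].append((int(start), int(end)))
    return dict(intervals)


def loci_in_excluded(loci, excluded):
    """Return the loci ('contig:pos') that fall inside any excluded interval."""
    hits = []
    for locus in loci:
        contig, pos = locus.rsplit(":", 1)
        pos = int(pos)
        # interval_list coordinates are 1-based and inclusive at both ends.
        if any(start <= pos <= end for start, end in excluded.get(contig, [])):
            hits.append(locus)
    return hits


if __name__ == "__main__":
    # Toy coordinates, not the real hg38 telomere/centromere positions.
    toy = (
        "@HD\tVN:1.5\n"
        "chr1\t1\t10000\t+\ttelomere\n"
        "chr1\t121700000\t125100000\t+\tcentromere\n"
    )
    excluded = parse_interval_list(toy)
    print(loci_in_excluded(["chr1:5000", "chr1:20000", "chr2:5000"], excluded))
```

An empty result for every job would support the conclusion that the excluded regions are not being traversed.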