sr320 / ceabigr

Workshop on genomic data integration with a emphasis on epigenetic data (FHL 2022)
4 stars 2 forks source link

Increasing threads in GATK? #58

Closed sr320 closed 2 years ago

sr320 commented 2 years ago

As @laurahspencer warned, the following takes a long time (another month)

echo "Calling variants using HaplotypeCaller"
for file in *dedup-split-RG.bam
do
sample="$(basename -a $file | cut -d "." -f 1)"

/home/shared/gatk-4.2.5.0/gatk HaplotypeCaller \
-R /home/shared/8TB_HDD_01/sr320/github/ceabigr/data/Cvirginica_v300.fa \
-I $sample.dedup-split-RG.bam \
-O $sample.variants.g.vcf \
-ERC GVCF
done >> HaplotypeCaller_stout.txt 2>&1

Is there a way I can increase # threads / memory? I found a few options online but just curious if @ksil91 or others have recommendations before I go down the path of installing new stuff.

laurahspencer commented 2 years ago

The MarineOmics website has a good option that I haven't yet tested but have also seen mentioned on various forums - here's the site https://marineomics.github.io/fastq2vcf_pipeline.html and here's the pertinent snippet:

Two of the major variant callers you can use (at the time of writing) are Genome Analysis Toolkit (GATK) and freebayes. You can call variants with either of these software in our pipeline and the downstream steps are the same, so here we present parameters for variant calling with both GATK and freebayes. Regardless of the software used, variant calling is computationally intensive but can be parallelized for more efficient resource use by splitting the reference genome into intervals. The interval creation is automated in our pipeline but parameter guidelines will differ depending on your reference genome.

To call germline single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) via local re-assembly of haplotypes using GATK:

gatk HaplotypeCaller -R {reference.fa} -I {input.bam} -O {output.gvcf} -L {interval.list} --emit-ref-confidence GVCF --min-pruning 2 --min-dangling-branch-length 4

Where -R is the reference genome -I is the input BAM -O is the output gVCF -L is the interval file --emit-ref-confidence yields the reference confidence scores as gVCF --min-pruning sets the minimum support to not prune paths in the graph --min-dangling-branch-length sets the minimum lenght of a dangling branch to attempt recovery

ksil91 commented 2 years ago

Agree about splitting into intervals. I haven't needed to do it myself but it seems to be commonly used. You can also request more threads with this (here asking for 5):

--native-pair-hmm-threads 5

sr320 commented 2 years ago

Which is better, intervals or increasing threads (or both). And for intervals would I just do something like create a bed file of 10kbp windows of entire genome? Most examples I have seen seem to focus on selecting specific regions.

laurahspencer commented 2 years ago

For the interval option, it looks like that script I sent you already preps and uses an interval list for the gatk GenomicsDBImport step (next one after HaplotypeCaller).

# create interval list (just a list of all contigs in genome)
echo "Creating intervals list"
awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' ${REF}/Paralithodes_platypus_genome.fasta.fai > intervals.bed
laurahspencer commented 2 years ago

I haven't gotten to that step, yet, in my current job so can't confirm that it's foolproof - I'd double check that it's doing what you want before using that code.

sr320 commented 2 years ago

FTR increasing threads really does not speed anything up. I have yet to try intervals. I will try just intervalling into chromosomes (~11) and see if that makes a difference.

sr320 commented 2 years ago

note that using intervals in this case is slower... eg

source /gscratch/srlab/programs/scripts/paths.sh

for file in *dedup-split-RG.bam
do
sample="$(basename -a $file | cut -d "." -f 1)"

/gscratch/srlab/programs/gatk-4.1.9.0/gatk HaplotypeCaller \
-R Cvirginica_v300.fa \
-I $sample.dedup-split-RG.bam \
-O $sample.variants.g.vcf \
--native-pair-hmm-threads 12 \
-ERC GVCF
done

is running faster than

for file in ../031722-haplo/*dedup-split-RG.bam
do
sample="$(basename -a $file | cut -d "." -f 1)"

/gscratch/srlab/programs/gatk-4.1.9.0/gatk HaplotypeCaller \
-R ../031722-haplo/Cvirginica_v300.fa \
-L interval.bed \
-I ../031722-haplo/$sample.dedup-split-RG.bam \
-O $sample.variants.g.vcf \
--native-pair-hmm-threads 28 \
-ERC GVCF
done

though as you can see I did not keep threads consistent..