Closed sr320 closed 2 years ago
The MarineOmics website has a good option that I haven't yet tested but have also seen mentioned on various forums - here's the site https://marineomics.github.io/fastq2vcf_pipeline.html and here's the pertinent snippet:
Two of the major variant callers you can use (at the time of writing) are Genome Analysis Toolkit (GATK) and freebayes. You can call variants with either of these software in our pipeline and the downstream steps are the same, so here we present parameters for variant calling with both GATK and freebayes. Regardless of the software used, variant calling is computationally intensive but can be parallelized for more efficient resource use by splitting the reference genome into intervals. The interval creation is automated in our pipeline but parameter guidelines will differ depending on your reference genome.
To call germline single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) via local re-assembly of haplotypes using GATK:
gatk HaplotypeCaller -R {reference.fa} -I {input.bam} -O {output.gvcf} -L {interval.list} --emit-ref-confidence GVCF --min-pruning 2 --min-dangling-branch-length 4
Where -R is the reference genome -I is the input BAM -O is the output gVCF -L is the interval file --emit-ref-confidence yields the reference confidence scores as gVCF --min-pruning sets the minimum support to not prune paths in the graph --min-dangling-branch-length sets the minimum lenght of a dangling branch to attempt recovery
Agree about splitting into intervals. I haven't needed to do it myself but it seems to be commonly used. You can also request more threads with this (here asking for 5):
--native-pair-hmm-threads 5
Which is better, intervals or increasing threads (or both). And for intervals would I just do something like create a bed file of 10kbp windows of entire genome? Most examples I have seen seem to focus on selecting specific regions.
For the interval option, it looks like that script I sent you already preps and uses an interval list for the gatk GenomicsDBImport
step (next one after HaplotypeCaller
).
# create interval list (just a list of all contigs in genome)
echo "Creating intervals list"
awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' ${REF}/Paralithodes_platypus_genome.fasta.fai > intervals.bed
I haven't gotten to that step, yet, in my current job so can't confirm that it's foolproof - I'd double check that it's doing what you want before using that code.
FTR increasing threads really does not speed anything up. I have yet to try intervals. I will try just intervalling into chromosomes (~11) and see if that makes a difference.
note that using intervals in this case is slower... eg
source /gscratch/srlab/programs/scripts/paths.sh
for file in *dedup-split-RG.bam
do
sample="$(basename -a $file | cut -d "." -f 1)"
/gscratch/srlab/programs/gatk-4.1.9.0/gatk HaplotypeCaller \
-R Cvirginica_v300.fa \
-I $sample.dedup-split-RG.bam \
-O $sample.variants.g.vcf \
--native-pair-hmm-threads 12 \
-ERC GVCF
done
is running faster than
for file in ../031722-haplo/*dedup-split-RG.bam
do
sample="$(basename -a $file | cut -d "." -f 1)"
/gscratch/srlab/programs/gatk-4.1.9.0/gatk HaplotypeCaller \
-R ../031722-haplo/Cvirginica_v300.fa \
-L interval.bed \
-I ../031722-haplo/$sample.dedup-split-RG.bam \
-O $sample.variants.g.vcf \
--native-pair-hmm-threads 28 \
-ERC GVCF
done
though as you can see I did not keep threads consistent..
As @laurahspencer warned, the following takes a long time (another month)
Is there a way I can increase # threads / memory? I found a few options online but just curious if @ksil91 or others have recommendations before I go down the path of installing new stuff.