samtools / htslib

C library for high-throughput sequencing data formats
Other
812 stars 446 forks source link

faster variant calling by threading BAQ calculation? #749

Open dkj opened 6 years ago

dkj commented 6 years ago

If the BAQ calculation is a significant compute fraction of a variant calling pipeline can we get a faster turnaround by threading that BAQ calculation? (be it in samtools calmd, samtools mpileup, or bcftools mpileup

jkbonfield commented 6 years ago

It dawned on me we can do a poor mans threading here using calmd | mpileup. Eg:

Orig

$ time bcftools mpileup --threads 2 --fasta-ref $HREF -Ou ~/scratch/data/9827_2#49.1m.bam > /dev/null
[mpileup] 1 samples in 1 input files

real    0m18.121s
user    0m17.889s
sys 0m0.180s

$ time bcftools mpileup --threads 2 --fasta-ref $HREF -Ou -B ~/scratch/data/9827_2#49.1m.bam > /dev/null
[mpileup] 1 samples in 1 input files

real    0m13.964s
user    0m13.717s
sys 0m0.204s

Two threads here is simply to ensure asynchronous I/O and decoding of BAM, but I doubt thats's the bottleneck in bcftools anyway. This also shows (albeit on a tiny sample) the BAQ is about 1/3rd the cpu cost so that's how much we'd gain at most with threading this part.

Explicit calmd

$ time samtools calmd -@2 -u -E -r ~/scratch/data/9827_2#49.1m.bam $HREF | bcftools mpileup --threads 2 --fasta-ref $HREF -Ou - > /dev/null
[mpileup] 1 samples in 1 input files

real    0m15.031s
user    0m21.517s
sys 0m0.400s

It's not quite as fast, but calmd was running at about 50% CPU and bcftools at 100% so it's managing to remove most of the BAQ overhead into a second process. I checked the md5sum (when in VCF mode) before and after.

jkbonfield commented 6 years ago

Thanks to an office discussion (with @dkj) this brings ideas for the multi-sample bcftools too.

Compare (fake 10x 1 sample, but go with it...):

time bcftools mpileup --threads 2 --fasta-ref $HREF ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam ~/scratch/data/9827_2#49.1m.bam | bcftools call -vmO v -o _1.vcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 10 input files

real    1m23.977s
user    1m34.834s
sys 0m0.772s

vs

time bcftools mpileup --threads 2 --fasta-ref $HREF -Ou <(samtools calmd -E -@2 -r -A -u ~/scratch/data/9827_2#49.1m.bam $HREF)  <(samtools calmd -E -@2 -r  -u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF) <(samtools calmd -E -@2 -r --u ~/scratch/data/9827_2#49.1m.bam $HREF)| bcftools call -vmO v -o _3.vcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 10 input files

real    0m28.917s
user    0m28.682s
sys 0m0.472s

The huge difference now makes me realise my previous test was invalid to some extent as it was shallow data. On deep data BAQ becomes dominant as it's per read, but the mpileup speed scales sub-linearly with read depth (it has a linear component plus a fixed portion, which is dominant at low depths).

So there is potential merit here still; even if it's just to avoid horrendous shell hackery!