samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

bcftools call changing ploidy option gives very different results #2030

Closed rmurray2 closed 8 months ago

rmurray2 commented 8 months ago

I've got some oxford nanopore (ONT) raw reads of a mixed population of about 50 gene variants of a single gene. The exact pipeline was: Mutagenic PCR -> cloning -> colony picking -> colony PCR -> pooling of all PCR products followed by ONT sequencing. The goal is to capture/clone gene variants, and the material is pooled PCR product so prima facie it seems that 1 would be the most appropriate value for the ploidy parameter (Treat all samples as haploid)

I've processed the raw data like so:

minimap2 -ax map-ont ref.fasta out.fasta > alignment.sam samtools sort alignment.sam -o alignment_sorted.sam bcftools mpileup -O b -o res.bcf -f ref.fasta alignment_sorted.sam

First I try running call without specifying ploidy, so the assumption is that all samples are diploid:

bcftools call -m -v -o variants.vcf res.bcf gives what looks like a very high quality variant that I targeted in PCR:

POS: 431
REF: A
ALT: G
QUAL: 221 
INFO: 
DP=39;VDB=0.0741752;SGB=-0.670168;RPB=0.733099;MQB=1;MQSB=1;BQB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=3,26,2,8;MQ=60

If I do the following, however, I get no called variants:

bcftools call -m --ploidy 1 -v -o variants.vcf res.bcf

So in the second case, by treating my samples as haploid I have no variants, whereas if I treat my samples as diploid, the variant appears. To make matters more complicated, if I do the same thing for other genes, sometimes treating samples as haploid gives high quality variants that don't appear when treating samples as diploid. I've been thinking of just taking the union of variants that both haloid and diploid give.

So my question would be: Given my experimental setup, what would the best way to run bcftools call ?

And more generally, how is it possible that treating as haploid/diploid can make the difference between high quality variants appearing or not? Is it because one raw reads file has possibly multiple allelic variants of the same gene, and the program is interpreting those as polyploidy?

pd3 commented 8 months ago

Short answer: I wouldn't use bcftools call --ploidy 1 on such data.

In the example there are 29 reads in support of the reference allele G and 10 in support of the alternate allele T (DP4=3,26,2,8). Forcing a haploid call forces the program to choose which is more likely, A or G? I am not surprised the reference allele was chosen.

In the default calling the program still makes some strong assumptions and asks, what are the odds of sampling 29:10 reads from a diploid genome. For this specific case the odds are still high, however cases like 30:1 would be evaluated as sequencing errors more likely than A/G genotype.

The program was designed for germline variant calling. You can still use the output of mpileup and design your own model to determine whether the alternate bases should be evaluated as errors or as the real thing.