samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
665 stars 240 forks source link

consensus should only use the most frequent alternate allele #1488

Open tolot27 opened 3 years ago

tolot27 commented 3 years ago

I've a vcf entry like:

chr 8083    .   GG  GAA,GA  40008.9 .   AB=0.137227,0.819141;ABP=1627.35,1260.12;AC=1,1;AF=0.5,0.5;AN=2;AO=195,1164;CIGAR=1M1I1X,1M1X;DP=1421;DPB=1521.5;DPRA=0,0;EPP=4.89224,29.8739;EPPR=5.49198;GTI=0;LEN=3,1;MEANALT=4,4;MQM=60,60;MQMR=60;NS=1;NUMALT=2;ODDS=188.157;PAIRED=1,0.993127;PAIREDR=1;PAO=0,0;PQA=0,0;PQR=0;PRO=0;QA=7396,44298;QR=537;RO=14;RPL=113,704;RPP=13.7118,114.076;RPPR=3.63072;RPR=82,460;RUN=1,1;SAF=94,596;SAP=3.55595,4.47287;SAR=101,568;SRF=9;SRP=5.49198;SRR=5;TYPE=complex,snp

Unfortunately, bcftools consensus chooses the first of allele, hence the less frequent one, in this case. But it should use the most frequent one. The parameter -H, --haplotype allows some selection but does not allow to select the most frequent allele. It can only select first/last/longer/shorter/etc. allele but sometimes the most frequent one is the first and sometimes the last. Filtering out low frequent alleles drops the complete line instead of removing one of the alternate alleles.

pd3 commented 3 years ago

There are two problems. First, your VCF uses custom annotations and bcftools does not know which of the alleles is most frequent in the pileup. Second, the program was designed to apply a preprocessed VCF as is, not to make filtering decisions; this would be best done upstream of consensus.

tolot27 commented 3 years ago

Which annotations would bcftools use to calculate the frequency? I've used freebayes to call the variants.

bcftools filter and bcftools view omit complete lines with more than one alternative allele even if just one allele should be filtered out. Hence, my current workflow is: vcfbreakmulti <in.vcf> | bcftools view -i '<filter>' /dev/stdin -Oz -o <out.vcf.gz>; tabix <out.vcf.gz> to preprocess the data. Afterwards, I'm using bcftools consensus which can't read uncompressed vcf files. Hence, i cannot pipe it directly to bcftools.

I'd love to see if bcftools could handle multi-allelic variations correctly.

pd3 commented 3 years ago

The term frequency is used in many meanings and none is considered by bcftools consensus. The tool was written for diploid genomes and in this context it does behave correctly. Anything else it can do is a bonus. To push back a little, why don't you write a simple script to remove the less frequent variant as part of a preprocessing step?

tolot27 commented 3 years ago

To push back a little, why don't you write a simple script to remove the less frequent variant as part of a preprocessing step?

That's what I do with vcfbreakmulti as described above. Unfortunately, it creates one more intermediate file because it cannot be piped to bcftools consensus. Would a PR with an additional choice to --haplotype be welcome?

pd3 commented 3 years ago

I could imagine that we extend the plugin tag2tag to support a new option --RO-AO-to-AD, similarly to the existing option --QR-QA-to-QS. This would add the standard FORMAT/AD annotation reserved by the VCF specification.

Then the bcftools consensus could be extended to support a new switch which would choose the maximum value of a Number=R tag requested by the user. For example, one possible way to run it would be

bcftools consensus --haplotype 'max(INFO/AD)'
bcftools consensus --haplotype 'max(FORMAT/AD)'

or something like that.