samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
666 stars 240 forks source link

bcftools consensus: Incorrect deletion positioning and nearby substitutions placement #1464

Closed LaraFuhrmann closed 1 year ago

LaraFuhrmann commented 3 years ago

Consider the following BAM-file with reference and generate a consensus sequence using the following commands with bcftools version 1.12:

bcftools mpileup \
    --max-depth 1000000 \
    --max-idepth 100000 \
    --annotate INFO/AD \
    --no-BAQ \
    --ignore-overlaps\
    -Ou \
    -f NC_045512.2.fasta \
    deletion_issue.bam \
| bcftools call \
    -mv \
    --keep-alts \
    -Ou \
| bcftools norm \
    -f NC_045512.2.fasta \
    -Oz \
    --output temp.bcf

bcftools index -f temp.bcf

# majority bases
bcftools consensus \
        --output bcftools_majority_consensus.fasta \
        -i "INFO/AD[0]<INFO/AD[*]" \
        --fasta-ref NC_045512.2.fasta \
        --mark-del - \
        temp.bcf

This generates the following deletion and substitutions:

NC_045512.2     18223   .       ATG     AG,A    226.362 .       INDEL;IDV=7736;IMF=0.988374;DP=7827;AD=97,7725,5;VDB=0;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2,
0;AN=2;DP4=74,23,0,7730;MQ=60   GT:PL   1/1:253,255,0,255,255,255

NC_045512.2     18224   .       T       A       189.747 .       DP=43;AD=14,29;VDB=3.90442e-09;SGB=-0.693079;RPB=6.99003e-06;MQB=1;MQSB=1;BQB=0.963647;M
Q0F=0;AC=1;AN=2;DP4=0,14,27,2;MQ=60     GT:PL   0/1:223,0,134
NC_045512.2     18225   .       G       A       228.42  .       DP=7733;AD=13,7715;VDB=0;SGB=-0.693147;RPB=0.531131;MQB=1;BQB=0.999558;MQ0F=0;AC=2;AN=2;
DP4=0,13,0,7715;MQ=60   GT:PL   1/1:255,255,0

together with the consensus sequence from position 18218-18230 (we add the consensus we would expect from looking at the bcf-file):

TAAAAG-AATTA # generated consensus 
TAAAA-AAATTA # expected consensus 

As you can see the deletion is at another position and one of the substitutions was ignored.

pd3 commented 3 years ago

Doesn't the program complain about overlapping variants? The input VCF must be logically consistent, and if ATG is replaced with AG by 18223:ATG>AG, then the T cannot be subsequently modified to A by 18224:T>A. And similarly with 18225:G>A. Although the latter is possible in principle, the program assumes that whatever was asserted in 18223:ATG>AG is correct and all subsequent overlapping records are ignored.

kpj commented 3 years ago

You are correct, bcftools consensus indeed complains with

The site NC_045512.2:18224 overlaps with another variant, skipping...
The site NC_045512.2:18225 overlaps with another variant, skipping...

Does this mean that bcftools call created a logically inconsistent VCF file?

In any case, if we only consider the deletion

NC_045512.2     18223   .       ATG     AG,A    226.362 .       INDEL;IDV=7736;IMF=0.988374;DP=7827;AD=97,7725,5;VDB=0;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2,
0;AN=2;DP4=74,23,0,7730;MQ=60   GT:PL   1/1:253,255,0,255,255,255

we would still expect to consensus sequence to be TAAAA-GAATTA instead of TAAAAG-AATTA (see example 5.1.3 in the VCF spec). Do you agree or would you say that due to "the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn’t retained" this information cannot be obtained from the VCF?

pd3 commented 3 years ago

Yes, the caller could have done a better job here and the output is inconsistent.

Regarding the exact placement of the deletion, we cannot distinguish betweenTAAAA-GAATTA and TAAAAG-AATTA, therefore they are treated as equivalent.

LaraFuhrmann commented 3 years ago

Thank you very much for your response.

Would you agree that our best course of action would be to provide our end-users with the caveat that

- In the case of multiple deletions with the same start position where the longest deletion does not have the highest coverage the deletion placement is ambiguous.
- When multiple variants overlap the same position the one with the first start position is chosen. If they share the same start position the one with the highest coverage is chosen. 
pd3 commented 3 years ago

I am not sure if I understand completely. What you describe is NOT what bcftools consensus does. The program takes the first variant and uses that. One needs to pre-filter the VCF to make consensus blindly apply whatever is left.

LaraFuhrmann commented 3 years ago

What you describe is NOT what bcftools consensus does.

Are you referring to our first or second point?

The program takes the first variant and uses that.

Just to clarify do you mean that not the variant with the highest coverage is chosen?

pd3 commented 3 years ago

To both points. The record that comes first in the VCF is always applied, regardless of coverage.

LaraFuhrmann commented 3 years ago

The record that comes first in the VCF is always applied, regardless of coverage.

By record you are referring to the rows of the vcf-file? If there are two subsitutions in a single row the one with the highest coverage is chosen even if it does not appear first in the ALT-list, right?

Regarding the first point: if there are multiple deletions of different length with the same starting position they are reported in one row in the vcf-file. Is bcftools consensus then calling the first one in the ALT-list?

Regarding the second point: Is there a way to filter for the row with the highest coverage if multiple ones have the same start position?

pd3 commented 3 years ago

Yes, by record I mean a row of a VCF file.

The program is primarily intended to work with the sample fields, FORMAT/GT. Together with -H this allows unambiguous assignment of the allele. If not present, the first allele is applied, again, regardless of depth. Why is that? Consider that the program must be able to work with a bare VCF, without any additional tags present.

Regarding the filtering, we don't have a tool to do this specific task. Closest is bcftools filter --IndelGap which filters nearby indels and leaves the one with higher quality.