Closed LaraFuhrmann closed 1 year ago
Doesn't the program complain about overlapping variants? The input VCF must be logically consistent, and if ATG is replaced with AG by 18223:ATG>AG, then the T cannot be subsequently modified to A by 18224:T>A. And similarly with 18225:G>A. Although the latter is possible in principle, the program assumes that whatever was asserted in 18223:ATG>AG is correct and all subsequent overlapping records are ignored.
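To make the "first record wins" rule concrete, here is a minimal Python sketch. It is a simplified model of the behaviour described in this comment, not the actual bcftools consensus code; the `split_overlaps` helper and the tuple layout are made up for illustration.

```python
from typing import List, Tuple

# Each record is (pos, ref, alt); positions are 1-based, as in VCF.
Record = Tuple[int, str, str]

def split_overlaps(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into (applied, skipped) under a 'first record wins' rule.

    A record is skipped when its POS falls inside the reference span
    of a record that was already applied.
    """
    applied: List[Record] = []
    skipped: List[Record] = []
    last_end = 0  # last reference position covered by an applied record
    # sorted() is stable, so records sharing a POS keep their file order.
    for pos, ref, alt in sorted(records, key=lambda r: r[0]):
        if pos <= last_end:
            skipped.append((pos, ref, alt))
        else:
            applied.append((pos, ref, alt))
            last_end = pos + len(ref) - 1
    return applied, skipped

# The three records discussed in this thread.
records = [(18223, "ATG", "AG"), (18224, "T", "A"), (18225, "G", "A")]
applied, skipped = split_overlaps(records)
print("applied:", applied)
print("skipped:", skipped)
```

With the three records from this thread, only 18223:ATG>AG lands in `applied`; the other two fall inside its reference span 18223-18225 and end up in `skipped`.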
You are correct, bcftools consensus indeed complains with
The site NC_045512.2:18224 overlaps with another variant, skipping...
The site NC_045512.2:18225 overlaps with another variant, skipping...
Does this mean that bcftools call created a logically inconsistent VCF file?
In any case, if we only consider the deletion
NC_045512.2 18223 . ATG AG,A 226.362 . INDEL;IDV=7736;IMF=0.988374;DP=7827;AD=97,7725,5;VDB=0;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2,0;AN=2;DP4=74,23,0,7730;MQ=60 GT:PL 1/1:253,255,0,255,255,255
we would still expect the consensus sequence to be TAAAA-GAATTA instead of TAAAAG-AATTA (see example 5.1.3 in the VCF spec). Do you agree, or would you say that because "the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn’t retained", this information cannot be obtained from the VCF?
Yes, the caller could have done a better job here and the output is inconsistent.
Regarding the exact placement of the deletion, we cannot distinguish between TAAAA-GAATTA and TAAAAG-AATTA, therefore they are treated as equivalent.
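A quick way to see why the two placements cannot be told apart: once the gap character is removed, both alignments spell the same sequence. The short Python check below uses only the two strings quoted in this thread; the `ungap` helper is a made-up name for illustration.

```python
def ungap(aligned: str, gap: str = "-") -> str:
    """Remove gap characters from an aligned sequence."""
    return aligned.replace(gap, "")

expected = "TAAAA-GAATTA"   # placement expected by the reporter
produced = "TAAAAG-AATTA"   # placement shown in the generated consensus

# Both alignments reduce to the same 11-base sequence.
print(ungap(expected))
print(ungap(produced))
print(ungap(expected) == ungap(produced))
```

The gap position is a property of the alignment display only; the resulting sequence is identical in both cases.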
Thank you very much for your response.
Would you agree that our best course of action would be to provide our end-users with the following caveats:
- In the case of multiple deletions with the same start position, where the longest deletion does not have the highest coverage, the deletion placement is ambiguous.
- When multiple variants overlap the same position, the one with the first start position is chosen. If they share the same start position, the one with the highest coverage is chosen.
I am not sure if I understand completely. What you describe is NOT what bcftools consensus does. The program takes the first variant and uses that. One needs to pre-filter the VCF to make consensus blindly apply whatever is left.
What you describe is NOT what bcftools consensus does.
Are you referring to our first or second point?
The program takes the first variant and uses that.
Just to clarify, do you mean that the variant with the highest coverage is not necessarily the one that is chosen?
To both points. The record that comes first in the VCF is always applied, regardless of coverage.
The record that comes first in the VCF is always applied, regardless of coverage.
By record, are you referring to the rows of the VCF file? If there are two substitutions in a single row, the one with the highest coverage is chosen even if it does not appear first in the ALT list, right?
Regarding the first point: if there are multiple deletions of different lengths with the same starting position, they are reported in one row of the VCF file. Does bcftools consensus then apply the first one in the ALT list?
Regarding the second point: is there a way to keep only the row with the highest coverage when multiple rows have the same start position?
Yes, by record I mean a row of a VCF file.
The program is primarily intended to work with the sample fields, FORMAT/GT. Together with -H this allows unambiguous assignment of the allele. If not present, the first allele is applied, again, regardless of depth. Why is that? Consider that the program must be able to work with a bare VCF, without any additional tags present.
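As a rough illustration of the allele choice described here, the Python sketch below picks an allele from a diploid GT string and falls back to the first ALT allele when no genotype is given. This is a simplified model, not the bcftools implementation: `pick_allele` is a made-up helper, "first allele" is read as the first ALT allele, only haplotype choices 1 and 2 are modelled, and missing genotypes (".") are not handled.

```python
from typing import List, Optional

def pick_allele(ref: str, alts: List[str],
                gt: Optional[str] = None, haplotype: int = 1) -> str:
    """Return the allele a consensus step would apply (simplified model).

    gt        -- diploid genotype string such as "1/1" or "0|2", or None
    haplotype -- 1 or 2: which allele of the genotype to use
    """
    if gt is None:
        # Bare VCF without sample columns: take the first ALT allele,
        # regardless of depth or any other annotation.
        return alts[0]
    alleles = [ref] + alts                     # index 0 is REF, 1.. are ALTs
    fields = gt.replace("|", "/").split("/")   # phased or unphased separator
    return alleles[int(fields[haplotype - 1])]

# The record from this thread: REF=ATG, ALT=AG,A, GT=1/1
print(pick_allele("ATG", ["AG", "A"], gt="1/1", haplotype=1))
# The same record treated as a bare VCF row without a genotype
print(pick_allele("ATG", ["AG", "A"]))
```

For the record quoted earlier (ALT AG,A with GT 1/1), both calls give AG: the genotype points at the first ALT allele, and without a genotype the first ALT allele is taken anyway.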
Regarding the filtering, we don't have a tool to do this specific task. Closest is bcftools filter --IndelGap which filters nearby indels and leaves the one with higher quality.
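Since no existing tool does this exact pre-filtering, it would have to be scripted. Below is a minimal Python sketch of the idea asked about above: among rows sharing a start position, keep the one with the highest depth. The `keep_deepest_per_start` helper, the tuple layout, and the example numbers are all invented for illustration; how a depth value is obtained from a real VCF row (for example from a depth tag) is left as an assumption.

```python
from typing import Dict, List, Tuple

# (pos, ref, alt, depth); depth is whatever per-row coverage value you trust.
Row = Tuple[int, str, str, int]

def keep_deepest_per_start(rows: List[Row]) -> List[Row]:
    """Keep, for each start position, the row with the highest depth.

    On ties the row that appears first in the input is kept.
    """
    best: Dict[int, Row] = {}
    for row in rows:
        pos, depth = row[0], row[3]
        if pos not in best or depth > best[pos][3]:
            best[pos] = row
    return [best[pos] for pos in sorted(best)]

# Toy input with invented depths, not taken from the data in this thread.
rows = [
    (100, "ATG", "A", 5),
    (100, "ATG", "AG", 90),
    (105, "C", "T", 40),
]
print(keep_deepest_per_start(rows))
```

The filtered rows would then be written back to a VCF, so that consensus simply applies whatever is left. Note that this sketch only handles rows with an identical start position, not overlaps in general.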
Consider the following BAM-file with reference and generate a consensus sequence using the following commands with bcftools version 1.12:
This generates the following deletion and substitutions:
together with the consensus sequence from position 18218-18230 (we add the consensus we would expect from looking at the bcf-file):
As you can see, the deletion is placed at a different position and one of the substitutions was ignored.