Closed mbhall88 closed 8 months ago
I'm happy to leave them in, but I think the tricky issue might be how to assess these variants.
Using your second example, if the reference sequence is CGTAGT but the correct variant-applied sequence is CGTAACAACACTTGGAAT, a variant caller could represent this in different ways.
It could do it in two variants, like you showed above:
chromosome 3 60f6a01c T TAACAACACTTGG
chromosome 5 a1f621d4 G A
Or it could do it in one variant:
chromosome 3 ab49ec54 TAG TAACAACACTTGGAA
Both are correct, so if you're assessing a variant caller, you'd need to accept both answers.
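To make the equivalence concrete, here is a minimal Python sketch that applies both representations to the reference above and checks that they give the same sequence. The `apply_variants` helper is hypothetical (it is not part of bcftools) and assumes 1-based positions and non-overlapping records.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) variants with 1-based positions to a reference string.

    Assumes the variants do not overlap. They are applied right to left so that
    the coordinates of the remaining (leftward) variants stay valid.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match reference at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


reference = "CGTAGT"
two_variants = [(3, "T", "TAACAACACTTGG"), (5, "G", "A")]
one_variant = [(3, "TAG", "TAACAACACTTGGAA")]

seq_two = apply_variants(reference, two_variants)
seq_one = apply_variants(reference, one_variant)
print(seq_two, seq_one, seq_two == seq_one)
```

An assessment that compares the resulting sequences, rather than the records themselves, would treat both answers as correct.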
This problem applies not just to 'overlapping' variants but to any close variants. And it could get even more complex with groups of close variants. E.g. if you had 3 variants near each other, you could group them in four ways: a,b,c or ab,c or a,bc or abc. The number of possible groupings grows exponentially with more nearby variants: n adjacent variants can be grouped in 2^(n-1) ways.
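As a rough illustration of how quickly the groupings multiply, the sketch below enumerates every way of splitting an ordered run of variants into contiguous groups. The function name is made up for this example, and it only counts groupings of adjacent variants; it ignores the fact that each merged group could itself be written in more than one way.

```python
from itertools import product


def contiguous_groupings(variants):
    """Yield every way to split an ordered list of variants into contiguous groups."""
    n = len(variants)
    if n == 0:
        return
    # Each of the n - 1 gaps between neighbours is either a split (True) or a join (False).
    for cuts in product([True, False], repeat=n - 1):
        groups, current = [], [variants[0]]
        for variant, cut in zip(variants[1:], cuts):
            if cut:
                groups.append(current)
                current = [variant]
            else:
                current.append(variant)
        groups.append(current)
        yield groups


groupings = list(contiguous_groupings(["a", "b", "c"]))
for grouping in groupings:
    print(grouping)
```

With three variants this yields the four groupings listed above; each extra neighbouring variant doubles the count.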
So while I don't think you need to filter out variants because they are 'overlapping', you might want to filter out too-close variants, if that makes assessment easier.
I think the complication you've mentioned here is actually more in the realm of variant assessment. vcfdist (described in this paper) handles this by standardising the variant representation between the truth and query VCFs (See Figure 1b and 3d and Suppl. Fig 2 for good illustrations).
> So while I don't think you need to filter out variants because they are 'overlapping', you might want to filter out too-close variants, if that makes assessment easier.
I'd like to keep close variants in there as this will likely be a good separating factor between the variant callers. And we don't want to make it too easy for them 😁
However, I am beginning to think that excluding these valid overlapping variants might be a good idea. I've just found some more complicated examples where bcftools +remove-overlaps doesn't think they're overlaps, but bcftools consensus does.
chromosome_2 1561582 0cb9eac6 A ATTTCTTTTGATAAGAAAGTATTAAGTG . PASS . GT 1/1
chromosome_2 1561582 4bc9851a A AT . PASS . GT 1/1
So essentially, these can be removed with bcftools norm -d indels, indicating they're kind of the same variant.
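For anyone wanting to spot these in the truth VCF before deciding, here is a small Python sketch that flags indel records sharing the same CHROM and POS. It is only an approximation of the idea behind duplicate removal, not a reimplementation of bcftools: the helper names are hypothetical, and treating any REF/ALT length difference as an indel is a simplifying assumption.

```python
from collections import defaultdict


def is_indel(ref, alt):
    """Treat a record as an indel when REF and ALT differ in length (a simplification)."""
    return len(ref) != len(alt)


def same_position_indels(records):
    """Group indel records by (CHROM, POS) and return the sites with more than one record."""
    by_site = defaultdict(list)
    for chrom, pos, var_id, ref, alt in records:
        if is_indel(ref, alt):
            by_site[(chrom, pos)].append(var_id)
    return {site: ids for site, ids in by_site.items() if len(ids) > 1}


# The two records from the example above: (CHROM, POS, ID, REF, ALT)
records = [
    ("chromosome_2", 1561582, "0cb9eac6", "A", "ATTTCTTTTGATAAGAAAGTATTAAGTG"),
    ("chromosome_2", 1561582, "4bc9851a", "A", "AT"),
]
clashes = same_position_indels(records)
print(clashes)
```

Both insertions are anchored at the same base, so only one of them can be applied to the reference at that site.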
I think this changes my mind and I now think we should remove all of these types of overlaps.
My preference would be to remove them.
Regarding the filtering of the truth VCF, there is a question about whether to filter out compatible "overlapping" variants. This is a question I raised on the bcftools repo (https://github.com/samtools/bcftools/issues/2082).
Essentially, I noticed that in the truth VCF we still have variants like this after running bcftools +remove-overlaps
and
As Petr pointed out in the above linked issue, these aren't, in fact, overlapping.
As a way of illustrating how this is possible, let's reframe the variant for clarity and provide an example FASTA sequence.
bcftools consensus would turn this into
And for the second example
bcftools consensus would turn this into
As mentioned in this comment, the ordering of the variants here is crucial. If you swap the order of the variants in the first example, they then become overlapping variants (see comment). But as we run bcftools +remove-overlaps before running bcftools consensus, we don't need to worry about the ordering.
In our script for generating the truth set and mutated reference I have added a flag that allows us to filter these sorts of positions out if we want. But I am inclined to leave them in as another layer of complexity, to explore how the variant callers handle them.
Does anyone disagree?