mbhall88 / NanoVarBench

Evaluating Nanopore-based bacterial variant calling
https://doi.org/10.1101/2024.03.15.585313
MIT License
15 stars 0 forks source link

Filtering compatible overlapping truth variants #5

Closed mbhall88 closed 8 months ago

mbhall88 commented 9 months ago

Regarding the filtering of the truth VCF, there is a question about whether to filter out compatible "overlapping" variants. This is a question I raised on the bcftools repo (https://github.com/samtools/bcftools/issues/2082).

Essentially, I noticed in the truth VCF we still have variants like this after running bcftools +remove-overlaps.

chromosome      28728   62e866df        T       A       .       PASS    .      GT      1/1
chromosome      28728   e4388ae3        T       TA      .       PASS    .    GT      1/1

and

chromosome      30586   60f6a01c        T       TAACAACACTTGG   .       PASS    .      GT      1/1
chromosome      30588   a1f621d4        G       A       .       PASS    .    GT      1/1

As Petr pointed out in the above linked issue, there aren't, in fact, overlapping.

As a way of illustrating how this is possible, let's reframe the variant for clarity and provide an example fasta sequence

>chromosome
GTC
chromosome      2   62e866df        T       A       .       PASS    .      GT      1/1
chromosome      2   e4388ae3        T       TA      .       PASS    .    GT      1/1

bcftools consensus would turn this into

>chromosome
GAAC

And for the second example

>chromosome
CGTAGT
chromosome      3   60f6a01c        T       TAACAACACTTGG   .       PASS    .      GT      1/1
chromosome      5   a1f621d4        G       A       .       PASS    .    GT      1/1

bcftools consensus would turn this into

>chromosome
CGTAACAACACTTGGAAT

As mentioned in this comment, the ordering of the variants here is crucial. If you swap the order of the variants in the first example, they then become overlapping variants (see comment). But as we run bcftools +remove-overlap before running bcftools consensus, we don't need to worry about the ordering.

In our script for generating the truth set and mutated reference I have added a flag that allows us to filter these sorts of positions out if we want. But I am inclined to leave them in as another layer of complexity to explore how the variant callers handle.

Does anyone disagree?

rrwick commented 9 months ago

I'm happy to leave them in, but I think the tricky issue might be how to assess these variants.

Using your second example, if the reference sequence is CGTAGT but the correct variant-applied sequence is CGTAACAACACTTGGAAT, a variant caller could represent this in different ways.

It could do it in two variants, like you showed above:

chromosome   3   60f6a01c   T     TAACAACACTTGG
chromosome   5   a1f621d4   G     A

Or it could do it in one variant:

chromosome   3   ab49ec54   TAG   TAACAACACTTGGAA

Both are correct, so if you're assessing a variant caller, you'd need to accept both answers.

This problem applies not just to 'overlapping' variants but any close variants. And it could get even more complex with groups of close variants. E.g. if you had 3 variants near each other, you could group them in four ways: a,b,c or ab,c or a,bc or abc. And the number of possible groupings will grow enormously with more near variants.

So while I don't think you need to filter out variants because they are 'overlapping', you might want to filter out too-close variants, if that makes assessment easier.

mbhall88 commented 9 months ago

I think the complication you've mentioned here is actually more in the realm of variant assessment. vcfdist (described in this paper) handles this by standardising the variant representation between the truth and query VCFs (See Figure 1b and 3d and Suppl. Fig 2 for good illustrations).

So while I don't think you need to filter out variants because they are 'overlapping', you might want to filter out too-close variants, if that makes assessment easier.

I'd like to keep close variants in there as this will likely be a good separating factor between the variant callers. And we don't want to make it too easy for them 😁

However, I am beginning to think that excluding these valid overlapping variants might be a good idea. I've just found some more complicated examples where bcftools +remove-overlaps doesn't think they're overlaps, but bcftools consensus does.

chromosome_2    1561582 0cb9eac6        A       ATTTCTTTTGATAAGAAAGTATTAAGTG    .       PASS    .       GT      1/1
chromosome_2    1561582 4bc9851a        A       AT      .       PASS    .       GT      1/1
mbhall88 commented 9 months ago

So essentially, these can be removed with bcftools norm -d indels, indicating they're kind of the same variant.

I think this changes my mind and I now think we should remove all of these types of overlaps.

lachlancoin commented 9 months ago

My preference would be to remove them.