samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

Remove overlaps not working as expected #2082

Open mbhall88 opened 6 months ago

mbhall88 commented 6 months ago

I'm not sure if I am misunderstanding how bcftools +remove-overlaps works, but I still have some overlapping variants in the output afterwards

e.g.,

chromosome      28728   62e866df        T       A       .       PASS    .      GT      1/1
chromosome      28728   e4388ae3        T       TA      .       PASS    .    GT      1/1

and

chromosome      30586   60f6a01c        T       TAACAACACTTGG   .       PASS    .      GT      1/1
chromosome      30588   a1f621d4        G       A       .       PASS    .    GT      1/1
pd3 commented 6 months ago

These variants are not overlapping. For example, the insertion in the first example inserts A between the positions 28728 and 28729.

mbhall88 commented 6 months ago

Ahhh okay I see what you mean with the first example. But am struggling to understand how the second example isn't overlapping?

pd3 commented 6 months ago

It's exactly the same case. The first happens between 30586 and 30587, 30588 is not affected by the insertion.

mbhall88 commented 6 months ago

Alright, so I did some playing and testing and it seems the order of the variants impacts what bcftools consensus does and how bcftools views them in terms of overlapping variants.

As a way of illustrating, let's reframe the variants for clarity and provide an example fasta sequence

First example:

>chromosome
GTC
chromosome      2   62e866df        T       A       .       PASS    .      GT      1/1
chromosome      2   e4388ae3        T       TA      .       PASS    .    GT      1/1

bcftools consensus would turn this into

>chromosome
GAAC

BUT, if I switch the order of the variants I get

>chromosome
The site chromosome:2 overlaps with another variant, skipping...
GTAC

And indeed, if I run bcftools +remove-overlaps on the first example, with the order swapped, both variants are removed from the VCF.

Second example:

>chromosome
CGTAGT
chromosome      3   60f6a01c        T       TAACAACACTTGG   .       PASS    .      GT      1/1
chromosome      5   a1f621d4        G       A       .       PASS    .    GT      1/1

bcftools consensus would turn this into

>chromosome
CGTAACAACACTTGGAAT

The ordering situation from the first example obviously doesn't apply here given the positions are different.

tl;dr make sure you run bcftools +remove-overlaps before running bcftools consensus

mbhall88 commented 6 months ago

Me again.

I've just run into an example where bcftools +remove-overlaps doesn't deem the variants overlapping, but bcftools consensus does

chromosome_2    1561582 0cb9eac6        A       ATTTCTTTTGATAAGAAAGTATTAAGTG    .       PASS    .       GT      1/1
chromosome_2    1561582 4bc9851a        A       AT      .       PASS    .       GT      1/1
mbhall88 commented 5 months ago

Sorry to be a pest @pd3, not sure if this dropped off your radar (I imagine you get a lot of notifications)?

Is my last example a bug?