nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

[ancestral] VCF REF != reference base resulting in missing variation #1368

Open jameshadfield opened 8 months ago

jameshadfield commented 8 months ago

Current Behavior

This is a pretty fun bug but a bit hard to explain. Thankfully very easy to fix.

When using VCF inputs to augur ancestral we need a VCF file and a reference FASTA file. The expectation when we read this VCF file is that the REF allele matches the corresponding base(s) in the FASTA, but this is not asserted.

The most (I think) commonly used software which produces VCF files from large alignments is not aware of what the reference is, and thus given a position with segregating alleles {X,Y} it (presumably) chooses the most common as the REF allele. We can then end up in a situation for a given position where:

Possible solution

The simple solution, which I'll implement and tag this issue in, is for TreeTime's read_vcf to verify that the REF in the VCF is the same as the corresponding nucleotides in the FASTA.

However we also need a way to go from a large alignment to a VCF with reference bases matching a given FASTA. It shouldn't be hard to write a script to do this, but a bit of a pain.

Real world example

In the Cholera alignment I'm using to debug VCF-related things this occurs for 131 positions in the genome

jameshadfield commented 7 months ago

This was somehow missed during #1355. This TreeTime branch includes a test highlighting this.