nextstrain / auspice

Web app for visualizing pathogen evolution
https://docs.nextstrain.org/projects/auspice/
GNU Affero General Public License v3.0

Large Deletions not Detected/Portrayed in Auspice #1500

Open CCranney opened 2 years ago

CCranney commented 2 years ago

First of all, thank you for this and other NextStrain packages! They are making the process of visualizing my project significantly smoother.

Current Behavior

I'm not sure if this is expected behavior or not, so I thought I'd make an issue.

In Auspice tree views, genomes with large deletions are treated as "more closely related" to the ancestral genome than genomes with single mutations. I named the offending genome (genome 2) "THIS_HAS_A_HUGE_CHUNK_MISSING" in the tree below during my initial investigation and kept the name for visibility. Note that you may not get this exact tree if you run the code: I removed the almost entirely blank genomes to produce this figure, but they are still included in the consensus.fasta in the zipped directory below.

image

Deletions are also portrayed inconsistently in the Diversity figure between the nucleotide and amino acid views. In "NT" view it behaves as the tree does, ignoring large deletions entirely, whereas in "AA" mode it clearly shows deletions all across the protein. Note that the figures below represent parts of viral genomes with an inserted protein of interest, and the virus sometimes removes that protein from its genome.

NOTE: I added the "3' UTR" and "5' UTR" labels manually, so they are not expected in typical augur export outputs.

image

image

Expected behavior

I would imagine that genomes with substantial deletions would be considered a separate branch of the tree, or that the Diversity figure would be consistent in portraying those deletions between nucleotides and amino acids. Let me know if this is expected behavior on either front.

How to reproduce

Steps to reproduce the current behavior:

  1. Run the bash script in the same directory as the provided consensus file and GenBank record (all in the .zip file). Alternatively, I've also included my output from augur export, edited to add the UTR tags; you could just run auspice on that. adam_output.zip

  2. Run auspice view in the given directory and compare the NT and AA views for the Diversity figure.

emmahodcroft commented 2 years ago

Hi @CCranney -- thanks for reporting this and writing it up so clearly!

You're right, there are some inconsistencies in how we handle deletions in Nextstrain. To some extent this reflects the inconsistency in how deletions are represented in the sequences themselves. Unfortunately, it's really not uncommon for missing segments of sequence (due to lack of coverage, rather than a suspected real deletion) to be given as - (deletion). This is found across multiple pathogens, and if it happens in particular sections of the genome, this 'deletion variation' at the nucleotide level ends up completely masking 'real' variation. If it happens often, but in different sections of the genome, the diversity panel can become almost unusable. The compromise we reached was to not show deletions for nucleotides, but to show them for amino acids (so that they are visible to some extent), as you noticed.

It would be nice to have an option within Auspice to toggle whether to display deletions or not in the diversity tree, so people could make this judgement based on how correctly they believe their dataset makes use of - vs N.
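One way to form that judgement is to list where runs of - and N fall in each sequence. The following is a minimal sketch, not part of augur or Auspice: `parse_fasta` and `runs` are hypothetical helper names, and the two short sequences are made-up examples rather than data from this issue.

```python
import re

def parse_fasta(text):
    """Parse FASTA text into {name: sequence} (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def runs(seq, char, min_len=1):
    """Return (start, end) half-open intervals of consecutive `char` in `seq`."""
    pattern = re.escape(char) + "+"
    return [
        (m.start(), m.end())
        for m in re.finditer(pattern, seq, flags=re.IGNORECASE)
        if m.end() - m.start() >= min_len
    ]

# Made-up example input: one sequence masked with N, one with a gap run.
example = """>genome1
ACGTNNNNACGT
>genome2
ACGT------AC
GT
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, "gap runs:", runs(seq, "-", min_len=5), "N runs:", runs(seq, "N"))
```

If long - runs line up with regions of known poor coverage, they are probably missing data rather than real deletions; that interpretation still has to come from knowledge of how the consensus sequences were generated.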

Regarding the branch lengths/diversity, many treebuilders treat - the same as N or other missing-data characters; this is the case for IQTree, RAxML & PhyML (according to the IQTree documentation). I imagine this is also partly because sequences have not been very reliable in their use of - vs N. You could investigate whether other treebuilding programs treat these sites differently!
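To illustrate why a genome with a large deletion can end up looking closer to the ancestor than one with a single mutation, here is a toy sketch. It is a simple count of differing sites, not the likelihood model those treebuilders actually use, and the function name `distance` and the three sequences are invented for illustration.

```python
def distance(a, b, gap_is_missing=True):
    """Count differing sites between two aligned sequences.

    Sites where either sequence has a missing-data character are skipped.
    With gap_is_missing=True, '-' counts as missing data (like N);
    with gap_is_missing=False, '-' is compared like any other character.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    missing = {"N", "?"}
    if gap_is_missing:
        missing.add("-")
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in missing or y in missing:
            continue  # site carries no information for this pair
        if x != y:
            diffs += 1
    return diffs

# Invented 20-base examples.
ancestor     = "ACGTACGTACGTACGTACGT"
point_mutant = "ACGTACGTACTTACGTACGT"  # one substitution
deletion     = "ACGT------------ACGT"  # 12-base deletion

print(distance(ancestor, point_mutant))                    # 1
print(distance(ancestor, deletion))                        # 0
print(distance(ancestor, deletion, gap_is_missing=False))  # 12
```

With gaps treated as missing data, the deleted genome shows zero differences from the ancestor, while the point mutant shows one, which matches the ordering seen in the tree.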