nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License
219 stars 61 forks source link

Lack of consistancy between versions #507

Closed garcia-nacho closed 2 years ago

garcia-nacho commented 3 years ago

Hi,

We have noticed a lack of consistency between Nextclade versions on at least one mutation: The deletion of nucleotides 22029-22034 on previous versions (tested on 0.12.0 and 0.14.4) is translated as: S:E156G, S:F157-, S:R158-

While new versions (tested on 1.2.0) translate it as : S:E156-, S:F157-, S:R158G

This makes the comparison between outputs from different versions quite complicated and might result in wrong conclusions when Nextclade is used for surveillance purposes (e.g. E156G disappearing while R158G appearing). Note that the deletion of 22029-22034 is quite prevalent in the delta lineage.

Moreover, we have found that in presence of dropouts (something quite common in amplicon-based sequencing) overlapping with the beginning of the gene, the new version ignores the mutations on that particular gene while older versions reported all the mutations found. We think that the old behavior is more correct because the dropouts on a particular region of a gene do not invalidate the mutations present in other regions of that gene (i.e. The dropout might be caused by a new mutation overlapping with a primer, not because of low depth)

Best

ivan-aksamentov commented 3 years ago

Hi @garcia-nacho,

Nextclade version 1.x is using different alignment algorithm (we call it Nextalign and there is a dedicated CLI tool with the same name). Nextclade is a very young project and we keep developing it. The virus is evolving and keeps finding new tricks. It's a moving target and we have to adapt software constantly to keep the analysis results practically useful on real-world data, which is rarely perfect.

We use semantic versioning and the major version increment from 0.14.4 to 1.0.0 means there are breaking changes, as described in the changelog. In particular, the new alignment algorithm is informed by codon boundaries, and in presence of ambiguous alignment options, chooses the one that we think makes more sense in practice. Hence the change in results (the gap position changed) you observed.

We are thinking how we can help people to make their experiments more reproducible, but considering the cat & mouse game virus is imposing on us, we don't have any good solutions. But here is a couple of thoughts:

This problem is not unique to Nextclade, as other tools struggling with SARS-CoV-2 pecularities as well. For example, the current problem everybody is trying to tackle is the frame shift and premature stop codon in some of the recent strains (one, two). So stay alert.

We are happy to make Nextclade more useful for surveillance folks. Let us know if you have ideas. Let's discuss here, on Nextstrain discussion forum (https://discussion.nextstrain.org/), or on ubioinfo Slack (microbial-bioinfo.slack.com).


You can find the changelog for every version on Releases page: https://github.com/nextstrain/nextclade/releases (scroll down for version 1.0.0, which introduced the change in question)

or on the documentation website in "Recent changes" section: https://docs.nextstrain.org/projects/nextclade/en/latest/changes/index.html

ivan-aksamentov commented 3 years ago

the new version ignores the mutations on that particular gene while older versions reported all the mutations found. We think that the old behavior is more correct because the dropouts on a particular region of a gene do not invalidate the mutations present in other regions of that gene (i.e. The dropout might be caused by a new mutation overlapping with a primer, not because of low depth)

This is likely due to deletions and insertions that cause a frame shift. Nextalign/Nextclade 1.x currently refuses to translate genes with length that is not multiple of 3. And no amino acids means no mutations. Pretty much the same problem as with clade 21H and lineages AY, I linked above. Watch out for warnings Nextclade yields (into terminal in case of CLI version or in tooltips in Nextclade Web).

The reason version 0.x reported mutations in these areas is that it translated the whole genome triplet by triplet without caring about genes or what the result is. This is very naive, and as you can imagine, the mutations after a frame shift detected in this way are pretty much random.

We are dealing with this problem currently. Stay tuned.

If you think it's not the problem I described, please give some more info and provide your sequences if possible, to reproduce and debug. You can find my email here.

Again, happy to discuss improvement options. Also happy to accept code contributions, if you have bandwidth.

corneliusroemer commented 3 years ago

Moreover, we have found that in presence of dropouts (something quite common in amplicon-based sequencing) overlapping with the beginning of the gene, the new version ignores the mutations on that particular gene while older versions reported all the mutations found.

@garcia-nacho I'd be happy to look into this to see if it is a bug or expected behaviour, but without sequences it's not possible to find out what's going on.

Could you list GISAID and/or Genbank accession ids of a few representative example sequences that demonstrate problem you describe?

garcia-nacho commented 3 years ago

Thank you very much @ivan-aksamentov and @corneliusroemer for looking into this so quickly.

We will probably return to versions < 1 for the moment. It will be easier for us to re-run Nextclade and update our databases for a few samples than for several thousands of them.

Just a comment on the particular 22029-22034 deletion, both translations (S:E156-, S:F157-, S:R158G and S:E156G, S:F157-, S:R158-) are equally correct since the deletion overlaps the three codons. However, the S:E156G (Nextclade < 1) seem to be slightly more commonly found in the scientific literature and it is also the mutation that CoVsurver assigns to that deletion, so I guess that S:E156G is the de facto standard

@corneliusroemer This sequence causes the described behavior: EPI_ISL_3043634

Looking at GISAID, the sequence has these mutations on the S protein:

Spike D614G, Spike D950N, Spike E156G, Spike F157del, Spike G142D, Spike L452R, Spike P681R, Spike R158del, Spike T19R, Spike T95I, Spike T478K (Note again that GISAID also uses the SE156G name)

None of them are detected by Nextclade 1.2 (but they are detected by version 0.12), instead, a sentence is included on the .errors.csv file:

"When extracting gene ""S"": Genes are expected to have length that is a multiple of 3, but the extracted sequence has length 3814 after gap stripping. Gene coordinates: start: 21562, end: 25384, length: 3822

So it is probably due not to a dropout but to a frameshift. Frameshifts on certain accessory genes (e.g ORF7a) are very real and seem to have biological implications Nemudry et al., (2021).

We will definitely try to follow more closely the development of such a useful tool as Nextclade :)

rneher commented 3 years ago

Hi @garcia-nacho, thanks for raising this.

Regarding the deletion in the 156-158 region of Spike. as you note, the two alignments are equivalent. Which one the aligner picks is a matter of convention.

EFR
--G

or

EFR
G--

since two nucleodites of codon 156 are deleted vs 1 nucleotide of 158, one might argue that our choice is more natural. But it will come down to whether in such an ambiguous situation the aligner picks gap-first or match-first. If we change this now, we might generate changes in other locations. The Nextclade web interface includes mutation pop-overs that show exactly how the nucleotide deletion corresponds to the amino acid changes.

The sequence EPI_ISL_3043634 mention has a 2nt deletion at the beginning of Spike (position 21891-21892). This results in a non-functional Spike protein. We are currently debating how to handle such situations. Currently, we call the nucleotide mutations, but do not call the amino acid mutation. One could of course "correct" the frameshift and translate the rest of the gene as is, but this would be wrong if that frame shift was real (like in the case of the frame shift in Orf3a in clade 21H).

ivan-aksamentov commented 3 years ago

It will be easier for us to re-run Nextclade and update our databases for a few samples than for several thousands of them.

Analyzing 1000 sequences may take about 3 minutes on an average laptop, even in Nextclade Web. And Nextclade CLI is even faster. It may take seconds on a high performance server/workstation. It is definitely worth to reconsider this.

garcia-nacho commented 3 years ago

Hi @garcia-nacho, thanks for raising this.

Regarding the deletion in the 156-158 region of Spike. as you note, the two alignments are equivalent. Which one the aligner picks is a matter of convention.

EFR
--G

or

EFR
G--

@rneher I am fully aware that it is a matter of nomenclature, but since GISAID and the older versions of Nextclade use the E156G term, we would prefer to stay like that for consistency in our databases. The sequence EPI_ISL_3043634 is just an example. Sequences with real and functionally relevant frameshifts are excluded from the analysis as well. Unfortunately, we are currently tracking several frameshifts on accessory proteins.

@ivan-aksamentov I know that Nextclade is extremely fast, indeed I have run the entire GISAID more or less overnight. The problem for us is the rest of the elements that we include in our databases. Moreover, the issue with the lack of information when a frameshift arises is very concerning for us. As I mentioned above, we are currently studying several frameshifts.

ivan-aksamentov commented 3 years ago

@garcia-nacho Nextclade CLI 1.4.0 and Nextclade Web 1.7.0 no longer ignore genes with frame shifts. Feel free to give it a try!

See changelog for more context.

corneliusroemer commented 2 years ago

I'm closing this issue as the questions have been thoroughly discussed and there's not something concrete to do at the moment.