hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

Feature Request: Full AA sequence of variant #327

Closed utimcraig closed 2 years ago

utimcraig commented 2 years ago

Hello. Right now there is a listing of mutations for each variant, but it would be really convenient if, instead of showing only the mutations, the page also gave the full amino-acid sequence of the spike, or of any other protein that carries mutations. With all of the deletions and insertions, it is very easy to get the numbering wrong when you are trying to make, for example, a sequence alignment of the spike proteins for BA.1, BA.2, and BA.5. If that exists somewhere else, I would be grateful if you could point me to it.

Is: [image]

Could: Show the full AA sequence of S, ORF1a, N, ORF1b, and ORF8, perhaps behind a collapsible toggle to keep the page tidy

emmahodcroft commented 2 years ago

Hi @utimcraig! I'm afraid I don't always have the ability to share an amino-acid sequence for the root of the tree. Sometimes, the only sequences available at such a root may be from GISAID, which cannot be openly shared. Other times, there is no real sequence that corresponds to the root (I determine the mutations from another sequence, and then remove any that are specific to that sequence alone).

However, one can sometimes use the focal builds to identify sequences close to or at the root of a cluster, which one could then get from GenBank or GISAID and use to build such alignments. Alternatively, the Nextclade reference tree for SARS-CoV-2 does have synthetic sequences, some of which may be at the bases of clusters. The sequences themselves are not available, but one could reconstruct the mutations along the tree if one wanted: https://nextstrain.org/nextclade/sars-cov-2/2022-04-25
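For anyone taking the reconstruction route, here is a minimal sketch of the last step: applying a list of amino-acid mutations to a reference protein while keeping the reference numbering. It is an illustration under assumptions, not how CoVariants or Nextclade do it. The function name `apply_mutations`, the mutation notation (`K2R` for a substitution, `A4-` for a deletion), and the ten-residue toy reference are all hypothetical; a real run would need the actual reference protein sequence. Insertions are not handled, because they add residues that have no reference position.

```python
import re

# Assumed notation: <reference residue><1-based position><new residue or '-'>,
# e.g. "K2R" (substitution) or "A4-" (deletion). Insertions are out of scope.
MUTATION_RE = re.compile(r"^([A-Z*])(\d+)([A-Z*-])$")


def apply_mutations(reference, mutations):
    """Return the variant sequence aligned to the reference.

    Deleted residues are kept as '-' so every column still corresponds to
    the same reference position, which keeps the numbering intact.
    """
    residues = list(reference)
    for mutation in mutations:
        match = MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"Unrecognised mutation: {mutation!r}")
        ref_aa = match.group(1)
        position = int(match.group(2))
        new_aa = match.group(3)
        if not 1 <= position <= len(reference):
            raise ValueError(f"Position out of range: {mutation!r}")
        # Guard against off-by-one numbering: the stated reference residue
        # must match what the reference actually has at that position.
        if reference[position - 1] != ref_aa:
            raise ValueError(
                f"{mutation!r} expects {ref_aa} at {position}, "
                f"found {reference[position - 1]}"
            )
        residues[position - 1] = new_aa
    return "".join(residues)


# Toy reference, NOT a real SARS-CoV-2 protein.
toy_reference = "MKTAYIAKQR"
toy_mutations = ["K2R", "A4-", "Y5-", "Q9H"]

aligned = apply_mutations(toy_reference, toy_mutations)
ungapped = aligned.replace("-", "")

print(aligned)   # MRT--IAKHR
print(ungapped)  # MRTIAKHR
```

The gapped output can be stacked directly under the reference and under other variants built the same way, which gives a simple alignment as long as no insertions are involved. The ungapped string is the protein as it would actually be translated.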

I am sorry I can't provide such alignments myself, but hope some of this is helpful!