hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

Missing clades in defining mutations #393

Open huzuner opened 3 months ago

huzuner commented 3 months ago

Hi,

Is there a reason why some clades are not listed here: https://github.com/hodcroftlab/covariants/tree/master/defining_mutations ? For example 20A, 21I and 21J? I can see that they're listed on the website, but they're not in that directory in the repo. Or do you have another resource that lists the defining mutations of all Nextstrain clades?

I would be more than glad for an answer, thank you!

emmahodcroft commented 4 days ago

Hi @huzuner - apologies for such a delayed answer! The defining mutations files are something I put together by hand to make the information more accessible, but I only started doing this part-way through the pandemic and haven't had time to go back and generate them for older variants, I'm afraid!

The information isn't in the same format (in particular, the associated nucs and AAs, and the information on whether it's from the parent or a reversion), but as you noticed, it is on the website and thus is available, just not as 'prettily' :)

For any variant that is on the website, you can find the list of defining mutations (nonsynonymous as AA and synonymous as nuc - what shows up on the right-hand side of the variant page) here: https://github.com/hodcroftlab/covariants/blob/master/scripts/clusters.py
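
As a rough illustration of how such a list could be pulled out programmatically, here is a minimal Python sketch. It assumes each variant entry carries a `mutations` dictionary with `nonsynonymous` items (`gene`, `left`, `pos`, `right`) and `synonymous` items (`left`, `pos`, `right`). That layout, the `example_clusters` data and the `format_defining_mutations` helper are all assumptions made for illustration, so check them against the actual contents of `clusters.py` before relying on this.

```python
# Hypothetical stand-in for a single entry of the kind found in clusters.py.
# The layout is an assumption and the mutations are placeholders; they are
# NOT the defining mutations of any real clade.
example_clusters = {
    "ExampleVariant": {
        "mutations": {
            "nonsynonymous": [
                {"gene": "S", "left": "N", "pos": 501, "right": "Y"},
                {"gene": "N", "left": "D", "pos": 3, "right": "L"},
            ],
            "synonymous": [
                {"left": "C", "pos": 241, "right": "T"},
            ],
        },
    },
}


def format_defining_mutations(entry):
    """Return amino-acid and nucleotide mutations as short strings.

    Nonsynonymous mutations become 'GENE:<left><pos><right>' and
    synonymous mutations become '<left><pos><right>'.
    Entries without a 'mutations' key yield two empty lists.
    """
    muts = entry.get("mutations", {})
    aa = [
        f"{m['gene']}:{m['left']}{m['pos']}{m['right']}"
        for m in muts.get("nonsynonymous", [])
    ]
    nuc = [
        f"{m['left']}{m['pos']}{m['right']}"
        for m in muts.get("synonymous", [])
    ]
    return {"aa": aa, "nuc": nuc}


def list_all(clusters):
    """Apply format_defining_mutations to every variant in the dictionary."""
    return {name: format_defining_mutations(e) for name, e in clusters.items()}


if __name__ == "__main__":
    for name, muts in list_all(example_clusters).items():
        print(name, "AA:", ", ".join(muts["aa"]), "| nuc:", ", ".join(muts["nuc"]))
```

To use this with the real data, the dictionary defined in `clusters.py` would be passed to `list_all` in place of `example_clusters`, assuming it follows the layout described above.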

As I said, it's not as complete, but it may still be helpful. If for any reason you end up assembling similar files for the 'missing variants', I would really appreciate it if you would submit them to the GitHub repo so I could include them for others!