nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License

Representation of a nuc conversion does not point to a conversion on the genome #1107

Open huzuner opened 4 months ago

huzuner commented 4 months ago

Hello,

For our research, I have been using resources from Nextstrain and I encountered an ambiguity that I wanted to ask about.

I am currently working with the tabular file of nuc conversions that define Nextstrain clades (https://github.com/nextstrain/ncov/blob/master/defaults/clades.tsv). I have been assuming that each nuc entry in clades.tsv describes a nucleotide change relative to the reference genome. However, I noticed that one of the entries listed there, 22F nuc 15461 A, specifies A at a position where the reference genome (NC_045512v2) already has A. Does this have an intended purpose, or do I misunderstand the concept?

Thank you in advance for your response.
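
The check described in the question above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the TSV has a header row with the columns clade, gene, site and alt, that sites are 1-based, and that the reference is available as a plain string. The function name and the toy data are hypothetical and are not taken from the ncov repository.

```python
import csv
import io


def find_reference_matching_entries(clades_tsv_text, reference):
    """Return (clade, site, alt) for nuc entries whose alt base equals the reference base.

    Assumes columns named clade, gene, site, alt and 1-based sites.
    """
    # Skip blank lines and comment lines before handing the text to the CSV reader.
    lines = (
        line
        for line in io.StringIO(clades_tsv_text)
        if line.strip() and not line.startswith("#")
    )
    matches = []
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["gene"] != "nuc":
            continue  # only nucleotide entries are compared with the reference
        site = int(row["site"])
        if reference[site - 1].upper() == row["alt"].upper():
            matches.append((row["clade"], site, row["alt"]))
    return matches


# Toy data for illustration only (not real SARS-CoV-2 positions).
toy_reference = "ACGTACGTAC"
toy_tsv = (
    "clade\tgene\tsite\talt\n"
    "toyA\tnuc\t3\tT\n"  # reference has G at site 3, so this is a real change
    "toyB\tnuc\t5\tA\n"  # reference already has A at site 5
    "toyB\tS\t2\tK\n"    # amino-acid entry, ignored by the check
)

print(find_reference_matching_entries(toy_tsv, toy_reference))
```

With the real clades.tsv and the reference sequence loaded as a string, any row this returns would be an entry whose alt base is already present in the reference, like the 22F example mentioned above.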

jameshadfield commented 3 months ago

Yes, it looks like there are no (or very few) mutations observed at this position. It may have been a typo. Using all four defining positions, we can see how they identify 22F.

Note that these definitions are the ones we use for augur clades to define clades on a tree. They're not intended to be used for individual sequence classification. We recommend using nextclade for classification of sequences.

huzuner commented 2 months ago

Thanks for your answer!

What is the difference between augur clades and Nextstrain clades? Are they not the same, or at least similar?

In my case, I need either full fasta sequences or nucleotide changes of Nextstrain/Nextclade clades so that I can create fasta sequences myself. I did not find any of this in 'Nextclade datasets'. The only resource that is relevant is 'tree.json' but we identified some inconsistencies for some clades when compared with covariants.org. That's how I ended up with this 'clades.tsv', assuming that it provides the actual nucleotide differences belonging to clades.
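
For illustration, building a FASTA record from a list of nucleotide substitutions could look roughly like the sketch below. It assumes 1-based sites and a reference held as a plain string; the helper names and toy data are hypothetical. As noted earlier in the thread, the clades.tsv definitions are meant for labelling clades on a tree, so a sequence built from them alone would carry only those defining positions and may not be a faithful representative of a clade.

```python
def apply_substitutions(reference, substitutions):
    """Return a copy of reference with each (site, alt) substitution applied.

    Sites are assumed to be 1-based.
    """
    seq = list(reference)
    for site, alt in substitutions:
        if not 1 <= site <= len(seq):
            raise ValueError(f"site {site} is outside the reference (length {len(seq)})")
        seq[site - 1] = alt
    return "".join(seq)


def to_fasta(name, sequence, width=60):
    """Format a single FASTA record, wrapping the sequence at width characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy data for illustration only (not real SARS-CoV-2 positions).
toy_reference = "ACGTACGTAC"
toy_substitutions = [(3, "T"), (10, "G")]

toy_sequence = apply_substitutions(toy_reference, toy_substitutions)
print(to_fasta("toy_clade", toy_sequence), end="")
```

The sketch handles substitutions only; insertions and deletions would need separate treatment.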

jameshadfield commented 2 months ago

cc @corneliusroemer - I think you have a fasta of representative sequences for each lineage?

huzuner commented 2 months ago

> cc @corneliusroemer - I think you have a fasta of representative sequences for each lineage?

Would be great to hear if this exists.