jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Nextstrain tree only contains intergenic mutations #70

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

The current nextstrain conversion script converts the "nuc" mutations and ignores the rest. I had assumed that the file was helpfully giving every mutation in both gene and nucleotide format, but I now think the "nuc" entries only cover the intergenic mutations.

@hyanwong - this would make the comparison in terms of mutations pretty meaningless, unless we also do the same thing (i.e., get the gene mutations as well - I'm working on this)
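
One quick way to check this is to count how many mutations sit under each key across the whole tree. The following is a minimal sketch, assuming the JSON follows the Auspice v2 layout in which each node carries a `branch_attrs.mutations` dict keyed by "nuc" or a gene name. The toy tree and the function names are hypothetical and are not taken from the conversion script.

```python
from collections import Counter


def iter_nodes(node):
    """Yield every node in a nested tree dict (pre-order, iterative)."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        # reversed() keeps left-to-right order when popping from the stack.
        stack.extend(reversed(current.get("children", [])))


def count_mutations_by_key(root):
    """Count mutation strings per key ("nuc" or a gene name) over the tree."""
    counts = Counter()
    for node in iter_nodes(root):
        # Assumed layout: node["branch_attrs"]["mutations"] maps key -> list of strings.
        mutations = node.get("branch_attrs", {}).get("mutations", {})
        for key, changes in mutations.items():
            counts[key] += len(changes)
    return counts


# Hypothetical toy tree, made up for illustration only.
toy_tree = {
    "name": "root",
    "children": [
        {
            "name": "n1",
            "branch_attrs": {"mutations": {"nuc": ["C241T"], "S": ["D614G"]}},
            "children": [
                {
                    "name": "leaf_a",
                    "branch_attrs": {"mutations": {"ORF1a": ["T265I"]}},
                },
            ],
        },
        {"name": "leaf_b", "branch_attrs": {"mutations": {}}},
    ],
}

counts = count_mutations_by_key(toy_tree)
for key, n in sorted(counts.items()):
    print(key, n)
```

On a real file, a "nuc" count far smaller than the combined gene counts would be consistent with the "nuc" entries only covering part of the mutations.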

hyanwong commented 1 year ago

Ah! I haven't got around to looking at the mutations yet: currently focussing on the trees, but yes, a good thing to know about and fix.

hyanwong commented 1 year ago

Note that this is actually getting the NextClade tree, which is in JSON format, not the NextStrain tree, which is in Nexus format and is downloadable from the link at the bottom of the nextstrain pages rather than via a URL. The NextClade tree doesn't have branch lengths, so we decided not to use it for the time being. But the NextStrain tree doesn't have mutations of any sort on it; distributing them is probably disallowed by GISAID anyway. Here are some details on how to actually get that data, if we want it:

https://discussion.nextstrain.org/t/sars-cov-2-mutation-data/78/3

I think we don't need it for the preprint, though.