nextstrain / nextclade_data

Datasets for https://github.com/nextstrain/nextclade
https://clades.nextstrain.org

Add measles dataset #202

Closed kimandrews closed 5 months ago

kimandrews commented 5 months ago

Add a measles dataset to Nextclade.

ivan-aksamentov commented 5 months ago

Nice!

https://clades.nextstrain.org/?dataset-server=gh:@add-measles-dataset@&dataset-name=nextstrain/measles

It seems to work technically, but I haven't checked the science of things (I am not a scientist :))

I know nothing about measles, but if it makes sense, please consider creating subdirectories in case there are more dataset flavors in the future. For example, you could distinguish datasets by reference accession (nextstrain/measles/NC-001498-1), by nomenclature (nextstrain/measles/who/clade-A/subtype-1), or by some prominent feature(s) (nextstrain/measles/host-elephants/flavor-vanilla), or something like that. Path choice considerations are described in the docs/ directory.

In contrast to nextstrain.org URLs, Nextclade datasets cannot be nested one inside another (in other words, only leaf directories can contain dataset files). So if we add just the nextstrain/measles directory this time, nextstrain/measles/2 and nextstrain/measles/another will not be possible later, and we would have to invent something like nextstrain/measles-2 or nextstrain/measles-another, which will not look nice next to the other organisms.
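To make the leaf-directory rule concrete, here is a minimal sketch of a check that flags dataset paths nested inside other dataset paths. The function name `find_nested_datasets` is hypothetical and not part of the nextclade_data tooling. It only illustrates the constraint described above.

```python
from typing import Iterable, List, Tuple


def find_nested_datasets(paths: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where one dataset path contains another.

    Hypothetical helper: it only illustrates the rule that dataset files
    may live in leaf directories only.
    """
    # Compare path components, not raw strings, so that
    # "nextstrain/measles-2" is not treated as a child of "nextstrain/measles".
    parts = sorted({tuple(p.strip("/").split("/")) for p in paths})
    conflicts = []
    for parent in parts:
        for child in parts:
            if len(child) > len(parent) and child[: len(parent)] == parent:
                conflicts.append(("/".join(parent), "/".join(child)))
    return conflicts


if __name__ == "__main__":
    # "nextstrain/measles" cannot coexist with "nextstrain/measles/2",
    # while "nextstrain/measles-2" is a sibling and therefore allowed.
    print(find_nested_datasets([
        "nextstrain/measles",
        "nextstrain/measles/2",
        "nextstrain/measles-2",
    ]))
```

Under this rule, starting with nextstrain/measles/NC-001498-1 leaves room for sibling datasets later, whereas starting with nextstrain/measles does not.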

Additionally, it is possible to add shortcuts (aliases), so you can, for example, alias nextstrain/measles/A/12/delta to nextstrain/measles, nextstrain/measles/A, and even measles. Shortcuts must not conflict with one another across all of the datasets.
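The uniqueness requirement for shortcuts can be sketched as a simple check. This is only an illustration: the input is a plain dict from dataset path to its list of shortcuts (a hypothetical structure, not the repository's actual file format). Treating a shortcut that equals another dataset's path as a conflict is an assumption made here.

```python
from collections import defaultdict
from typing import Dict, List


def find_shortcut_conflicts(shortcuts_by_dataset: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each contested name to the sorted list of datasets that claim it.

    Hypothetical helper, not part of the nextclade_data tooling.
    """
    claimed = defaultdict(set)
    for dataset, shortcuts in shortcuts_by_dataset.items():
        # Assumption: a dataset's own path also occupies a name, so a
        # shortcut may not shadow the path of a different dataset.
        claimed[dataset].add(dataset)
        for shortcut in shortcuts:
            claimed[shortcut].add(dataset)
    return {name: sorted(owners) for name, owners in claimed.items() if len(owners) > 1}


if __name__ == "__main__":
    # Both (made-up) datasets try to claim the short name "measles".
    print(find_shortcut_conflicts({
        "nextstrain/measles/A/12/delta": ["nextstrain/measles", "measles"],
        "nextstrain/measles/B": ["measles"],
    }))
```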

rneher commented 5 months ago

Nice! As Ivan said, it would make sense to add one additional level to the path (for example the reference, or an indication that this is just N). We might want to add a full genome build at some point.

Consider using --include-root-sequence-inline during the export step. This will allow coloring by sites that aren't variable.
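One way to confirm that the flag took effect is to look for an embedded root sequence in the exported tree JSON. The sketch below assumes the export writes a top-level `root_sequence` object with a `nuc` entry, which is my understanding of the Auspice v2 JSON layout. Verify this against the Augur version in use. The helper name is hypothetical.

```python
import json
from typing import Any, Dict


def has_inline_root_sequence(dataset: Dict[str, Any]) -> bool:
    """Return True if the exported JSON carries a non-empty nucleotide root sequence.

    Assumption: the sequence is stored under dataset["root_sequence"]["nuc"].
    """
    root = dataset.get("root_sequence")
    return isinstance(root, dict) and bool(root.get("nuc"))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for an exported tree JSON file.
    with_root = json.loads('{"version": "v2", "root_sequence": {"nuc": "ACGT"}}')
    without_root = json.loads('{"version": "v2"}')
    print(has_inline_root_sequence(with_root), has_inline_root_sequence(without_root))
```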

Nextclade also has no use for time information on the trees. It doesn't really hurt, but it leads to unexpected behavior if users start toggling between time and divergence. So you might consider removing it for the Nextclade build.
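The comment does not say how to remove the time information. In practice it is probably simplest to skip time-tree inference in the workflow that produces the Nextclade tree. As a rough post-processing sketch, assuming dates are stored per node under `node_attrs["num_date"]` (my understanding of the Auspice v2 JSON layout), a hypothetical helper could strip them while leaving divergence intact:

```python
from typing import Any, Dict


def strip_time_info(node: Dict[str, Any]) -> None:
    """Remove `num_date` from a tree node and all of its descendants, in place.

    Hypothetical helper. It assumes an Auspice-style nested tree with
    `node_attrs` and `children` keys, and does not touch other metadata.
    """
    stack = [node]
    while stack:  # iterative traversal avoids recursion limits on deep trees
        current = stack.pop()
        current.get("node_attrs", {}).pop("num_date", None)
        stack.extend(current.get("children", []))


if __name__ == "__main__":
    # Made-up two-node tree for illustration.
    tree = {
        "name": "root",
        "node_attrs": {"div": 0, "num_date": {"value": 2000.0}},
        "children": [
            {"name": "tip", "node_attrs": {"div": 1, "num_date": {"value": 2001.5}}},
        ],
    }
    strip_time_info(tree)
    print(tree)
```

Any display settings that still refer to dates would need to be adjusted separately. This sketch only touches the tree nodes.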