nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

ENH(nextclade cli): nextclade dataset list: indicate whether clades can be assigned #1458

Closed AngieHinrichs closed 5 months ago

AngieHinrichs commented 5 months ago

In the output of nextclade dataset list it would be very helpful to have an indication of whether clades can be assigned using each dataset. For example, dataset nextstrain/flu/h3n2/ha/EPI1857216 can assign clades, but nextstrain/flu/h3n2/pb1 cannot (it has no tree.json). Currently, in order to determine that, I need to download each dataset and look for tree.json.

Does the presence of tree.json in a dataset always mean that clades can be assigned? If so, then hopefully it would be straightforward for nextclade dataset list to report whether pathogen.json includes treeJson.

ivan-aksamentov commented 5 months ago

@AngieHinrichs

Hi Angie,

> Does the presence of tree.json in a dataset always mean that clades can be assigned?

We released 3.6.0 earlier today, in which clades become optional even if the tree is present. Previously, our folks used an empty string in place of the clade_membership tree field as a workaround when clades were missing from the tree for one reason or another (most of the time this is due to unclear nomenclature or lack of time).

Currently I'd say that downloading the tree and checking whether there's at least one .node_attrs.clade_membership in it is a safe bet.
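The check described above could be sketched roughly as follows in Python. This is a minimal sketch, not Nextclade's own code: it assumes tree.json follows the Auspice v2 layout (a single root node under "tree", children under "children", attributes stored as {"value": ...}), and it treats the empty-string workaround mentioned earlier as "no clade". The function names and the example clade value "A" are hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_nodes(root: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every node of the tree depth-first, using an explicit stack."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.get("children", []))


def has_clades(tree_json: Dict[str, Any]) -> bool:
    """Return True if at least one node has a non-empty clade_membership."""
    # Assumption: "tree" holds a single root node object.
    root = tree_json.get("tree", {})
    for node in iter_nodes(root):
        clade = node.get("node_attrs", {}).get("clade_membership")
        # Assumption: attributes look like {"value": "..."}; accept a bare string too.
        value = clade.get("value") if isinstance(clade, dict) else clade
        if value:  # an empty string (the old workaround) counts as missing
            return True
    return False


# Illustrative in-memory trees; the clade name "A" is made up.
with_clades = {
    "tree": {
        "name": "root",
        "node_attrs": {"clade_membership": {"value": ""}},
        "children": [
            {"name": "leaf", "node_attrs": {"clade_membership": {"value": "A"}}}
        ],
    }
}
without_clades = {
    "tree": {"name": "root", "node_attrs": {"clade_membership": {"value": ""}}}
}
```

For a downloaded dataset, the same function would be applied to the result of `json.load` on its tree.json.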

For the official datasets in the data repo, we could enumerate each dataset's "capabilities" when rebuilding the dataset index. I already emit some basics into the index.json of the dataset server, but not clade assignment. Might be a good addition.

Do you have any other such capabilities in mind that we could add? I am having difficulty imagining how that would look from the user's perspective, since I don't use Nextclade often myself :)

Once we have a list of capabilities in the index, the --json flag to the dataset list command should show it as it appears in the index. Then the list can be pretty-printed in the CLI and rendered in the Web app in some way. Any preferences here?
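From the user side, consuming such a list might look something like the sketch below. The JSON shape here is an assumption made for illustration only: it supposes each dataset entry has a "path" and an optional "capabilities" object keyed by capability name, which may not match the real index. The helper name and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def datasets_with_capability(datasets: List[Dict[str, Any]], name: str) -> List[str]:
    """Return the paths of datasets whose capabilities include a truthy `name`."""
    paths = []
    for dataset in datasets:
        # Assumption: capabilities is a dict; a missing key means "not supported".
        capabilities = dataset.get("capabilities") or {}
        if capabilities.get(name):
            paths.append(dataset.get("path", ""))
    return paths


# Hypothetical entries, shaped like the assumed JSON output; the count is made up.
sample = [
    {"path": "nextstrain/flu/h3n2/ha/EPI1857216", "capabilities": {"clades": 3}},
    {"path": "nextstrain/flu/h3n2/pb1", "capabilities": {}},
]

clade_datasets = datasets_with_capability(sample, "clades")
```

In practice the input would come from parsing the output of the dataset list command run with --json, once the real field names are known.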

ivan-aksamentov commented 5 months ago

We should also not forget about clade-like attributes, which may be present on the tree in .meta.extensions.nextclade.clade_node_attrs, e.g. lineages in SC2 trees.
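Listing those clade-like attributes could be sketched as below. This assumes the entries under clade_node_attrs are objects carrying a "name" key (bare strings are tolerated as well); that shape, the function name, and the example attribute name "lineage" are assumptions for illustration.

```python
from typing import Any, Dict, List


def clade_like_attrs(tree_json: Dict[str, Any]) -> List[str]:
    """Return names of clade-like attributes declared in the tree's metadata."""
    nextclade_ext = (
        tree_json.get("meta", {}).get("extensions", {}).get("nextclade", {})
    )
    names = []
    for entry in nextclade_ext.get("clade_node_attrs", []):
        # Assumption: each entry is {"name": ..., ...}; accept a plain string too.
        name = entry.get("name") if isinstance(entry, dict) else entry
        if name:
            names.append(name)
    return names


# Illustrative metadata only; "lineage" is a made-up attribute name.
example = {
    "meta": {
        "extensions": {"nextclade": {"clade_node_attrs": [{"name": "lineage"}]}}
    }
}
```

A dataset with neither clade_membership values nor any entries here would presumably be reported as having no clade-related capabilities.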

ivan-aksamentov commented 5 months ago

The tree-related capabilities could be computed in the rebuild script somewhere around here, I guess https://github.com/nextstrain/nextclade_data/blob/403e2574654daacc40b0face461965da41e953d2/scripts/rebuild#L43-L45

AngieHinrichs commented 5 months ago

> The tree-related capabilities could be computed in the rebuild script somewhere around here, I guess https://github.com/nextstrain/nextclade_data/blob/403e2574654daacc40b0face461965da41e953d2/scripts/rebuild#L43-L45

Yes, if you could add "clades" there like you add "customClades", and include the capabilities in the cli list output, that would be great! At the moment, clades are what I'm keen to see, but I would not mind seeing other special capabilities listed.

ivan-aksamentov commented 5 months ago

Implemented in https://github.com/nextstrain/nextclade/pull/1473 and https://github.com/nextstrain/nextclade_data/pull/205

ivan-aksamentov commented 5 months ago

Released in 3.7.0

AngieHinrichs commented 5 months ago

Fantastic, thanks! The types and counts are really helpful!