Prepare for Changes in Clade Naming in Nextclade/Nextstrain

emmahodcroft commented 1 year ago

Nextclade now splits Nextstrain clade names into a year-letter part and a WHO part, and gives the "old" full name only in a new column, clade_legacy.

Example: Old: clade_nextstrain == 22F (Omicron)

New: clade_nextstrain == 22F, clade_who == Omicron, clade_legacy == 22F (Omicron)
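To make the relationship between the three columns concrete, here is a minimal sketch of splitting a legacy name into its year-letter and WHO parts. The function name and the regular expression are hypothetical, and the pattern assumes legacy names always look like `22F (Omicron)` or a bare `20A`; this is not how Nextclade itself derives the columns.

```python
import re

# Assumed shape of a legacy name: two digits, one capital letter,
# then an optional WHO label in parentheses, e.g. "22F (Omicron)" or "20A".
LEGACY_PATTERN = re.compile(r"(?P<year_letter>\d{2}[A-Z])(?:\s+\((?P<who>[^)]+)\))?")


def split_legacy_clade(legacy_name):
    """Return (year_letter, who) for a legacy clade name; who is None if absent."""
    match = LEGACY_PATTERN.fullmatch(legacy_name.strip())
    if match is None:
        raise ValueError(f"Unrecognised clade name: {legacy_name!r}")
    return match.group("year_letter"), match.group("who")


year_letter, who = split_legacy_clade("22F (Omicron)")
print(year_letter, who)
```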

This doesn't directly impact CoVariants, as we don't use the Nextclade file directly but rather the metadata.tsv that comes out of the ncov-ingest workflow. Currently this hasn't changed, but it may change, either by just replacing the value in Nextstrain_clade (which we use) with the shortened name, or by doing this and also adding a "legacy" column.

For clarity, we currently compare values in Nextstrain_clade with display_name from clusters.py (containing things like 22F (Omicron)).

If a legacy column is added, switching is as simple as just using this new column, with the rest of the code remaining the same. If there isn't one, or we want to be more future-proof, we should ensure we can just use a different entry in clusters.py which has the year-letter name.
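The two options above could be sketched roughly as follows. This is only an illustration: the function name is hypothetical, `clade_legacy` is an assumed name for a legacy column that metadata.tsv may or may not get, and the check for a parenthesis is a guess at how to tell full names from year-letter names.

```python
def choose_clade_matching(columns, sample_values, legacy_column="clade_legacy"):
    """Return (metadata_column, clusters_key) to use when matching clades.

    columns:       column names found in metadata.tsv
    sample_values: some values taken from the Nextstrain_clade column
    legacy_column: assumed name of a legacy column, if one is ever added
    """
    # Option 1: a legacy column exists, so keep comparing against display_name.
    if legacy_column in columns:
        return legacy_column, "display_name"
    # No legacy column, but values still carry the full name, e.g. "22F (Omicron)".
    if any("(" in value for value in sample_values):
        return "Nextstrain_clade", "display_name"
    # Option 2: only year-letter names remain, so compare against nextstrain_name.
    return "Nextstrain_clade", "nextstrain_name"


print(choose_clade_matching(["strain", "Nextstrain_clade"], ["22F", "21L"]))
```

Returning the clusters.py key alongside the column name keeps the rest of the comparison code the same whichever option ends up applying.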

We currently have an entry nextstrain_name, but this has been used inconsistently - sometimes with the 'full' name (21L (Omicron)) and sometimes just year-letter (22A). To help us switch to that option more easily in future, I propose switching now so that all nextstrain_name entries are year-letter.
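A quick way to find the inconsistent entries might look like the sketch below. It assumes clusters.py exposes a dict of dicts keyed by cluster, with an optional `nextstrain_name` in each entry; that shape, the function name, and the example data are assumptions for illustration, not copied from the repository.

```python
import re

# Year-letter names are assumed to be two digits followed by one capital letter.
YEAR_LETTER = re.compile(r"\d{2}[A-Z]")


def non_year_letter_entries(clusters):
    """Return {cluster_key: nextstrain_name} for entries that are not year-letter."""
    inconsistent = {}
    for key, entry in clusters.items():
        name = entry.get("nextstrain_name")
        # Clusters that are not official Nextstrain clades have no entry; skip them.
        if name is not None and not YEAR_LETTER.fullmatch(name):
            inconsistent[key] = name
    return inconsistent


# Hypothetical example data in the assumed shape.
example_clusters = {
    "21L": {"display_name": "21L (Omicron)", "nextstrain_name": "21L (Omicron)"},
    "22A": {"display_name": "22A (Omicron)", "nextstrain_name": "22A"},
    "SomePangoCluster": {"display_name": "Some Pango cluster"},
}
print(non_year_letter_entries(example_clusters))
```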

This should mean that, in future, we would only need to switch from using display_name in cluster_analysis.py to using nextstrain_name. This shouldn't be too bad, but it will need checking, as it's a little more complex than I thought.

If this is the path we take, here's a small checklist:

[x] Change all nextstrain_name to use year-letter
[ ] Adjust cluster_analysis.py to use nextstrain_name instead of display_name - and check it works.

Clearly, all of the above is only relevant to clades we track that are official Nextstrain clades. For those that aren't official (mostly older ones), we use Pango or SNPs, so this is unchanged.

ivan-aksamentov commented 1 year ago

@emmahodcroft Let's see if this also affects web. Theoretically, web *should* use only build_name in significant places, but there might be some funny effects in case I deviated from that. So please also watch out for strange things in web as you migrate.

I hope you don't need to change build names. If you do, then it will be a journey, because that's how the md files, URLs and other stuff are linked together.

emmahodcroft commented 1 year ago

I don't plan to change the build names, as they're used all over.

RE the nextstrain_name -- I'll keep an eye out - I had the same thought. The main reason I am fairly confident is that it turns out a while ago I accidentally got inconsistent about the naming (started using just year-letter) and, as far as I can tell, I've never noticed any impact of this. This is the main thing that made me confident that we must not be using it anywhere, or I'd have noticed when I first started messing it up (probably about a year ago now) or sometime in between.

But agree - can't be too careful!

I do not expect to change the build_name - totally agree. The only other thing that might be worth exploring changing is display_name as perhaps we'd like to move to something a bit more flexible (perhaps including the pango in some cases, as Nextstrain is somewhat moving to do?). But I'd want to do a separate scope to check how much this is used.

hodcroftlab / covariants

Prepare for Changes in Clade Naming in Nextclade/Nextstrain #371