nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License
35 stars 20 forks source link

Fix: add clade_legacy back via mapping #405

Closed corneliusroemer closed 1 year ago

corneliusroemer commented 1 year ago

Description of proposed changes

This PR fixes ingest to work with the latest Nextclade dataset release: at the moment it fails due to clade_legacy no longer being output by Nextclade starting with dataset release 2023-06-16.

(This is due to https://github.com/neherlab/nextclade_data_workflows/pull/42 which in turn was triggered by a refactor in ncov of how we annotate clades https://github.com/nextstrain/ncov/pull/1065.)

So as not to break downstream workflows that rely on ingest output metadata.tsv having clade_legacy, this PR adds a clade_legacy column to metadata.tsv

The values are defined as a simple mapping from clade_nextstrain (year-letter, e.g. 22F) to clade_legacy in defaults/clade-legacy-mapping.yml

This file lives in ingest for now to make this PR work without requiring changes to ncov.

Testing

Future work

emmahodcroft commented 1 year ago

@joverlee521 Every time I step away from this issue I have to go back and try to remember it all again (which I haven't done this morning) but my memory is that hopefully these steps are unwrapping these circular dependencies...

Nextclade now no longer knows about the legacy clade names from Nextstrain & thus doesn't output them. But due to that, we have to add them back into Nextstrain metadata to support legacy use (at least until such time as we decide to stop supporting that, with some warning or etc). This should mean that future changes to Nextclade should be more independent - and same on Nextstrain end as far as how we want to rename clades or map them to various labels, etc....

That's the plan anyway, but I'm a bit lost on what step we're on ATM TBH 🤷 !

corneliusroemer commented 1 year ago

@emmahodcroft your summary is entirely correct. Legacy definitions are now an ingest-only concern for backwards compatibility.

corneliusroemer commented 1 year ago