nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License
36 stars 20 forks source link

Metadata includes Nextstrain clades in the `Nextclade_pango` column #456

Closed joverlee521 closed 2 months ago

joverlee521 commented 3 months ago

TODOs

Context

Determined as the root cause for https://github.com/nextstrain/forecasts-ncov/issues/104

Both the private GISAID and public open metadata include Nextstrain clades (e.g. 24A (JN.1)) in the Nextclade_pango column.

The timing of the downstream errors coincides with the latest release of Nextclade datasets and the latest release of Nextclade v3.8.0

joverlee521 commented 3 months ago

The workflow blinding appends Nextclade TSV outputs. I'm assuming something changed in the outputs between Nextclade versions, leading to this error. (Will dig into this later...)

I'm going to cancel the currently running ncov-ingest jobs, upload the *.renew files to trigger a full Nextclade run, and re-run the ingest jobs.

joverlee521 commented 3 months ago

Uploaded *.renew files

$ aws s3 cp - s3://nextstrain-ncov-private/nextclade.tsv.zst.renew < /dev/null
$ aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst.renew < /dev/null
$ aws s3 cp - s3://nextstrain-ncov-private/nextclade_21L.tsv.zst.renew < /dev/null
$ aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst.renew < /dev/null

Running GISAID and open workflows.

joverlee521 commented 3 months ago

Ah right, open workflow is blocked on https://github.com/nextstrain/ncov-ingest/issues/455. Oh how all stars aligned this weekend 😅

At least we'll resolve the issue in GISAID metadata for now.

joverlee521 commented 3 months ago

I'm assuming something changed in the outputs between Nextclade versions, leading to this error. (Will dig into this later...)

Nextclade 3.8.0 added new columns for the new relative mutations feature and changed the order of existing output columns:

order nextclade-3.7_columns_name nextclade-3.8_columns_name
1 index index
2 seqName seqName
3 clade clade
4 Nextclade_pango clade_display
5 partiallyAliased clade_who
6 clade_nextstrain clade_nextstrain
7 clade_who partiallyAliased
8 clade_display Nextclade_pango

The change in ordering is the root cause of this issue, but the new columns would have messed things up too.

Maybe to guard against this, we should at least have a check that the new Nextclade outputs columns match the columns of the Nextclade cache.

corneliusroemer commented 3 months ago

Oh no, I always did renew whenever I updated Nextclade datasets, but the software version wasn't on my radar.

Thanks for tracking this down!

ivan-aksamentov commented 3 months ago

That's odd! The order was not supposed to be changed.

~Would be nice to remove dependency on order where possible. Though I guess this is where the old rows are concatenated to the new rows, so this is not trivial to do. I'm thinking storing only the necessary columns in a predefined order might help. And then also transforming the incoming rows to the same shape.~ Actually, I just saw https://github.com/nextstrain/ncov-ingest/issues/457 and this is easier and more reliable.

Also I need to be more careful when releasing stuff :)

corneliusroemer commented 2 months ago

Also I need to be more careful when releasing stuff :)

I don't think this is at all on you @ivan-aksamentov - TSV columns can change, we shouldn't rely on stability and invalidate cache properly.

joverlee521 commented 2 months ago

Oy, the GISAID ingest is still running after 20+ hours...if it runs past 24 hours, the GH Action workflow will show success/complete, but the job will continue to run on AWS Batch.

(I will add final stats.json to https://github.com/nextstrain/ncov-ingest/issues/446)

corneliusroemer commented 2 months ago

Ah yeah - we almost never run with the 21L touchfiles, so it's quite some extra work to do. Choosing a larger machine in the future might work if we know there's a pending full rerun. Nextclade parallelizes very well, so can go to a machine with say 64 cores or even larger.

joverlee521 commented 2 months ago

Nextclade parallelizes very well, so can go to a machine with say 64 cores or even larger.

Oh, I'm not sure the workflow is utilizing this because there are no threads defined for the Snakemake rule.

corneliusroemer commented 2 months ago

Oh, I'm not sure the workflow is utilizing this because there are no threads defined for the Snakemake rule.

No need for threads, nextclade uses all by default. What I meant to say was that we can use more cores to speed things up if we find workflows take too long. See comment over at #459 :)

joverlee521 commented 2 months ago

Verified the full reruns of Nextclade fixed the metadata for both GISAID and open.