nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Host field is empty #383

Closed chaoran-chen closed 1 year ago

chaoran-chen commented 1 year ago

As of very recently, the host field in the open metadata.tsv is (mostly or always) empty. Was the field removed intentionally?

This is unfortunately breaking the open instance of CoV-Spectrum at the moment because it only shows human sequences by default (https://github.com/GenSpectrum/LAPIS/issues/70).

tsibley commented 1 year ago

Hmm. https://github.com/nextstrain/ncov-ingest/pull/381 was merged last week. Maybe the host field got broken by that switch?

tsibley commented 1 year ago

Appears so. I downloaded the following two versions of s3://nextstrain-data/files/ncov/open/metadata.tsv.zst, which span the merge of #381:

5FRHFn5_GZf5rYUwWsdYsFrGTyXPYpkU (after merge)
8viIxhsktDA13zUrJd.0aJap1u.V.pNi (before merge)

and diffed the first 1,000 rows of them.
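
For anyone wanting to repeat this kind of comparison, here is a minimal sketch of a per-column diff. It is not the exact method used above. It assumes both files have already been decompressed (for example with zstd), that rows can be matched on a `strain` column, and the function name `changed_columns` is made up for illustration.

```python
import csv
import io
from collections import Counter
from itertools import islice


def changed_columns(before_tsv, after_tsv, key="strain", limit=1000):
    """Count, per column, how many rows differ between two TSV versions.

    Only the first `limit` rows of each file are read. Rows are matched on
    the `key` column (assumed here to be "strain").
    """
    before_reader = csv.DictReader(before_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    before = {row[key]: row for row in islice(before_reader, limit)}

    after_reader = csv.DictReader(after_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    changes = Counter()
    for row in islice(after_reader, limit):
        old = before.get(row[key])
        if old is None:
            continue  # record only present in the newer file
        for column, value in row.items():
            if old.get(column) != value:
                changes[column] += 1
    return changes


# Tiny in-memory example with made-up records.
before = "strain\thost\tdate\nA\tHomo sapiens\t2023-01-01\nB\tHomo sapiens\t2023-01-02\n"
after = "strain\thost\tdate\nA\t\t2023-01-01\nB\t\t2023-01-02\n"
print(changed_columns(io.StringIO(before), io.StringIO(after)))
```

With real files, pass open file handles instead of `io.StringIO` objects. Columns with a high count are the ones worth looking at first.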

Summarizing the changes in that diff:

@joverlee521 @corneliusroemer Were these expected changes? I didn't see any note of them in #381.

joverlee521 commented 1 year ago

Were these expected changes? I didn't see any note of them in https://github.com/nextstrain/ncov-ingest/pull/381.

The changes for the fields date_updated, submitting_lab and title were expected, as noted in commits https://github.com/nextstrain/ncov-ingest/pull/381/commits/60d603fd3b43b24cb1274a88158ebe78cde8e1e7 and https://github.com/nextstrain/ncov-ingest/pull/381/commits/8ae8d9ac87d2774887678d5af7ae466586cfb852.

The loss of host is unexpected. The values are also empty in s3://nextstrain-data/files/ncov/open/genbank.ndjson.zst, so the data is getting lost somewhere upstream...

We are currently using the host-common-name field, which appears to be empty when I download/format a subset of sequences with:

datasets download virus genome taxon SARS-CoV-2 --filename data/ncbi_datasets.zip --released-after 01/01/2023
dataformat tsv virus-genome --package data/ncbi_datasets.zip --fields accessions,host-common-name,host-name > data/ncbi_metadata.tsv
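
To check which columns in the resulting TSV are actually populated, something like the following sketch could help. It makes no assumption about the header names that `dataformat` writes and simply reports the fraction of empty values for every column it finds. The function name `empty_fraction` and the sample rows are hypothetical.

```python
import csv
import io


def empty_fraction(tsv_file):
    """Return {column name: fraction of rows where that column is blank}."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    empty = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in empty:
            # Treat missing, empty, and whitespace-only values as blank.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return {name: (count / total if total else 0.0) for name, count in empty.items()}


# Made-up sample; the header names and accessions are placeholders only.
sample = (
    "Accession\tHost Common Name\tHost Name\n"
    "X1\t\tHomo sapiens\n"
    "X2\t\tHomo sapiens\n"
)
for column, fraction in empty_fraction(io.StringIO(sample)).items():
    print(f"{column}: {fraction:.0%} empty")
```

For the real output, open `data/ncbi_metadata.tsv` and pass the file handle in place of the `io.StringIO` sample.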

The other field, host-name, is filled in with the scientific name, so I'll just change the pipeline to use that field instead.

emmahodcroft commented 1 year ago

@chaoran-chen Jover's PR above should hopefully resolve this and put back the scientific name of the host - but after this is merged in, please do keep an eye on it and let us know whether it seems to be working!

chaoran-chen commented 1 year ago

Fantastic, yes, I'll do that. Thanks @joverlee521 and everyone!

chaoran-chen commented 1 year ago

The host information is now back in LAPIS open, thanks again! :)

emmahodcroft commented 1 year ago

Wonderful to hear, thanks for letting us know @chaoran-chen ! 😁