Closed chaoran-chen closed 1 year ago
Hmm. https://github.com/nextstrain/ncov-ingest/pull/381 was merged last week. Maybe the host field got broken by that switch?
Appears so. I downloaded the following two versions of s3://nextstrain-data/files/ncov/open/metadata.tsv.zst, which span the merge of #381:
5FRHFn5_GZf5rYUwWsdYsFrGTyXPYpkU (after merge)
8viIxhsktDA13zUrJd.0aJap1u.V.pNi (before merge)
and diffed the first 1,000 rows of them.
Summarizing the changes in that diff:
host column values lost; used to be 100% Homo sapiens and are now 100% empty
date_updated column added
submitting_lab column changed from ? to actual values, e.g. Center for Research on Influenza Pathogenesis (CRIP), CEIRS Data Processing and Coordinating Center
title column values lost; used to be, e.g. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/ARG/Cordoba-1006-155/2020, complete genome and are now 100% empty
@joverlee521 @corneliusroemer Were these expected changes? I didn't see note of them in #381.
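For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch of a per-column summary over the first rows of two metadata versions. It is not the command that was actually used for the diff above. It assumes both versions have already been downloaded and decompressed to TSV text, and the function name summarize_column_changes and its max_rows parameter are made up for illustration.

```python
import csv
import io


def summarize_column_changes(before_tsv, after_tsv, max_rows=1000):
    """Compare two TSV texts column by column over their first max_rows rows.

    Hypothetical helper: reports columns added or removed, and for shared
    columns the fraction of empty values before and after.
    """

    def load(text):
        # QUOTE_NONE: treat quote characters in metadata values literally.
        reader = csv.DictReader(
            io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
        )
        rows = [row for _, row in zip(range(max_rows), reader)]
        return reader.fieldnames or [], rows

    def empty_fraction(rows, column):
        if not rows:
            return 0.0
        empty = sum(1 for row in rows if not (row.get(column) or "").strip())
        return empty / len(rows)

    before_cols, before_rows = load(before_tsv)
    after_cols, after_rows = load(after_tsv)

    summary = {
        "added": [c for c in after_cols if c not in before_cols],
        "removed": [c for c in before_cols if c not in after_cols],
        "empty_fraction": {},
    }
    for column in before_cols:
        if column in after_cols:
            summary["empty_fraction"][column] = (
                empty_fraction(before_rows, column),
                empty_fraction(after_rows, column),
            )
    return summary
```

A column whose empty fraction jumps from 0.0 to 1.0, as host did here, stands out immediately in the returned dictionary.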
Were these expected changes? I didn't see note of them in https://github.com/nextstrain/ncov-ingest/pull/381.
The changes for the fields date_updated, submitting_lab and title were expected, as noted in commits https://github.com/nextstrain/ncov-ingest/pull/381/commits/60d603fd3b43b24cb1274a88158ebe78cde8e1e7 and https://github.com/nextstrain/ncov-ingest/pull/381/commits/8ae8d9ac87d2774887678d5af7ae466586cfb852.
The loss of host is unexpected. The values are also empty in s3://nextstrain-data/files/ncov/open/genbank.ndjson.zst, so the data is getting lost somewhere upstream...
We are currently using the host-common-name field, which appears to be empty when I download/format a subset of sequences with:
datasets download virus genome taxon SARS-CoV-2 --filename data/ncbi_datasets.zip --released-after 01/01/2023
dataformat tsv virus-genome --package data/ncbi_datasets.zip --fields accessions,host-common-name,host-name > data/ncbi_metadata.tsv
The other field, host-name, is filled in with the scientific name, so I'll just make the changes to use that field.
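As a quick way to confirm which of the two host fields is populated, here is a small Python sketch that counts non-empty values per column in a TSV such as data/ncbi_metadata.tsv. The function name non_empty_counts is made up for illustration, and it reads column names from whatever header the file contains, since the exact header text that dataformat writes is not shown in this thread.

```python
import csv
import io


def non_empty_counts(tsv_text):
    """Return (row_count, {column: number of non-empty values}) for a TSV text.

    Hypothetical helper: column names are taken from the file's own header.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            if (row.get(name) or "").strip():
                counts[name] += 1
    return total, counts
```

Reading the file with open("data/ncbi_metadata.tsv", encoding="utf-8").read() and passing the text in should show a count of zero for the common-name column if it really is empty for every record.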
@chaoran-chen Jover's PR above should hopefully resolve this and put back the scientific name of the host. After it is merged, please do keep an eye on it and let us know whether this seems to be working!
Fantastic, yes, I'll do that. Thanks @joverlee521 and everyone!
The host information is now back in LAPIS open, thanks again! :)
Wonderful to hear, thanks for letting us know @chaoran-chen ! 😁
Since very recently, in the open metadata.tsv, the host field is (mostly or always) empty. Is it intended to remove the field?
This is unfortunately breaking the open instance of CoV-Spectrum at the moment because it only shows human sequences by default (https://github.com/GenSpectrum/LAPIS/issues/70).