nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Fix/rki tsv csv #409

Closed (rneher closed 12 months ago)

joverlee521 commented 1 year ago

Looks like there were additional changes to the headers of the file as well. I'll be pushing up more changes shortly.

joverlee521 commented 1 year ago

Triggered a test run with the latest changes.

Once the test run is complete, I'll compare its final metadata output to the current file on S3 to make sure there are no unexpected changes.

joverlee521 commented 1 year ago

The test run completed successfully and uploaded files to the staging S3 bucket.

I compared the RKI entries in s3://nextstrain-staging/files/ncov/open/branch/fix/rki-tsv-csv/metadata.tsv.zst with the RKI entries in the current file at s3://nextstrain-data/files/ncov/open/metadata.tsv.zst:

aws s3 cp s3://nextstrain-staging/files/ncov/open/branch/fix/rki-tsv-csv/metadata.tsv.zst data/rki-test-metadata.tsv.zst
aws s3 cp s3://nextstrain-data/files/ncov/open/metadata.tsv.zst data/open-metadata.tsv.zst

csvtk filter2 -t -f '$database=="rki"' data/rki-test-metadata.tsv.zst > data/rki-test-rki-only.tsv 
csvtk filter2 -t -f '$database=="rki"' data/open-metadata.tsv.zst > data/open-rki-only.tsv 

csv-diff data/open-rki-only.tsv data/rki-test-rki-only.tsv --key strain > data/rki-only.diff 

Summary of changes:

633970 rows changed, 7689 rows added, 52473 rows removed

Not much we can do about the removed records, but I took a look at the first 1000 changed records.

Most changes were just to the date_submitted and sampling_strategy fields. There are now a lot of empty date_submitted fields, but the diff did allow me to catch a new expected date format, which I've added in 3a7f3d4fb207e44c11cd96f4a828c0d1f1d0711d. The sampling_strategy changes are just simplifications of the codes that RKI uses for sequencing reasons.

There are some changes to pango_lineage, but these values come directly from the RKI metadata. There are a few changes to clock_deviation, but I believe that's expected.
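
As a rough cross-check on which fields account for the changed rows, a per-column tally over the two filtered files could look like the sketch below. `tally_changed_columns` is a hypothetical helper, not part of ncov-ingest or csv-diff. It assumes both files are tab-separated with a header row, that the key column (here `strain`) exists and is unique, and that holding the whole old file in awk's memory is acceptable.

```shell
# Hypothetical helper (not part of ncov-ingest): for rows present in both
# TSV files under the same key, count how many rows differ in each column.
tally_changed_columns() {
  # $1 = old TSV, $2 = new TSV, $3 = key column name (e.g. strain)
  awk -F'\t' -v key="$3" '
    FNR == 1 {
      # Header line: map column names to positions separately per file.
      for (i = 1; i <= NF; i++) {
        if (NR == 1) { old_idx[$i] = i; names[i] = $i; ncols = NF }
        else         { new_idx[$i] = i }
      }
      next
    }
    NR == FNR {
      # Old file: remember every value under (key, column name).
      k = $(old_idx[key])
      seen[k] = 1
      for (i = 1; i <= ncols; i++) old[k, names[i]] = $i
      next
    }
    {
      # New file: compare only rows whose key also exists in the old file.
      k = $(new_idx[key])
      if (!(k in seen)) next               # added rows are not "changed"
      for (i = 1; i <= ncols; i++) {
        name = names[i]
        if (!(name in new_idx)) continue   # column missing from new file
        # Append "" to force string (not numeric) comparison.
        if ((old[k, name] "") != ($(new_idx[name]) "")) changed[name]++
      }
    }
    END { for (name in changed) printf "%s\t%d\n", name, changed[name] }
  ' "$1" "$2" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

For example, `tally_changed_columns data/open-rki-only.tsv data/rki-test-rki-only.tsv strain` would print one tab-separated `column count` line per changed column, largest count first.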

corneliusroemer commented 12 months ago

I opened an issue in the RKI repo to let them know that most submission dates have gone missing with the new format: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland/issues/50

If this doesn't get fixed upstream we could patch the submission dates back using dates from the old format, but let's wait for a bit to save us unnecessary work.
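
If it does come to that, the backfill could be sketched roughly as follows. `backfill_column` is a hypothetical helper, not existing ncov-ingest code. It assumes both files are tab-separated with a header row, that `strain` identifies the same record in the old and new files, and that only empty values should be filled (values already present in the new file are left alone).

```shell
# Hypothetical helper (not part of ncov-ingest): fill empty values of one
# column in the new TSV with the value recorded for the same key in the old TSV.
backfill_column() {
  # $1 = old TSV, $2 = new TSV, $3 = key column name, $4 = column to backfill
  awk -F'\t' -v OFS='\t' -v key="$3" -v col="$4" '
    FNR == 1 {
      # Header line: locate the key and target columns in the current file.
      kpos = cpos = 0
      for (i = 1; i <= NF; i++) {
        if ($i == key) kpos = i
        if ($i == col) cpos = i
      }
      if (!kpos || !cpos) {
        print "missing column in " FILENAME > "/dev/stderr"
        exit 1
      }
      if (NR != FNR) print    # emit only the header of the new file
      next
    }
    # Old file: remember non-empty values by key.
    NR == FNR { if ($cpos != "") old[$kpos] = $cpos; next }
    # New file: fill the value only when it is empty and an old one exists.
    $cpos == "" && ($kpos in old) { $cpos = old[$kpos] }
    { print }
  ' "$1" "$2"
}
```

An illustrative call would be `backfill_column data/open-rki-only.tsv data/rki-test-rki-only.tsv strain date_submitted > data/rki-backfilled.tsv`, where the output path is only an example.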