Closed rneher closed 12 months ago
Triggered a test run with the latest changes.
I'll compare the final metadata output from the test run to the current file on S3 once the test run is complete, to make sure there are no unexpected changes.
The test run completed successfully and uploaded files to the staging S3 bucket.
Comparing the RKI entries in s3://nextstrain-staging/files/ncov/open/branch/fix/rki-tsv-csv/metadata.tsv.zst with the RKI entries in the current file at s3://nextstrain-data/files/ncov/open/metadata.tsv.zst:
aws s3 cp s3://nextstrain-staging/files/ncov/open/branch/fix/rki-tsv-csv/metadata.tsv.zst data/rki-test-metadata.tsv.zst
aws s3 cp s3://nextstrain-data/files/ncov/open/metadata.tsv.zst data/open-metadata.tsv.zst
csvtk filter2 -t -f '$database=="rki"' data/rki-test-metadata.tsv.zst > data/rki-test-rki-only.tsv
csvtk filter2 -t -f '$database=="rki"' data/open-metadata.tsv.zst > data/open-rki-only.tsv
csv-diff data/open-rki-only.tsv data/rki-test-rki-only.tsv --key strain > data/rki-only.diff
Summary of changes:
633970 rows changed, 7689 rows added, 52473 rows removed
Not much we can do about the removed records, but I took a look at the first 1000 changed records.
Most changes were just the date_submitted and sampling_strategy fields. There are now a lot of empty date_submitted fields, but the diff did allow me to catch a new expected date format that I've added in 3a7f3d4fb207e44c11cd96f4a828c0d1f1d0711d. The sampling_strategy field changes are just simplifications of the codes that RKI uses for sequencing reasons.
There are some changes to pango_lineage, but these values come directly from RKI metadata. There are a few changes to clock_deviation, but I believe that's expected.
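For anyone who wants to check the field-level picture across all changed rows rather than the first 1000, here is a rough sketch that tallies how many shared strains changed in each column. The function name count_changed_columns is hypothetical, and it assumes both inputs are uncompressed TSVs with the same header row and a strain column, as the two filtered files above should be.

```shell
# Hypothetical helper (not part of the ingest pipeline): for two TSVs keyed on
# "strain", print each column name and the number of shared strains whose
# value differs between the old and new file.
# Assumes both files have the same columns in the same order.
count_changed_columns() {
  awk -F'\t' '
    FNR == 1 {
      if (NR == 1) {                       # header of the old file
        ncol = NF
        for (i = 1; i <= NF; i++) {
          name[i] = $i
          if ($i == "strain") key = i
        }
      }
      next                                 # skip the header of either file
    }
    NR == FNR {                            # rows of the old file
      seen[$key] = 1
      for (i = 1; i <= ncol; i++) old[$key, i] = $i
      next
    }
    ($key in seen) {                       # rows of the new file present in both
      for (i = 1; i <= ncol; i++)
        if (old[$key, i] != $i) changed[name[i]]++
    }
    END { for (c in changed) print c "\t" changed[c] }
  ' "$1" "$2" | sort
}
```

Usage would be something like `count_changed_columns data/open-rki-only.tsv data/rki-test-rki-only.tsv`. Added and removed strains are ignored, so the counts only describe the "rows changed" part of the summary.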
I opened an issue in the RKI repo to let them know that most submission dates have gone missing with the new format: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland/issues/50
If this doesn't get fixed upstream, we could patch the submission dates back using dates from the old format, but let's wait a bit to save ourselves unnecessary work.
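If we do end up patching, one possible shape for it is sketched below. This is only an illustration, not a proposed pipeline change: the function name patch_date_submitted is hypothetical, and it assumes an old and a new uncompressed TSV that both have strain and date_submitted columns. It fills in date_submitted in the new file only where it is empty and the old file has a value for the same strain.

```shell
# Hypothetical helper: backfill empty date_submitted values in the new TSV
# ($2) with values from the old TSV ($1), matching rows on "strain".
# Columns are located by name in each file, so column order may differ.
patch_date_submitted() {
  awk -F'\t' -v OFS='\t' '
    FNR == 1 {                             # header row of either file
      key = 0; col = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "strain") key = i
        if ($i == "date_submitted") col = i
      }
      if (NR != 1) print                   # emit only the new file header
      next
    }
    NR == FNR {                            # old file: remember non-empty dates
      if ($col != "") old_date[$key] = $col
      next
    }
    {                                      # new file: fill gaps, keep the rest
      if ($col == "" && ($key in old_date)) $col = old_date[$key]
      print
    }
  ' "$1" "$2"
}
```

Existing non-empty dates in the new file are left untouched, so an upstream fix would take precedence over the backfilled values.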
Looks like there were additional changes to the headers of the file as well. I'll be pushing up more changes shortly.