Closed joverlee521 closed 2 months ago
I'm going to test this out in the test_vdb database.
test_vdb using master branch
vdb/flu_download
test_vdb using this PR w/ --overwrite flag
07f3053 was enough to correct the submitting lab of the sequences downloaded with vdb/flu_download.py. However, I needed to add 2f0334c to fix the submitting_lab values within the vdb/flu_sequences table with --overwrite.
I'm going to merge and re-upload with --overwrite to fix the existing sequences within the database.
Looking at the currently hosted metadata files on S3, these weird submission lab names started showing up for sequences submitted 2024/02/29. I'll plan to download and re-upload the sequences submitted from 2024/02/28 to 2024/04/18 (our last upload date).
Re-uploaded the sequences with --overwrite and am now running the seasonal-flu upload/builds.
CDC folks flagged that our seasonal flu builds¹ contain a weird submitting lab ("$ins Submitting Name") for many sequences.
I found that these values were coming directly from our FASTA headers in the download from GISAID.² This commit changes our flu upload script to parse the submitting lab from the metadata XLS file instead of the FASTA header since the XLS file contains the correct values.
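As a rough illustration of the change described above, here is a minimal sketch of taking the submitting lab from a metadata row instead of the FASTA header. This is not the code from the commit: the function name `pick_submitting_lab`, the `isolate_id` key, the `Submitting_Lab` column name, and the sample records are all hypothetical, and returning `None` when no metadata value exists is an assumption of this sketch.

```python
def pick_submitting_lab(fasta_record, metadata_by_id):
    """Return the submitting lab from the metadata row for this record.

    The value parsed from the FASTA header is deliberately ignored,
    because that is where the bad "$ins Submitting Name" value came from.
    Returns None when the metadata has no usable value (an assumption of
    this sketch, not necessarily what the upload script does).
    """
    row = metadata_by_id.get(fasta_record["isolate_id"], {})
    lab = (row.get("Submitting_Lab") or "").strip()
    return lab or None


# Hypothetical rows, as if already read from the metadata XLS file
metadata_by_id = {
    "EPI_ISL_0001": {"Submitting_Lab": "Example Public Health Lab"},
}

# Hypothetical record parsed from a FASTA header
fasta_record = {
    "isolate_id": "EPI_ISL_0001",
    "submitting_lab": "$ins Submitting Name",
}

fixed_lab = pick_submitting_lab(fasta_record, metadata_by_id)
print(fixed_lab)
```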
When we download the sequences with vdb/flu_download, the rethinkdb_download merge command³ will give preference to the submitting_lab of the flu_viruses table according to the rethinkdb docs.⁴ So to correct the existing sequences within fauna, we just need to re-upload the sequences that contain the bad submitting lab name.
¹ https://nextstrain.org/flu/seasonal/h3n2/ha/2y@2024-04-18?f_submitting_lab=%24ins%20Submitting%20Name
² https://bedfordlab.slack.com/archives/C03KWDET9/p1713806442592619
³ https://github.com/nextstrain/fauna/blob/8471abf1287004e5dc2af66d80b178abbbfc8d4c/vdb/download.py#L144
⁴ https://rethinkdb.com/api/python/merge/