nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

vdb/flu_upload: parse `submitting_lab` from XLS #153

Closed joverlee521 closed 2 months ago

joverlee521 commented 2 months ago

CDC folks flagged that our seasonal flu builds¹ contain a weird submitting lab ("$ins Submitting Name") for many sequences.

I found that these values were coming directly from our FASTA headers in the download from GISAID.² This commit changes our flu upload script to parse the submitting lab from the metadata XLS file instead of the FASTA header since the XLS file contains the correct values.
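As a rough illustration of the change described above (not the actual fauna code), the sketch below overrides a FASTA-derived `submitting_lab` with the value from the metadata XLS. It assumes the XLS rows have already been read into dictionaries; the function name, the column names `Isolate_Id` and `Submitting_Lab`, the record key `isolate_id`, and all sample values are hypothetical.

```python
def apply_xls_submitting_lab(fasta_records, xls_rows,
                             id_field="Isolate_Id", lab_field="Submitting_Lab"):
    """Return copies of fasta_records with submitting_lab taken from the XLS rows.

    Column and key names are assumptions made for this sketch.
    """
    # Index the XLS rows by isolate ID, skipping rows with an empty lab value.
    lab_by_id = {row[id_field]: row[lab_field] for row in xls_rows if row.get(lab_field)}

    fixed = []
    for record in fasta_records:
        updated = dict(record)  # copy so the input records are left untouched
        lab = lab_by_id.get(record["isolate_id"])
        if lab is not None:
            # Prefer the XLS value over whatever the FASTA header contained.
            updated["submitting_lab"] = lab
        fixed.append(updated)
    return fixed


# Hypothetical sample data: the FASTA header carried the placeholder value.
fasta_records = [{"isolate_id": "EPI_ISL_EXAMPLE", "submitting_lab": "$ins Submitting Name"}]
xls_rows = [{"Isolate_Id": "EPI_ISL_EXAMPLE", "Submitting_Lab": "Example Public Health Lab"}]

fixed_records = apply_xls_submitting_lab(fasta_records, xls_rows)
print(fixed_records[0]["submitting_lab"])  # Example Public Health Lab
```

Records without a matching XLS row keep their FASTA-derived value in this sketch; whether that fallback is desirable depends on the real upload script.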

When we download the sequences with `vdb/flu_download`, the `rethinkdb_download` merge command³ gives preference to the `submitting_lab` of the `flu_viruses` table, according to the RethinkDB docs:⁴

> When there is a conflict between field names, preference is given to fields in the rightmost object in the argument list.
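The "rightmost wins" rule can be mimicked with plain Python dictionaries. This is only a flat sketch of the documented conflict rule, not the RethinkDB driver itself (the real `merge` also merges nested objects recursively); the helper name and the document contents are hypothetical, and it assumes the virus document is the rightmost argument, as described above.

```python
def merge_right_wins(*docs):
    """Flat merge where later (rightmost) documents override earlier ones."""
    merged = {}
    for doc in docs:
        merged.update(doc)  # on a key conflict, the later document's value wins
    return merged


# Hypothetical documents: the sequence document still holds the bad value,
# while the virus document holds the corrected one.
sequence_doc = {
    "strain": "A/Example/1/2024",
    "accession": "EPI_EXAMPLE",
    "submitting_lab": "$ins Submitting Name",
}
virus_doc = {
    "strain": "A/Example/1/2024",
    "submitting_lab": "Example Public Health Lab",
}

merged = merge_right_wins(sequence_doc, virus_doc)
print(merged["submitting_lab"])  # Example Public Health Lab
```

Under that assumption, correcting the value in the virus document is enough for the downloaded output, even if the sequence document still carries the old value.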

So to correct the existing sequences within fauna, we just need to re-upload the sequences that contain the bad submitting lab name.

¹ https://nextstrain.org/flu/seasonal/h3n2/ha/2y@2024-04-18?f_submitting_lab=%24ins%20Submitting%20Name
² https://bedfordlab.slack.com/archives/C03KWDET9/p1713806442592619
³ https://github.com/nextstrain/fauna/blob/8471abf1287004e5dc2af66d80b178abbbfc8d4c/vdb/download.py#L144
⁴ https://rethinkdb.com/api/python/merge/

joverlee521 commented 2 months ago

I'm going to test this out in the test_vdb database.

joverlee521 commented 2 months ago

07f3053 was enough to correct the submitting lab of the sequences downloaded with vdb/flu_download.py. However, I needed to add 2f0334c to fix the submitting_lab values within the vdb/flu_sequences table with --overwrite.

joverlee521 commented 2 months ago

I'm going to merge and re-upload with --overwrite to fix the existing sequences within the database.

Looking at the currently hosted metadata files on S3, these weird submitting lab names started showing up for sequences submitted on 2024/02/29. I plan to download and re-upload the sequences submitted from 2024/02/28 to 2024/04/18 (our last upload date).
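For illustration only, selecting the records that fall in that re-upload window could look like the sketch below. The `submission_date` key, its ISO date format, and the sample records are assumptions, not fauna's actual schema.

```python
from datetime import date

# Window taken from the comment above: 2024/02/28 through 2024/04/18, inclusive.
START = date(2024, 2, 28)
END = date(2024, 4, 18)


def needs_reupload(record, start=START, end=END):
    """True if the record's (hypothetical) submission_date falls in the window."""
    submitted = date.fromisoformat(record["submission_date"])
    return start <= submitted <= end


# Hypothetical records for demonstration.
records = [
    {"strain": "A/Example/1/2024", "submission_date": "2024-02-27"},
    {"strain": "A/Example/2/2024", "submission_date": "2024-02-29"},
    {"strain": "A/Example/3/2024", "submission_date": "2024-04-18"},
]

to_reupload = [r["strain"] for r in records if needs_reupload(r)]
print(to_reupload)  # ['A/Example/2/2024', 'A/Example/3/2024']
```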

joverlee521 commented 2 months ago

Re-uploaded the sequences with --overwrite and am now running the seasonal-flu upload/builds.