nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

tdb/download: correct flu type in strain names #138

Closed by joverlee521 1 year ago

joverlee521 commented 1 year ago

In the preview of the latest upload of VIDRL data, I noticed that the same serum abbreviation "Dar11" was used for both the "A/Darwin/11/2021" and "B/Darwin/11/2021" serum strains. This led me to investigate the strain names in the various titer TSVs hosted on S3. Almost all of the titer TSVs contained multiple A/B mixups in both the virus_strain and serum_strain fields. I spot-checked a couple of the mixups and found that they were typos in the original Excel tables.

Instead of fixing the various sources and re-uploading them to fauna, I opted to update the tdb/download script to correct the strain names based on the record's subtype. Since our only interactions with the titer databases are through the tdb/download script, this allows us to work with the correct strain names in our seasonal flu workflows.
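As a rough illustration of the idea (not the actual change in this PR), a subtype-based correction could look like the sketch below. The subtype labels in `EXPECTED_TYPE` and the helper names `correct_strain_type` and `correct_record` are assumptions made for this example; only the `virus_strain` and `serum_strain` field names come from the description above.

```python
# Hypothetical mapping from a record's subtype to the flu type its strain
# names should carry. The subtype labels are assumptions for illustration.
EXPECTED_TYPE = {"h3n2": "A", "h1n1pdm": "A", "vic": "B", "yam": "B"}


def correct_strain_type(strain, subtype):
    """Return the strain name with its leading A/B type matching the subtype."""
    expected = EXPECTED_TYPE.get(subtype)
    # Leave the name alone if the subtype is unknown or the name does not
    # look like "<type>/<rest>".
    if expected is None or len(strain) < 2 or strain[1] != "/":
        return strain
    if strain[0] in ("A", "B") and strain[0] != expected:
        return expected + strain[1:]
    return strain


def correct_record(record):
    """Return a copy of the record with both strain fields corrected."""
    fixed = dict(record)
    for field in ("virus_strain", "serum_strain"):
        if field in fixed:
            fixed[field] = correct_strain_type(fixed[field], fixed.get("subtype"))
    return fixed
```

Because the correction is applied at download time, the stored records are left untouched and only the exported strain names change.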

In the future, if/when we want to correct the data within fauna and any future uploads, we need to remember that the strain names are part of the hash for record indexes. We would need to delete the "bad" records in the database and re-upload using the correct names to ensure we don't have duplicate entries.
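To illustrate why a corrected name would create a duplicate rather than overwrite the old record, here is a minimal sketch that assumes the index is a hash over a few record fields. The field list, the `record_index` helper, the SHA-1 choice, and the `example_table` source value are all hypothetical; fauna's real index construction may differ.

```python
import hashlib


def record_index(record, fields=("virus_strain", "serum_strain", "source")):
    """Build a hypothetical index by hashing selected fields of a record."""
    key = "|".join(str(record[f]) for f in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# A record stored with a mistyped serum strain, and the same record with
# the name corrected. All values here are made up for the example.
bad = {
    "virus_strain": "B/Darwin/11/2021",
    "serum_strain": "A/Darwin/11/2021",
    "source": "example_table",
}
good = dict(bad, serum_strain="B/Darwin/11/2021")

# The corrected record hashes to a different index, so uploading it would
# add a second entry unless the "bad" record is deleted first.
indexes_differ = record_index(bad) != record_index(good)
```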

joverlee521 commented 1 year ago

Tested tdb/download locally and compared results to the output of the master branch to confirm the only differences are the strain names with types mixed up.
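A comparison along these lines could be scripted roughly as follows. This is a sketch under assumptions, not the check that was actually run: it assumes both outputs are TSVs with a header row and rows in the same order, and the helper names are hypothetical.

```python
import csv

STRAIN_FIELDS = ("virus_strain", "serum_strain")


def read_rows(path):
    """Read a TSV with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def only_type_differs(old, new):
    """True if the two names differ only by a leading A versus B."""
    return old[1:] == new[1:] and {old[:1], new[:1]} == {"A", "B"}


def unexpected_differences(old_rows, new_rows):
    """List every difference that is not an A/B swap in a strain field."""
    if len(old_rows) != len(new_rows):
        return [("row count", len(old_rows), len(new_rows))]
    problems = []
    for i, (old, new) in enumerate(zip(old_rows, new_rows)):
        fields = list(old) + [k for k in new if k not in old]
        for field in fields:
            a, b = old.get(field) or "", new.get(field) or ""
            if a == b:
                continue
            if field in STRAIN_FIELDS and only_type_differs(a, b):
                continue
            problems.append((i, field, a, b))
    return problems
```

An empty list from `unexpected_differences(read_rows(old_path), read_rows(new_path))` would indicate that the two outputs differ only in the flu type of strain names.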