While previewing the latest upload of VIDRL data, I realized that the same serum abbreviation "Dar11" was used for both the "A/Darwin/11/2021" and "B/Darwin/11/2021" serum strains. This led me to investigate the strain names in the various titer TSVs hosted on S3. Almost all of the titer TSVs had multiple A/B mixups in both the virus_strain and the serum_strain fields. I spot-checked a couple of the mixups and found they were typos in the original Excel tables.
Instead of fixing the various sources and re-uploading them to fauna, I opted to update the tdb/download script to correct the strain names based on the record's subtype. Since we only interact with the titer databases through the tdb/download script, this allows us to work with the correct strain names in our seasonal flu workflows.
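For illustration, the correction can be sketched roughly as below. This is a minimal sketch and not the actual code in tdb/download: the function name `fix_strain_type` and the subtype keys in `SUBTYPE_TO_TYPE` are assumptions made for the example.

```python
import re

# Assumed mapping from a record's subtype to the influenza type prefix.
# The subtype keys here are hypothetical; adjust to the values in the database.
SUBTYPE_TO_TYPE = {"h3n2": "A", "h1n1pdm": "A", "vic": "B", "yam": "B"}


def fix_strain_type(strain, subtype):
    """Return the strain name with its leading A/ or B/ matching the subtype."""
    expected = SUBTYPE_TO_TYPE.get(subtype.lower())
    if expected is None:
        # Unknown subtype: leave the name untouched rather than guess.
        return strain
    # Only rewrite a leading "A/" or "B/"; other names pass through unchanged.
    return re.sub(r"^[AB]/", expected + "/", strain)
```

The same function would be applied to both the virus_strain and serum_strain fields of each record, so "B/Darwin/11/2021" on an A-subtype record becomes "A/Darwin/11/2021".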
In the future, if/when we want to correct the data within fauna and in any future uploads, we need to remember that the strain names are part of the hash used for record indexes. A re-upload with corrected names would therefore create new records rather than overwrite the existing ones, so we would need to delete the "bad" records in the database and then re-upload using the correct names to avoid duplicate entries.
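To show why a rename produces a second record instead of an update, here is a rough sketch of an index derived from record fields. The field list in `INDEX_FIELDS`, the `record_index` helper, and the use of SHA-1 are all assumptions for illustration; fauna's actual index construction may differ.

```python
import hashlib

# Hypothetical set of fields that feed into a record's index.
INDEX_FIELDS = ("virus_strain", "serum_strain", "serum_id", "source")


def record_index(record):
    """Build a deterministic index from the (assumed) identifying fields."""
    key = "|".join(str(record.get(field, "")) for field in INDEX_FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


bad = {"virus_strain": "B/Darwin/11/2021", "serum_strain": "A/Darwin/6/2021",
       "serum_id": "F1234", "source": "example.xlsx"}
fixed = dict(bad, virus_strain="A/Darwin/11/2021")

# The corrected record hashes to a different index, so uploading it would
# add a new entry alongside the old one rather than replace it.
indexes_differ = record_index(bad) != record_index(fixed)
```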
I tested tdb/download locally and compared the results to the output of the master branch to confirm that the only differences are the strain names that had their types mixed up.
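A comparison along these lines could be scripted as in the sketch below. It is not the check that was actually run: the helper names are hypothetical, and it assumes both outputs have been loaded as lists of dicts (for example with `csv.DictReader`) with rows in the same order.

```python
def strip_type(strain):
    """Drop a leading "A/" or "B/" so names can be compared ignoring type."""
    return strain[2:] if strain[:2] in ("A/", "B/") else strain


def unexpected_differences(old_rows, new_rows,
                           strain_fields=("virus_strain", "serum_strain")):
    """List differences that are not just an A/B swap in a strain field."""
    if len(old_rows) != len(new_rows):
        raise ValueError("outputs have different numbers of rows")
    unexpected = []
    for i, (old, new) in enumerate(zip(old_rows, new_rows)):
        for field in old.keys() | new.keys():
            a, b = old.get(field), new.get(field)
            if a == b:
                continue
            # An A/B prefix change in a strain field is the expected fix.
            if (field in strain_fields and a is not None and b is not None
                    and strip_type(a) == strip_type(b)):
                continue
            unexpected.append((i, field, a, b))
    return unexpected
```

An empty return value means every difference between the two outputs is a corrected type prefix.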