nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Gisaid clade and download changes #140

lmoncla closed 1 year ago

lmoncla commented 1 year ago

Description of proposed changes

This pull request includes a series of simple changes to the avian-flu upload protocol to account for a few new features:

  1. GISAID recently began including clade information in the downloadable metadata. This is useful information to download, especially because it includes up-to-date clade 2.3.4.4 and 2.3.2.1 splits. This pull request adds new download and upload fields for fauna.

  2. I've separated out the strain name fixes into their own file that is distinct from the seasonal flu strain name fixes. The file was getting enormous, and there are some very specific, recurrent errors in the avian flu database that are just easier to deal with on their own. Avian flu strain name fixes are now included in avian_flu_strain_name_fix.tsv.

  3. I also added a check to print out strains that have too many partitions. This is a common occurrence in the avian flu data, where data submitters include extra information in the strain name as additional partitions, which breaks the host and country parsing. The check prints these strain names so that they can be added to avian_flu_strain_name_fix.tsv.
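As a rough illustration of items 2 and 3, here is a minimal sketch of how a fix file and a partition check could fit together. It is not the code in this pull request. The function names, the two-column TSV layout (original name, corrected name), and the limit of five `/`-separated partitions (A/host/location/isolate/year) are all assumptions made for the example, and the strain names are made up.

```python
import csv
import io

# Assumption: a well-formed avian flu strain name has at most five
# "/"-separated partitions, e.g. A/host/location/isolate/year.
MAX_PARTITIONS = 5


def load_name_fixes(handle):
    """Read a two-column TSV (original name, corrected name) into a dict.

    The column layout is an assumption for this sketch. Blank lines,
    short rows, and rows starting with '#' are skipped.
    """
    fixes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and not row[0].startswith("#"):
            fixes[row[0]] = row[1]
    return fixes


def find_overpartitioned(strains, fixes, max_partitions=MAX_PARTITIONS):
    """Return strain names that still have too many partitions after fixes."""
    flagged = []
    for name in strains:
        fixed = fixes.get(name, name)  # apply a known fix if one exists
        if len(fixed.split("/")) > max_partitions:
            flagged.append(fixed)
    return flagged


# Hypothetical fix file contents and strain names, for illustration only.
fix_file = io.StringIO("A/duck/Hunan/extra/1/2015\tA/duck/Hunan/1/2015\n")
fixes = load_name_fixes(fix_file)
strains = [
    "A/chicken/Vietnam/12/2005",
    "A/duck/Hunan/extra/1/2015",       # already covered by the fix file
    "A/goose/Guangdong/farm3/1/1996",  # not covered, so it is flagged
]
for name in find_overpartitioned(strains, fixes):
    print(name)
```

Names printed by a check like this are the candidates to add to avian_flu_strain_name_fix.tsv before re-running the upload.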

Testing

I tested this in test_vdb with a series of increasingly large uploads. To incorporate the GISAID clade field into existing data, I re-downloaded all available H5Nx sequences uploaded from 1996 to today and tested their upload in test_vdb. After confirming that the fields were parsed correctly and uploaded into test_vdb, I uploaded to vdb and confirmed that the new clade field had been added properly.