This pull request includes a series of simple changes to the avian-flu upload protocol to account for a few new features:
GISAID recently began including clade information in its downloadable metadata. This is useful information to capture, especially because it includes up-to-date clade 2.3.4.4 and 2.3.2.1 splits. This pull request adds new download and upload fields for fauna.
I've separated out the strain name fixes into their own file that is distinct from the seasonal flu strain name fixes. The file was getting enormous, and there are some very specific, recurrent errors in the avian flu database that are just easier to deal with on their own. Avian flu strain name fixes are now included in avian_flu_strain_name_fix.tsv.
I also added a check that prints out strains that have too many partitions. This is a common occurrence in the avian flu data: submitters include extra information in the strain name as extra partitions, which breaks the host and country parsing. The check prints these strain names so that they can be added to avian_flu_strain_name_fix.tsv.
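For reviewers, the idea behind the partition check can be sketched roughly as follows. This is an illustration, not the code in this pull request: it assumes partitions are the "/"-separated fields of the strain name, the threshold of 5 is a guess, and the names MAX_PARTITIONS, count_partitions, and find_overpartitioned and the example strain names are all hypothetical.

```python
# Hypothetical threshold: assumes a well-formed avian strain name such as
# A/host/location/isolate/year has at most 5 "/"-separated partitions.
MAX_PARTITIONS = 5


def count_partitions(strain):
    """Return the number of "/"-separated partitions in a strain name."""
    return len(strain.strip().split("/"))


def find_overpartitioned(strains, max_partitions=MAX_PARTITIONS):
    """Return the strain names that have more partitions than expected."""
    return [s for s in strains if count_partitions(s) > max_partitions]


if __name__ == "__main__":
    # Made-up strain names, for illustration only.
    examples = [
        "A/chicken/Vietnam/123/2005",
        "A/duck/Hunan/farm3/pond2/77/2012",
        "A/HongKong/156/1997",
    ]
    # Print the flagged names so they can be added to the fix file by hand.
    for strain in find_overpartitioned(examples):
        print(strain)
```

Names flagged this way would still be reviewed manually before being added to avian_flu_strain_name_fix.tsv, since the check only counts partitions and cannot tell which ones are extraneous.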
Testing
I tested this in test_vdb with a series of increasingly large uploads. To incorporate the GISAID clade field into existing data, I re-downloaded all available H5Nx sequences uploaded from 1996 to today and tested their upload in test_vdb. After confirming that the fields were correctly parsed and uploaded into test_vdb, I uploaded to vdb and confirmed that the new clade field had been properly added.