This pull request adds functionality to upload and correctly parse H5Nx sequences from GISAID into fauna. The primary changes are:
Adding the new expected subtypes (H5N1-H5N9) to self.patterns
Adding new host species to the format_host function to accommodate the hosts present in the new set of sequences
A small edit to the function fix_name, which standardizes the geographic locations in the strain name. Some sequences carry arbitrary identifiers that happen to match geographic abbreviations, so they were being rewritten incorrectly. For example, A/chicken/Hubei/wi/1997 was changed to A/chicken/Hubei/Wisconsin/1997. To fix this, the geographic fix is no longer applied to the last 2 fields of the strain name.
Adding h5_clade as a download field in avian_flu_downloady.py, so that H5 clade classification can be tested
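
The fix_name restriction described above can be illustrated with a minimal sketch. This is not the code in this PR: the helper name fix_geo_fields, the GEO_FIXES lookup table, and the protected parameter are hypothetical and exist only to show the idea of skipping the last 2 fields of the strain name.

```python
# Hypothetical lookup from lowercase abbreviations to standardized place
# names; the real table used by fix_name in fauna is not shown here.
GEO_FIXES = {"wi": "Wisconsin", "hubei": "Hubei"}


def fix_geo_fields(strain, geo_fixes=GEO_FIXES, protected=2):
    """Standardize geographic fields, leaving the last `protected` fields alone.

    The trailing fields of a strain name are typically an isolate identifier
    and a year, so they are never treated as locations.
    """
    fields = strain.split("/")
    editable = len(fields) - protected  # fields at index >= editable are skipped
    fixed = [
        geo_fixes.get(field.lower(), field) if i < editable else field
        for i, field in enumerate(fields)
    ]
    return "/".join(fixed)


# The identifier "wi" sits in the last 2 fields, so it is left untouched.
print(fix_geo_fields("A/chicken/Hubei/wi/1997"))
# Here "wi" is in the location position, so it is still standardized.
print(fix_geo_fields("A/chicken/wi/12/1997"))
```

Protecting a fixed number of trailing fields is a simple rule that assumes the identifier and year always come last; strain names that deviate from that layout would need separate handling.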
Testing
I iteratively tested the uploads into test_vdb until upload errors were resolved.