nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Wip h5nx #106

Closed lmoncla closed 3 years ago

lmoncla commented 3 years ago

Description of proposed changes

This pull request adds functionality to upload and correctly parse H5Nx sequences from GISAID into fauna. The primary changes are:

  1. Adding the new expected subtypes (H5N1-H5N9) to self.patterns
  2. Adding new host species to the format_host function to accommodate the hosts present in the new set of sequences
  3. A small edit to the function fix_name, which standardizes the geographic locations in the strain name. Some sequences had arbitrary identifiers that happened to match geo ids. For example, A/chicken/Hubei/wi/1997 was being changed to A/chicken/Hubei/Wisconsin/1997, which is incorrect. To fix this, I restricted the geo fixing so that it is not applied to the last 2 fields of the strain name.
  4. To test H5 clade classification, I added h5_clade as a field to download in avian_flu_download.py
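
For reviewers, the idea behind change 3 can be sketched roughly as follows. This is a minimal illustration and not the actual fix_name code: the function name fix_geo_fields, the GEO_FIXES table, and the protected parameter are all hypothetical, and the real replacement logic in fauna may work differently.

```python
# Hypothetical replacement table for illustration only;
# the real geo fixes in fauna are far more extensive.
GEO_FIXES = {"wi": "Wisconsin", "hubei": "Hubei"}


def fix_geo_fields(strain, geo_fixes=GEO_FIXES, protected=2):
    """Standardize geographic tokens in a strain name.

    The last `protected` slash-separated fields (typically an isolate
    identifier and the year) are left untouched, so identifiers that
    happen to look like geo abbreviations are not rewritten.
    """
    fields = strain.split("/")
    # Number of leading fields that are eligible for geo fixing.
    editable = max(len(fields) - protected, 0)
    fixed = [
        geo_fixes.get(field.lower(), field) if i < editable else field
        for i, field in enumerate(fields)
    ]
    return "/".join(fixed)
```

With this sketch, A/chicken/Hubei/wi/1997 is returned unchanged because "wi" sits in the second-to-last field, while a name such as A/chicken/wi/12/1997 would still have "wi" expanded to "Wisconsin".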

Testing

I iteratively tested the uploads into test_vdb until upload errors were resolved.

jameshadfield commented 3 years ago

Really exciting!