nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

GISAID Upload Pipeline #10

Closed chacalle closed 8 years ago

chacalle commented 8 years ago

When incorporating new sequences from GISAID into nextflu, are only relatively new sequences downloaded, or is everything in GISAID downloaded? vdb_parse currently parses the whole FASTA file first, and only checks whether a virus is already in vdb when it tries to upload each sequence. If every GISAID sequence is going to be in the FASTA each time, determining the lineage for all of them will take a while. In that case vdb_parse should check for the virus in vdb immediately after getting the strain name, before doing any lineage work. If the FASTA only contains relatively new GISAID sequences, this isn't a problem.

trvrb commented 8 years ago

There's no good way to download only new viruses from GISAID. I usually download 20,000 viruses at a time (roughly the last two years), which is the maximum GISAID allows. So, yes, it would be better to first look in the database to see if a strain exists before trying to determine its lineage.
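
A rough sketch of the "check first, then determine lineage" order discussed above. This is not the actual vdb_parse code: `read_fasta`, `new_records`, `known_strains`, and `determine_lineage` are hypothetical names, the database is stood in for by a plain set of strain names, and the strain name is assumed to be the first `|`-separated field of the FASTA header.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def new_records(
    lines: Iterable[str],
    known_strains: Set[str],
    determine_lineage: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build records only for strains that are not already stored."""
    records = []
    for header, sequence in read_fasta(lines):
        # Assumption: the strain name is the first '|'-separated header field.
        strain = header.split("|")[0].strip()
        if strain in known_strains:
            continue  # cheap lookup first, so the slow lineage step is skipped
        records.append(
            {
                "strain": strain,
                "sequence": sequence,
                "lineage": determine_lineage(sequence),
            }
        )
    return records
```

With a 20,000-sequence download where most strains are already stored, the expensive `determine_lineage` call would then run only for the handful of genuinely new strains. In a real setup `known_strains` would presumably be replaced by a query against the database.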

trvrb commented 8 years ago

This has been resolved by vdb/flu_update.py -db vdb -v flu --update_groupings.