nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Parse subtype from sequence #9

Closed: trvrb closed this issue 8 years ago

trvrb commented 8 years ago

A large fraction of the GISAID submissions don't include full subtype information. This is especially common for B/Vic and B/Yam. Because of this, asking for A/H3N2 in GISAID won't actually get all the H3N2 sequences. Take a look at what we (Richard) did in the nextflu build to account for this:

https://github.com/blab/nextflu/blob/master/augur/src/make_all.py

This uses BioPython plus the outgroups for H3N2, H1N1pdm, Vic and Yam to make alignments and categorize sequences with ambiguous subtypes. @chacalle do you think you could borrow this code/logic for Flu_vdb_upload.py? With this in place, I could switch to using vdb rather than direct GISAID downloads for my nextflu builds.

chacalle commented 8 years ago

I see. So I could use something similar to determine_lineage to assign a lineage based on similarity to the outgroup sequences?
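A rough sketch of that idea, to make the question concrete. This is not the nextflu code: it swaps the BioPython alignment for a plain k-mer Jaccard similarity so it runs on the standard library alone. The function names, the `k` and `min_score` defaults, and the toy outgroup strings are all assumptions for illustration; real outgroups would come from the GenBank files.

```python
def kmer_set(seq, k=8):
    """Return the set of all length-k substrings of a nucleotide sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_lineage(seq, outgroups, k=8, min_score=0.1):
    """Pick the outgroup most similar to seq, or None if nothing is close.

    outgroups maps a lineage name to its reference sequence. min_score is
    an arbitrary cutoff (an assumption) below which no lineage is assigned.
    """
    if not outgroups:
        return None
    query = kmer_set(seq, k)
    scores = {name: jaccard(query, kmer_set(ref, k))
              for name, ref in outgroups.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


# Toy, made-up reference fragments (NOT real outgroup sequences).
TOY_OUTGROUPS = {
    "H3N2": "ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTTTTCGCT",
    "Vic": "ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAGAT",
}
```

A sequence with an ambiguous subtype would then be labeled with `assign_lineage(sequence, outgroups)`, and left unlabeled when it returns `None`. An alignment-based score, as in the nextflu build, should be more robust to indels than this k-mer shortcut.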

trvrb commented 8 years ago

Exactly. You can move the outgroup GenBank files to source-data/. I think this only needs to be done when the subtype / lineage isn't already specified in the GISAID FASTA.
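The "only when it isn't already specified" part could look roughly like the following. The record layout (a dict with `lineage` and `sequence` keys) and the `infer` callback are hypothetical, chosen just for this sketch; the actual upload script may store these fields differently.

```python
def resolve_lineage(record, infer):
    """Return the declared lineage, inferring one only when it is missing.

    record: dict with hypothetical keys "lineage" and "sequence".
    infer:  callable taking a sequence string and returning a lineage
            (or None), e.g. an outgroup-similarity classifier.
    """
    declared = (record.get("lineage") or "").strip()
    if declared:
        # Trust what the FASTA header already specifies.
        return declared
    # Fall back to sequence-based inference for ambiguous records.
    return infer(record["sequence"])
```

Keeping the inference behind a callback means the comparatively expensive comparison against the outgroups only runs for the records that actually need it.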

trvrb commented 8 years ago

This looks to have been resolved in 639c40607131cde81d01e4566f57e33a31e069f2.