nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Subset download on server #31

Closed trvrb closed 8 years ago

trvrb commented 8 years ago

I'm working on getting vdb integrated into the current nextflu build. I need to generate 4 FASTA files, one each for H3N2, H1N1pdm, Vic and Yam. I'm doing this with:

However, downloading the full database takes ~5 min per lineage, which is definitely impacting performance. Moving the subset logic to the server would improve this.

chacalle commented 8 years ago

I will work on this on the plane tomorrow. As we discussed, the temporary solution would be to subset viruses by lineage and sequences by locus on the server. I might be able to move more of the select and present subsetting to the server, but I will first see how much the temporary solution improves performance.

chacalle commented 8 years ago

I went with a different temporary solution to this problem. At the beginning of the download, instead of fetching all fields from the sequences table, I run sequences = list(r.table(self.sequences_table).without('sequence').run()), which downloads every document without its sequence field. The script then does the subsetting using only the metadata, and downloads the sequence only for the viruses that are left over.
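The strategy described above (fetch metadata first, subset locally, then fetch sequences only for the survivors) can be sketched roughly as follows. This is a minimal illustration and not the actual vdb code: a plain Python list stands in for the RethinkDB sequences table, the `accession` key and all helper names are hypothetical, and the case-insensitive matching is an assumption.

```python
# Stand-in for the sequences table: a plain list of dicts, NOT a RethinkDB table.
# The 'accession' key and the example records are hypothetical.
SEQUENCES = [
    {"accession": "A1", "locus": "ha", "lineage": "seasonal_h3n2", "sequence": "ATGAAG"},
    {"accession": "A2", "locus": "na", "lineage": "seasonal_h3n2", "sequence": "ATGAAT"},
    {"accession": "B1", "locus": "ha", "lineage": "seasonal_h1n1pdm", "sequence": "ATGGCA"},
]


def fetch_metadata(table):
    """Phase 1: return every document minus the bulky 'sequence' field.

    Mimics the effect of r.table(...).without('sequence') on the stand-in list.
    """
    return [{k: v for k, v in doc.items() if k != "sequence"} for doc in table]


def subset(documents, **criteria):
    """Phase 2: keep documents whose fields match all criteria.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    return [
        doc for doc in documents
        if all(str(doc.get(field, "")).lower() == str(value).lower()
               for field, value in criteria.items())
    ]


def fetch_sequences(table, accessions):
    """Phase 3: look up the sequence only for the documents that survived."""
    wanted = set(accessions)
    return {doc["accession"]: doc["sequence"]
            for doc in table if doc["accession"] in wanted}


def two_phase_download(table, **criteria):
    """Run the three phases and re-attach sequences to the kept documents."""
    kept = subset(fetch_metadata(table), **criteria)
    sequences = fetch_sequences(table, [doc["accession"] for doc in kept])
    return [dict(doc, sequence=sequences[doc["accession"]]) for doc in kept]


result = two_phase_download(SEQUENCES, locus="HA", lineage="seasonal_h3n2")
```

The saving comes from phase 1: the metadata pass skips the largest field, so the expensive transfer only happens for documents that pass the subsetting step.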

The old script took 25 minutes on hotel wifi; the new strategy took 10 minutes.

python vdb/flu_download.py -db vdb -v flu --select locus:HA lineage:seasonal_h3n2 --fstem h3n2

Also, the select command should be formatted like --select locus:ha lineage:seasonal_h3n2, with spaces between the select parameters.
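As an illustration of why the spaces matter, here is a small sketch of how a space-separated --select option could be parsed with argparse. It is not the parser used in vdb; `build_parser` and `parse_select` are hypothetical names, and the field:value splitting rule is an assumption.

```python
import argparse


def build_parser():
    """Parser with a --select option that takes one or more field:value tokens."""
    parser = argparse.ArgumentParser()
    # nargs="+" collects every whitespace-separated token after --select into a list.
    parser.add_argument("--select", nargs="+", default=[],
                        help="space-separated field:value pairs")
    return parser


def parse_select(pairs):
    """Turn ['locus:ha', 'lineage:seasonal_h3n2'] into a dict of field -> value."""
    selections = {}
    for pair in pairs:
        field, sep, value = pair.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"expected field:value, got {pair!r}")
        selections[field] = value
    return selections


args = build_parser().parse_args(["--select", "locus:ha", "lineage:seasonal_h3n2"])
selections = parse_select(args.select)
```

With this kind of parser, each select parameter has to arrive as its own token, which is what separating them with spaces on the command line achieves.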

chacalle commented 8 years ago

This still has room for improvement, but it works as a temporary solution.

trvrb commented 8 years ago

Great! I'll test this out. Thanks.

trvrb commented 8 years ago

This is working great. Thanks so much @chacalle!