Closed trvrb closed 8 years ago
I will work on this on the plane tomorrow. As we discussed, the temporary solution would be to subset viruses by lineage and sequences by locus on the server. I think I might be able to move more of the `select` and `present` subsetting to the server, but I will see how much the temporary solution improves performance first.
I went with a different temporary solution to this problem. At the beginning of `download`, instead of downloading all fields from the sequences table, I did `sequences = list(r.table(self.sequences_table).without('sequence').run())`, which downloads all documents without the `sequence` field. The script then goes through subsetting with all the meta information and, for the viruses still left over, downloads their sequences.
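For anyone following along, here is a rough sketch of that two-phase idea. It is not the actual code in `vdb/flu_download.py`: the table is an in-memory list standing in for RethinkDB, and the `accession` key, the helper names and the sample documents are all made up for illustration.

```python
# Hypothetical stand-in for the sequences table. Only the 'sequence' field
# name comes from the thread; the other fields and values are assumptions.
TABLE = [
    {"accession": "A1", "locus": "HA", "lineage": "seasonal_h3n2", "sequence": "ATGAAGACT"},
    {"accession": "A2", "locus": "NA", "lineage": "seasonal_h3n2", "sequence": "ATGAATCCA"},
    {"accession": "A3", "locus": "HA", "lineage": "seasonal_vic", "sequence": "ATGAAGGCA"},
]


def fetch_metadata(table):
    """Phase 1: return every document minus the heavy 'sequence' field.

    Against RethinkDB this corresponds to the .without('sequence') query
    quoted above; here it is simulated with a dict comprehension.
    """
    return [{k: v for k, v in doc.items() if k != "sequence"} for doc in table]


def matches(doc, selections):
    """True if the document matches every field:value pair (case-insensitive)."""
    return all(
        str(doc.get(field, "")).lower() == value.lower()
        for field, value in selections.items()
    )


def fetch_sequences(table, accessions):
    """Phase 2: fetch sequences only for the documents that survived subsetting."""
    wanted = set(accessions)
    return {doc["accession"]: doc["sequence"] for doc in table if doc["accession"] in wanted}


def two_phase_download(table, selections):
    metadata = fetch_metadata(table)                        # light download
    kept = [doc for doc in metadata if matches(doc, selections)]
    sequences = fetch_sequences(table, [doc["accession"] for doc in kept])
    for doc in kept:                                        # re-attach sequences
        doc["sequence"] = sequences[doc["accession"]]
    return kept


result = two_phase_download(TABLE, {"locus": "HA", "lineage": "seasonal_h3n2"})
```

The saving comes from phase 1 skipping the largest field for documents that subsetting would discard anyway. The trade-off is a second round trip for the documents that remain.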
The old script took 25 minutes on hotel wifi; the new strategy took 10 minutes.
python vdb/flu_download.py -db vdb -v flu --select locus:HA lineage:seasonal_h3n2 --fstem h3n2
Also, the select command should be formatted like `--select locus:ha lineage:seasonal_h3n2`, with spaces between the select parameters.
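To illustrate why the spaces matter, here is a minimal sketch of how space-separated `field:value` pairs could be read with `argparse`. This is an assumption about the parsing, not the actual parser in `flu_download.py`, and `parse_select` is a hypothetical helper.

```python
import argparse


def parse_select(pairs):
    """Turn ['locus:HA', 'lineage:seasonal_h3n2'] into a field -> value dict."""
    selections = {}
    for pair in pairs:
        field, sep, value = pair.partition(":")
        if not sep or not field or not value:
            raise ValueError("expected field:value, got %r" % pair)
        selections[field] = value
    return selections


parser = argparse.ArgumentParser()
# nargs='+' collects every space-separated token up to the next option flag.
parser.add_argument("--select", nargs="+", default=[])
parser.add_argument("--fstem")

args = parser.parse_args(
    ["--select", "locus:HA", "lineage:seasonal_h3n2", "--fstem", "h3n2"]
)
selections = parse_select(args.select)
```

In this sketch, a comma-joined argument such as `locus:HA,lineage:seasonal_h3n2` arrives as a single token. It is read as one pair whose value contains the rest of the string, so the lineage filter is silently lost.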
This still has room for improvement but is a temporary solution.
Great! I'll test this out. Thanks.
This is working great. Thanks so much @chacalle!
I'm working on getting vdb integrated into the current nextflu build. I need to generate 4 FASTA files, one each for H3N2, H1N1pdm, Vic and Yam. I'm doing this with:
python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_h3n2 --fstem h3n2
python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_h1n1pdm --fstem h1n1pdm
python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_vic --fstem vic
python vdb/flu_download.py -db vdb -v flu --select locus:HA,lineage:seasonal_yam --fstem yam
However, downloading the full database takes ~5 min per lineage, which is definitely impacting performance. Moving the subset logic to the server would improve this.
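As a rough idea of what server-side subsetting might look like, the sketch below passes the selections to ReQL's `filter` so only matching documents leave the server. The function name and the field names in the usage comment are assumptions, and `r` and `conn` are whatever RethinkDB driver object and open connection the script already holds.

```python
def download_subset(r, conn, table_name, selections):
    """Fetch only the documents whose fields equal the given selections.

    r          -- the imported RethinkDB driver object (assumed to exist)
    conn       -- an open connection to the database (assumed to exist)
    selections -- dict such as {'locus': 'HA', 'lineage': 'seasonal_h3n2'}
    """
    # filter() with a dict keeps documents whose fields match it exactly;
    # the matching runs on the server, so discarded rows are never sent.
    query = r.table(table_name).filter(selections)
    return list(query.run(conn))


# Hypothetical usage:
# docs = download_subset(r, conn, "flu_sequences",
#                        {"locus": "HA", "lineage": "seasonal_h3n2"})
```

An exact-match filter like this is case-sensitive, so `locus:HA` and `locus:ha` would select different documents unless values are normalized first.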