nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Automatic Uploading of Sequences from Genbank #14

Closed. chacalle closed this issue 8 years ago

chacalle commented 8 years ago

Create a function that regularly searches Entrez for new sequences to upload. I think this would be useful if multiple nextstrain websites are being maintained, since it could automate retrieving new sequences from GenBank.

We can query Entrez with something like "Zika virus"[porgn] AND ("2015/01/01"[MDAT] : "2016/04/14"[MDAT]) AND ("10000"[SLEN] : "100000000"[SLEN]). Possibly also only include sequences whose description contains "complete genome". Entrez seems to lag slightly behind a manual GenBank search (it is missing the new sequences KX051563 and KX056898 at the moment).
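A rough sketch of how that query could be built and sent, assuming the public NCBI E-utilities `esearch` endpoint and the `nuccore` database. The function names `build_query` and `search_ids` are made up for illustration and are not part of fauna:

```python
import json
import urllib.parse
import urllib.request

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(organism, start_date, end_date, min_len=10000, max_len=100000000):
    """Build an Entrez term restricted by organism, modification date and length.

    Dates are expected as YYYY/MM/DD strings.
    """
    return (
        f'"{organism}"[porgn] '
        f'AND ("{start_date}"[MDAT] : "{end_date}"[MDAT]) '
        f'AND ("{min_len}"[SLEN] : "{max_len}"[SLEN])'
    )


def search_ids(query, retmax=500):
    """Return the list of record IDs matching the query (needs network access)."""
    params = urllib.parse.urlencode(
        {"db": "nuccore", "term": query, "retmax": retmax, "retmode": "json"}
    )
    with urllib.request.urlopen(f"{ESEARCH_URL}?{params}") as response:
        return json.load(response)["esearchresult"]["idlist"]


query = build_query("Zika virus", "2015/01/01", "2016/04/14")
# ids = search_ids(query)  # uncomment to actually hit NCBI
```

Running this on a schedule and comparing the returned IDs against what is already in the database would give the list of candidate sequences.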

We will want some sort of staging area that shows the important sequence information, where someone could approve sequences for uploading. Possibly email the new sequence information and accession numbers to a user for approval?

trvrb commented 8 years ago

Very cool idea. Thoughts on making this "safe" so that unwanted sequences don't end up in the final database...

Could upload to the test database rather than vdb. Or better yet, we could make a vdb_staging database that mirrors the tables in the vdb database, with a script to copy vdb_staging into vdb.
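To make the staging idea concrete, here is a minimal sketch of the copy step. It uses plain dicts keyed by accession as stand-ins for the vdb_staging and vdb tables, so there are no RethinkDB driver calls in it; the `promote` function and the `virus` field are hypothetical:

```python
def promote(staging, production, approved):
    """Copy approved records from staging into production.

    Records missing from staging or already in production are skipped.
    Returns the accessions that were actually copied.
    """
    copied = []
    for accession in approved:
        record = staging.get(accession)
        if record is None or accession in production:
            continue
        production[accession] = dict(record)  # copy so the two tables stay independent
        copied.append(accession)
    return copied


# In-memory stand-ins for the vdb_staging and vdb tables.
staging = {
    "KX051563": {"accession": "KX051563", "virus": "zika"},
    "KX056898": {"accession": "KX056898", "virus": "zika"},
}
production = {}

copied = promote(staging, production, approved=["KX051563"])
```

Only the approved accession lands in `production`, and running the same promotion twice copies nothing new, which is the "safe" behavior we would want from the real script.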

trvrb commented 8 years ago

This is a great direction. I'm going to shelve this for the moment. Now that the data is basically working, we could start to think about smarter upload scripts. This should integrate well with the overall nextstrain pipeline.

chacalle commented 8 years ago

If ViPR had an API, we could automate the search and download we're doing for Zika and other viruses in the future.

trvrb commented 8 years ago

Yes. This would be awesome.