kylebittinger / brocc

Consensus taxonomy assignment for short reads (great for fungi)
GNU General Public License v3.0

Data retrieval from NCBI too slow using accession numbers #5

Open kylebittinger opened 8 years ago

kylebittinger commented 8 years ago

GI numbers will be phased out in September, but it is unclear how the esearch and elink utilities will work in this context.

Currently, the remove-gi-numbers branch uses efetch to retrieve a summary of nucleotide records, then extracts a taxon ID from this XML document. However, this process is very slow.
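
For illustration, the efetch approach could look roughly like the sketch below. This is not the code from the remove-gi-numbers branch. It assumes a docsum-style XML response in which each `DocSum` element carries `Item` children named `AccessionVersion` and `TaxId`; the function names are hypothetical. Sending several accessions in one request (the `id` parameter accepts a comma-separated list) is one way to cut down on round trips.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public E-utilities efetch endpoint.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_docsum_url(accessions):
    """Build one efetch URL that requests summaries for many accessions."""
    params = {
        "db": "nucleotide",
        "id": ",".join(accessions),  # comma-separated list of accessions
        "rettype": "docsum",
        "retmode": "xml",
    }
    return EUTILS_EFETCH + "?" + urllib.parse.urlencode(params)


def parse_taxon_ids(xml_text):
    """Map accession.version -> taxon ID from a docsum-style XML document.

    Assumes each <DocSum> has <Item Name="AccessionVersion"> and
    <Item Name="TaxId"> children; records missing either are skipped.
    """
    root = ET.fromstring(xml_text)
    taxon_ids = {}
    for docsum in root.iter("DocSum"):
        fields = {item.get("Name"): item.text for item in docsum.iter("Item")}
        accession = fields.get("AccessionVersion")
        taxid = fields.get("TaxId")
        if accession and taxid:
            taxon_ids[accession] = int(taxid)
    return taxon_ids


def fetch_taxon_ids(accessions):
    """Fetch and parse summaries for a batch of accessions (needs network)."""
    with urllib.request.urlopen(build_docsum_url(accessions)) as response:
        return parse_taxon_ids(response.read().decode("utf-8"))
```

Calling `fetch_taxon_ids` with a list of accessions would return a dict keyed by accession.version. Batch size and request rate would still need to respect NCBI's usage limits.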

The ideal solution would be to use esearch or a local database. However, the accession-to-taxon-ID database is larger than 1 GB unzipped, and we would have to add new code to download it automatically. Alternatively, we could use elink, if it is updated in September to work with accession numbers.
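
A minimal sketch of the local-database option, assuming the file follows NCBI's tab-separated accession2taxid layout (a header line, then the columns accession, accession.version, taxid, gi). The table name and function names are hypothetical, and the download step is left out.

```python
import csv
import sqlite3


def build_taxid_db(lines, db_path=":memory:"):
    """Load accession2taxid rows into an SQLite table and return the connection.

    `lines` is any iterable of tab-separated text lines, e.g. an open file.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS acc2taxid ("
        "accession_version TEXT PRIMARY KEY, taxid INTEGER)"
    )
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header line
    # Keep only accession.version (column 2) and taxid (column 3).
    rows = ((row[1], int(row[2])) for row in reader if len(row) >= 3)
    conn.executemany("INSERT OR REPLACE INTO acc2taxid VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup_taxid(conn, accession_version):
    """Return the taxon ID for an accession.version, or None if absent."""
    row = conn.execute(
        "SELECT taxid FROM acc2taxid WHERE accession_version = ?",
        (accession_version,),
    ).fetchone()
    return row[0] if row else None
```

With the real file, the loader could be fed from `gzip.open(path, "rt")` and pointed at an on-disk `db_path`, so the large table is built once and later lookups are a single indexed query.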

The right solution depends on how NCBI implements the change, so we will wait before implementing anything.