sckott / pytaxize

python port of taxize (taxonomy toolbelt) for R
https://sckott.github.io/pytaxize/
MIT License

HTTP Errors. Respect 'X-RateLimit-Remaining' http response header from ncbi entrez #68

Closed dmboyd closed 3 years ago

dmboyd commented 4 years ago

Issue:

I'm getting http errors when running multiple entrez queries within a loop. This is caused by exceeding the entrez api limits.

The http responses from the ncbi/entrez programming utilities include a helpful response header, 'X-RateLimit-Remaining', with a count of how many api requests remain within the applicable rate limit.

It'd be nice if the library respected the API limits automatically.

Potential solution:

When a response shows 'X-RateLimit-Remaining' <= 1, wait for ~1 second before returning the http response, to allow the api limit to reset.
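A minimal sketch of that check, not the library's actual code. The function name `pause_if_depleted`, its parameters, and the one-second default are hypothetical; `headers` is assumed to be any mapping of response headers, such as the `headers` attribute of a `requests` response.

```python
import time


def pause_if_depleted(headers, threshold=1, delay=1.0, sleep=time.sleep):
    """Sleep for `delay` seconds when the remaining quota is at or below `threshold`.

    Hypothetical helper: returns True if it paused, False otherwise.
    `sleep` is injectable so the behaviour can be checked without waiting.
    """
    remaining = headers.get("X-RateLimit-Remaining")
    if remaining is None:
        # Header absent: nothing to act on.
        return False
    try:
        remaining = int(remaining)
    except ValueError:
        # Unparseable value: do not guess, just carry on.
        return False
    if remaining <= threshold:
        sleep(delay)
        return True
    return False
```

It would be called right after each request, e.g. `pause_if_depleted(response.headers)`, before handing the response back to the caller.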

sckott commented 4 years ago

thanks for the issue.

not done this in python before. this gives a clue https://gist.github.com/rsperl/085679536bc991e919d628be4fe8e838#max-retries - the HTTPAdapter class in requests - but may need to dip into urllib3.Retry for more control
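For reference, the HTTPAdapter idea could look roughly like the sketch below. The function name `make_retrying_session` and its defaults are hypothetical. It also assumes the server answers over-limit requests with HTTP 429, so it retries after the limit has been hit rather than reading 'X-RateLimit-Remaining' beforehand.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=3, backoff_factor=1.0):
    """Build a requests.Session that retries rate-limited requests with backoff.

    Hypothetical helper; the retry policy here is an assumption, not pytaxize's.
    """
    retry = Retry(
        total=total,                      # maximum number of retries per request
        backoff_factor=backoff_factor,    # growing sleep between attempts
        status_forcelist=[429],           # retry on "Too Many Requests"
        respect_retry_after_header=True,  # honour Retry-After if the server sends it
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Requests made through the returned session (`session.get(...)`) then pick up the retry policy without any changes at the call sites.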

it'd be nice if there was an out of the box solution for this in a package on pypi or so, know of anything?

dmboyd commented 4 years ago

I've submitted a fairly simple PR to pause on API depletion using the out of the box method in requests.

Longer term, perhaps it makes sense to wrap the ncbi methods within a class for requests.Session reuse/throttling, and/or use the http POST api pattern described within link to bank uids for larger queries (which exceed http url length limits when placed within params).
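A rough sketch of the POST idea, assuming the pattern meant here is the E-utilities EPost endpoint, which stores a list of uids server-side and returns a `QueryKey` and `WebEnv` for later calls. The helper names are hypothetical and only the standard library is used, to keep the sketch self-contained.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: the E-utilities EPost service.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def build_epost_body(db, uids):
    """Encode db and uids as a form body, so the ids never go into the URL."""
    ids = ",".join(str(uid) for uid in uids)
    return urllib.parse.urlencode({"db": db, "id": ids}).encode("ascii")


def parse_epost_result(xml_text):
    """Extract (QueryKey, WebEnv) from an EPost XML response."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def epost(db, uids, timeout=30):
    """POST the uids and return the (QueryKey, WebEnv) pair."""
    # Supplying `data` makes urllib send a POST instead of a GET.
    request = urllib.request.Request(EPOST_URL, data=build_epost_body(db, uids))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_epost_result(response.read())
```

The returned pair would then be passed as `query_key` and `WebEnv` parameters on follow-up calls instead of repeating the full uid list.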

@kmeiklej perhaps a relatively easy area to target. Translating sample entrez code from perl to python should be straightforward.

sckott commented 4 years ago

thanks, having a look at the PR.

one option is biopython, which has an entrez module https://biopython.org/docs/1.75/api/Bio.Entrez.html Seems to handle rate limiting out of the box. Though when i installed it it immediately threw some curl errors, so that doesn't give me hope it would be something to depend on.

dmboyd commented 3 years ago

Biopython's entrez module is certainly hitting that api in a more efficient way (dynamically using POST vs GET). But the model implementation (in Perl) has some api changes that don't appear to be backported to biopython, and you're right: requests is a better approach than using curl/urllib. Closing this specific issue, but will leave it for @kmeiklej to raise a new issue to address this.