krassowski / easy-entrez

Retrieve PubMed articles, text-mining annotations, or molecular data from >35 Entrez databases via an easy-to-use Python package, built on top of the Entrez E-utilities API.
https://easy-entrez.readthedocs.io/en/latest/
GNU Lesser General Public License v3.0

Batch querying #5

Closed Fnyasimi closed 2 years ago

Fnyasimi commented 2 years ago

Hi @krassowski, thanks for the easy API!

I was wondering if there is a way to query in batches. I have a list of 1000 coordinates that I want to query for rsIDs. I would have done it in a for-loop, but the API is limited to 3 queries per second, which makes that impractical.

My main question: is there a method I can use to query the 1000 coordinates for their rsIDs without using a loop? I believe this would be more efficient and faster, as well as getting around the rate limit set by NCBI.

krassowski commented 2 years ago

Thank you for trying out this package. I don't see an easy way out. This is not a limitation of the easy_entrez package; it is just the way the Entrez API was designed. The EFetch, ELink and ESummary endpoints do support multi-item requests (and easy_entrez supports batching when a collection is larger than a single request allows; see in_batches_of in the reference and the demo notebook), but the ESearch endpoint does not: it accepts only one term.
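
Since ESearch takes one term per request, 1000 coordinates means 1000 requests; at 3 requests per second that is roughly 1000 / 3 ≈ 334 seconds, i.e. under six minutes. A minimal sketch of a throttled loop follows. The names `throttled_map` and `query` are hypothetical and not part of easy_entrez; `query` stands in for whatever single-term search call is used.

```python
import time
from typing import Any, Callable, Dict, Iterable


def throttled_map(
    query: Callable[[str], Any],
    terms: Iterable[str],
    max_per_second: float = 3.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call query(term) for each term, starting at most max_per_second calls per second.

    `query` is a placeholder (an assumption of this sketch) for a function
    that sends one search request and returns its result.
    """
    min_interval = 1.0 / max_per_second
    results: Dict[str, Any] = {}
    last_call = None
    for term in terms:
        if last_call is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[term] = query(term)
    return results
```

In practice `query` could be a small wrapper around the package's search call. NCBI also documents a higher limit (10 requests per second) for requests that carry an API key, which would shorten the run further.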

I guess it is because searching is the most expensive of the operations. I would either accept that it will take time, or use a different tool (something from the vcf/bcf tools family, or the new NCBI Variation API: https://api.ncbi.nlm.nih.gov/variation/v0/ SPDI rsid endpoint).
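
For the Variation API route, a rough sketch of how a request URL for the SPDI rsid lookup might be built is shown below. The path layout (`/spdi/{spdi}/rsids`) is an assumption and should be checked against the API documentation at the URL above; the helper names are hypothetical. Note that an SPDI expression needs the reference and alternate alleles as well as the sequence accession and position, so coordinates alone are not enough.

```python
from urllib.parse import quote

# Base URL taken from the comment above.
BASE_URL = "https://api.ncbi.nlm.nih.gov/variation/v0"


def to_spdi(accession: str, position_1based: int, ref: str, alt: str) -> str:
    """Build an SPDI string (sequence:position:deletion:insertion).

    SPDI positions are 0-based, so a 1-based coordinate is shifted by one.
    This simple conversion is only meant for single-nucleotide substitutions.
    """
    return f"{accession}:{position_1based - 1}:{ref}:{alt}"


def rsids_url(spdi: str) -> str:
    """Return the URL for looking up rsIDs for an SPDI expression.

    ASSUMPTION: the '/spdi/{spdi}/rsids' path is not confirmed by this
    thread; verify it in the API documentation before relying on it.
    """
    return f"{BASE_URL}/spdi/{quote(spdi, safe=':')}/rsids"


# Illustrative values only; the position and alleles are made up.
example_spdi = to_spdi("NC_000001.11", 100, "A", "G")
example_url = rsids_url(example_spdi)
```

The resulting URLs could then be requested one by one with any HTTP client, subject to whatever rate limits that service applies.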

Fnyasimi commented 2 years ago

@krassowski Thank you for the feedback. I will explore this further.