lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

Very slow to extract more than 1000 articles #89

Closed waqarg2001 closed 2 years ago

waqarg2001 commented 2 years ago

When max_results is more than 1000, retrieving data is very slow. Why is that?

lukasschwab commented 2 years ago

Can you share your client usage and an example query? Can you qualify "very slow," or describe what kind of performance you want to see?

Depending on the client configuration, this operation could be doing anything from waiting on a single large page of results to sending many requests for smaller pages.

Waiting on a large page may mean the API itself is slow. This is only a client library; it can't change the performance of the hosted arXiv API service.

Sending many requests means repeatedly incurring the overhead cost of an HTTP round trip. Each request after the first also waits for the client's delay_seconds before it is sent: https://github.com/lukasschwab/arxiv.py#client

For 1000 results, the default client will get ten pages of 100 results each and wait three seconds between requests. This seems like the most likely cause of the performance you're seeing, but it's intended behavior. See arXiv's API Terms of Use: https://arxiv.org/help/api/tou#rate-limits
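To put a rough number on that, here is a minimal sketch of the arithmetic, assuming the defaults described above (pages of 100 results, a three-second wait before every request after the first). The helper name min_delay_seconds is hypothetical and not part of arxiv.py; it counts only the deliberate waiting and ignores the time the API takes to answer each request.

```python
import math


def min_delay_seconds(max_results, page_size=100, delay_seconds=3.0):
    """Return (pages, seconds spent in deliberate delays between requests).

    Hypothetical helper for illustration; not part of arxiv.py.
    Assumes one request per page and a delay before every request
    except the first. Network and API response time are not included.
    """
    pages = math.ceil(max_results / page_size)
    waits = max(pages - 1, 0)  # no delay before the first request
    return pages, waits * delay_seconds


pages, delay = min_delay_seconds(1000)
print(f"{pages} requests, at least {delay:.0f}s spent waiting between them")
```

Under these assumptions, a larger page size means fewer requests and less time spent in delays, at the cost of waiting longer on each individual page.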