Open dev-89 opened 2 years ago
Out of curiosity, did you run into rate-limiting yourself? Do you know when it kicked in (roughly)?
There's an export.arxiv.org record for every result from the API, so it should be safe to add the export subdomain before downloading, but it might be best to manage this with an optional flag in the download_pdf/download_source arguments.
We also need to confirm the download behavior when a PDF does not already exist for the export.arxiv.org record. In the browser, there's an intermediate "we're generating this PDF from source" page (screenshot below), then a redirect to the PDF once it's generated.
These cases must be handled gracefully.
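One way to handle the "PDF is still being generated" case is to retry until the response is actually a PDF. The sketch below is only an illustration: fetch_pdf, urllib_fetch, and the retry parameters are hypothetical names that are not part of arxiv.py, and it assumes the interim page is served as something other than application/pdf, which has not been verified against export.arxiv.org.

```python
import time
from typing import Callable, Tuple
from urllib.request import urlopen

# A fetcher takes a URL and returns (content_type, body).
Fetch = Callable[[str], Tuple[str, bytes]]


def urllib_fetch(url: str) -> Tuple[str, bytes]:
    """Fetch a URL with the standard library; urlopen follows redirects."""
    with urlopen(url) as resp:
        return resp.headers.get_content_type(), resp.read()


def fetch_pdf(
    url: str,
    fetch: Fetch = urllib_fetch,
    retries: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Return the PDF bytes, retrying while a non-PDF page is served.

    Assumption: an interim "generating PDF" page does not have the
    application/pdf content type and does not start with the %PDF marker.
    """
    for attempt in range(retries + 1):
        content_type, body = fetch(url)
        if content_type == "application/pdf" and body.startswith(b"%PDF"):
            return body
        if attempt < retries:
            sleep(wait_seconds)  # give the server time to build the PDF
    raise RuntimeError(f"no PDF served for {url} after {retries + 1} attempts")
```

Passing the fetcher and the sleep function as arguments keeps the retry logic testable without touching the network.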
I honestly think this library should default to using export.arxiv.org for everything, with an optional flag to use the live site, which is not open to robots. The first thing I did with this library was accidentally run a query that got me blocked from arXiv for several hours. I bet a lot of users run into this, given the default values (default page size of 300000, for example, is enough to get one blocked).
@brandonrobertz this library does use export.arxiv.org for everything except download URLs: https://github.com/lukasschwab/arxiv.py/blob/678ba9f20ae4a69abd6215b162329f8bd4ab4f91/arxiv/arxiv.py#L513
The difference is that it receives download URLs from the API instead of building them.
Digression: let's chat limits.
default page size of 300000, for example, is enough to get one blocked
The default (Client).page_size is 100.
If you're referring to the max_results limit in README.md, max_results isn't a page size; it's the maximum number of results across all pages for a search. If (Search).max_results = 300000 and (Client).page_size = 100, the client will make up to 3000 requests (iff there are ≥300,000 results available).
Consecutive requests are separated by (Client).delay_seconds. That delay between requests is meant to appease arXiv's rate limits, even for large queries. Did you call (Result).download_pdf or (Result).download_source 300,000 times? If not, mind opening a separate issue to discuss your use case?
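The arithmetic above can be checked with a few lines. This is a back-of-the-envelope sketch, not code from the library: the helper names are made up, and the 3-second delay is an assumed value used only for illustration.

```python
import math


def request_count(max_results: int, page_size: int) -> int:
    """Upper bound on API requests needed to page through max_results."""
    return math.ceil(max_results / page_size)


def min_total_delay(max_results: int, page_size: int, delay_seconds: float) -> float:
    """Lower bound on time spent waiting between consecutive requests."""
    gaps = max(request_count(max_results, page_size) - 1, 0)
    return gaps * delay_seconds


print(request_count(300_000, 100))         # 3000
print(min_total_delay(300_000, 100, 3.0))  # 8997.0 (assumed 3 s delay)
```

With those numbers, a full 300,000-result search would spend roughly two and a half hours just waiting between pages.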
Interesting, sorry about the bad assumption; I didn't realize this used the export site. That's even more perplexing, then. And no, I didn't call download_pdf 300k times. I got a 403 after attempting results = arxiv.Search(query="cat:cs.LG").results()
I can open a separate PR.
@brandonrobertz No worries! Happy to advise.
Motivation
The arxiv library uses the export.arxiv.org subdomain for querying papers, but downloads papers directly from arxiv.org. As a result, a user who downloads too many papers can get blocked from arXiv.
Solution
A solution would be to modify the paper's PDF URL to point to the corresponding export subdomain. In the code for my personal use I simply use:
where paper is a Result instance. This solution is lacking, though, since the corresponding export URL is not guaranteed to exist; this would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced in case some users wish to download directly from arxiv.org, even though that is not advised according to the "Play Nice" section of https://arxiv.org/help/bulk_data.
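A minimal sketch of the URL rewrite described above, offered as one possible shape rather than the code from this issue or from arxiv.py. The function name to_export_url and the flag name use_export are hypothetical, and the sketch does not check whether the export URL actually exists.

```python
from urllib.parse import urlsplit, urlunsplit

EXPORT_HOST = "export.arxiv.org"
LIVE_HOSTS = ("arxiv.org", "www.arxiv.org")


def to_export_url(url: str, use_export: bool = True) -> str:
    """Swap the arxiv.org host for export.arxiv.org, keeping path and query.

    use_export is a hypothetical opt-out flag: when False, the URL is
    returned unchanged so the download goes to the live site.
    """
    if not use_export:
        return url
    parts = urlsplit(url)
    if (parts.hostname or "") not in LIVE_HOSTS:
        return url  # already on the export host, or not an arXiv URL
    return urlunsplit(parts._replace(netloc=EXPORT_HOST))
```

For example, to_export_url("https://arxiv.org/pdf/2101.00001v1") returns "https://export.arxiv.org/pdf/2101.00001v1", while passing use_export=False leaves the URL untouched.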