dev-89 commented 2 years ago

Motivation

The arxiv library queries papers through the export.arxiv.org subdomain, but downloads them directly from arxiv.org. As a result, a user who downloads too many papers can get blocked by arXiv.

Solution

A solution would be to modify the paper's PDF URL to point to the corresponding export subdomain. In the code for my personal use I simply do:

idx = paper.pdf_url.index('arxiv')
paper.pdf_url = paper.pdf_url[:idx] + 'export.' + paper.pdf_url[idx:]

where paper is a Result instance. This solution is incomplete, though, since the export subdomain does not have to exist; that would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced in case some users wish to download directly from arxiv.org, even though that is not advised according to the "Play Nice" section of https://arxiv.org/help/bulk_data.
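
A slightly more defensive version of the same idea might look like the sketch below. It is only an illustration: to_export_url is a hypothetical helper, not part of the library, and the user_export parameter simply mirrors the flag name proposed above. It parses the URL rather than searching for the substring 'arxiv', so a URL that already points at export.arxiv.org is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(pdf_url: str, user_export: bool = True) -> str:
    """Point an arxiv.org URL at export.arxiv.org (hypothetical helper).

    `user_export` mirrors the flag proposed in this issue; it is not an
    existing option in the library.
    """
    if not user_export:
        return pdf_url
    parts = urlsplit(pdf_url)
    # Only rewrite the main arxiv.org host. Anything else, including a URL
    # that already uses the export subdomain, is returned unchanged.
    if parts.hostname not in ("arxiv.org", "www.arxiv.org"):
        return pdf_url
    return urlunsplit(parts._replace(netloc="export.arxiv.org"))
```

With this, paper.pdf_url = to_export_url(paper.pdf_url) would replace the two lines above.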

lukasschwab commented 2 years ago

Out of curiosity, did you run into rate-limiting yourself? Do you know when it kicked in (roughly)?

There's an export.arxiv.org record for every result from the API, so it should be safe to add the export subdomain before downloading, but it might be best to manage this with an optional flag in the download_pdf/download_source arguments.

We also need to confirm the download behavior when a PDF does not already exist for the export.arxiv.org record. In the browser, there's an intermediate "we're generating this PDF from source" page (screenshot below), then a redirect to the PDF once it's generated.

These cases must be handled gracefully.
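
One way to handle the "still generating" case could be to retry until the response is actually a PDF. The sketch below is a rough illustration, not the library's behavior: download_when_ready and its fetch callable are hypothetical names, and it assumes (unverified) that the intermediate page is served with a non-PDF content type. fetch is passed in so the retry logic can be exercised without touching the network.

```python
import time
from typing import Callable, Tuple

# Hypothetical signature: fetch(url) returns (content_type, body).
Fetch = Callable[[str], Tuple[str, bytes]]


def download_when_ready(
    url: str,
    fetch: Fetch,
    retries: int = 5,
    wait_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Fetch url, retrying while the server returns something other than a PDF."""
    for attempt in range(retries):
        content_type, body = fetch(url)
        # Assumption: the "generating this PDF" page is not application/pdf.
        if content_type.startswith("application/pdf"):
            return body
        # Wait before the next attempt, but not after the final one.
        if attempt < retries - 1:
            sleep(wait_seconds)
    raise RuntimeError(f"no PDF available at {url} after {retries} attempts")
```

Raising after a bounded number of attempts keeps a missing PDF from turning into an endless stream of requests to arXiv.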

brandonrobertz commented 2 years ago

I honestly think this library should default to using export.arxiv.org for everything, with an optional flag to use the live site, which does not allow robots. The first thing I did with this library was accidentally run a query that got me blocked from arXiv for several hours. I bet a lot of users run into this, given the default values (a default page size of 300000, for example, is enough to get one blocked).

lukasschwab commented 2 years ago

@brandonrobertz this library does use export.arxiv.org for everything except download URLs: https://github.com/lukasschwab/arxiv.py/blob/678ba9f20ae4a69abd6215b162329f8bd4ab4f91/arxiv/arxiv.py#L513

The difference is that it receives download URLs from the API instead of building them.

Digression: let's chat limits.

"default page size of 300000, for example, is enough to get one blocked"

The default (Client).page_size is 100.

If you're interpreting the max_results limit in README.md, max_results isn't a page size; it's the maximum number of results across all pages for a search. If (Search).max_results = 300000 and (Client).page_size = 100, the client will make up to 3000 requests (iff there are ≥300,000 results available).

Maybe there should be a lower default.
Maybe there's a bug in the client code around delay_seconds. That delay between requests is meant to appease arXiv's rate limits, even for large queries.
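
For a rough sense of scale, the request count described above can be worked out as follows. This is a back-of-the-envelope sketch with hypothetical helper names; it does not call the library, and the delay value passed in the usage line is purely illustrative.

```python
import math


def estimate_requests(max_results: int, page_size: int, available: int) -> int:
    """Number of page requests needed to fetch min(max_results, available) results."""
    fetched = min(max_results, available)
    return math.ceil(fetched / page_size)


def estimate_min_seconds(requests: int, delay_seconds: float) -> float:
    """Lower bound on time spent waiting: one delay between consecutive requests."""
    return max(requests - 1, 0) * delay_seconds


# max_results = 300000 with page_size = 100, assuming at least 300,000 results exist.
requests_needed = estimate_requests(max_results=300_000, page_size=100, available=300_000)
# delay_seconds=3.0 is an illustrative value, not a claim about the client's default.
waiting = estimate_min_seconds(requests_needed, delay_seconds=3.0)
```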

Did you call (Result).download_pdf or (Result).download_source 300,000 times? If not, would you mind opening a separate issue to discuss your use case?

brandonrobertz commented 2 years ago

Interesting, sorry about the bad assumption; I didn't realize this used the export site. That's even more perplexing, then. And no, I didn't call download_pdf 300k times. I got a 403 after attempting results = arxiv.Search(query="cat:cs.LG").results()

I can open a separate PR.

lukasschwab commented 2 years ago

@brandonrobertz No worries! Happy to advise.

lukasschwab / arxiv.py

Enable user to use .export for PDF download #87
