lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

Requesting further publications exceeding max_results? #151

Closed janklos closed 12 months ago

janklos commented 12 months ago

Hi,

I'd like to query all publications for a specific keyword, e.g. "quantum", in a specific year. The number of matches can easily exceed max_results.

How can I query the "other" publications which are not among the first max_results?

In an earlier version of this wrapper, this feature has been possible with the "start" parameter.

Best, Jan.

lukasschwab commented 12 months ago

Hi @janklos — thanks for your question! Here are a few clarifications.

> In an earlier version of this wrapper, this feature has been possible with the "start" parameter.

In recent versions, you can control the initial start parameter by passing a non-zero offset value to Client.results (docs). That offset becomes the start value for the first API call.

An example — this code skips the first 213 results:

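import arxiv

# max_results caps the total number of results, counted from the start of the query.
s = arxiv.Search(query="quantum", max_results=1000)
# offset=213 skips the first 213 results, so at most 787 are yielded.
results = arxiv.Client().results(s, offset=213)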

> How can I query the "other" publications which are not among the first max_results?

I'm not sure I'm interpreting this correctly.

Unfortunately, this API doesn't support querying by dates. That's a frequently requested feature.

Please let me know if this doesn't answer your question. Unfortunately, I think you may be running into a limitation in the underlying API. Consider OAI-PMH!

janklos commented 12 months ago

Hi @lukasschwab thanks for your quick reply!

I somehow have the impression that using offset is not the same as using start in the earlier version, and hence the overall search is limited to max_results and to the time it takes to query max_results results.

From a few tests, it seems that the offset only skips the first results independent of the query instead of offsetting the search. Hence, one might as well set it to zero without losing information or gaining time.

For my specific use case of searching for publications from e.g. 2014, I would have to set a suitably large max_results, e.g. 10,000, and wait until the query is finished. The problem with this approach is that these queries scale very unfavorably in time (independent of offset) and I need to assume a suitably large max_results value. In the past and using the recent versions, I iteratively queried e.g. 100 results in reasonable time and adjusted the start parameter according to my needs.

If this use of offset is intended, I guess I have to follow your suggestion and consider OAI-PMH.

lukasschwab commented 12 months ago

Hi @janklos — thanks for clarifying! The newer client versions still initialize start as you'd expect. It sounds like you were previously paginating manually (advancing the start parameter); now that's handled internally, up to max_results.

Unfortunately, in this process the meaning of max_results has diverged somewhat from its meaning in the API documentation. The max_results setting on a Search controls the total number of results the resulting generator can return, not the page size.

Turning on detailed logging may make the actual behavior clearer:

>>> import arxiv
>>> import logging
>>> logging.basicConfig(level=logging.INFO)
>>>
>>> client = arxiv.Client(page_size=10)
>>> search = arxiv.Search(query="quantum", max_results=100)
>>> generator = client.results(search, offset=13)
>>>
>>> results = list(generator)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=13&max_results=10
INFO:arxiv:Got first page: 10 of 385452 total results
INFO:arxiv:Sleeping: 2.934796 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=23&max_results=10
INFO:arxiv:Sleeping: 2.983968 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=33&max_results=10
INFO:arxiv:Sleeping: 2.986775 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=43&max_results=10
INFO:arxiv:Sleeping: 2.984653 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=53&max_results=10
INFO:arxiv:Sleeping: 2.986806 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=63&max_results=10
INFO:arxiv:Sleeping: 2.984263 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=73&max_results=10
INFO:arxiv:Sleeping: 2.984475 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=83&max_results=10
INFO:arxiv:Sleeping: 2.985515 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=93&max_results=10
>>>
>>> len(results)
87

The generator starts at the specified offset (13) and fetches pages of 10 results until it reaches max_results (100), so it yields 100 - 13 = 87 results.
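To make the arithmetic in that log concrete, here is a minimal sketch that predicts how many results the generator yields and how many page requests it makes. The helper name expected_counts is hypothetical and not part of arxiv.py. It assumes the query matches at least max_results entries and that offset is smaller than max_results.

```python
import math
from typing import Tuple


def expected_counts(offset: int, max_results: int, page_size: int) -> Tuple[int, int]:
    """Return (results yielded, page requests made) for one generator run.

    Hypothetical helper, not part of arxiv.py. Assumes the query matches at
    least max_results entries, so max_results is the limiting factor.
    """
    if not 0 <= offset < max_results:
        raise ValueError("this sketch only models 0 <= offset < max_results")
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # max_results is an absolute cap, so the offset eats into it.
    yielded = max_results - offset
    # The last page may be only partly consumed, hence the ceiling.
    requests = math.ceil(yielded / page_size)
    return yielded, requests


print(expected_counts(13, 100, 10))  # (87, 9): matches the 9 requests logged above
```

The same function gives (100, 10) for offset=9900, max_results=10000, page_size=10.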

This may suit your 2014 example; e.g.

import arxiv
client = arxiv.Client(page_size=10)
search = arxiv.Search(query="test", max_results=10000)
generator = client.results(search, offset=9900) # Makes 10 requests, yields 100 results.
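If you want to get back to the old habit of fetching a fixed-size window per call, one option is to compute an (offset, max_results) pair for each window and build a new Search for each one. The helper below is a hypothetical sketch, not part of arxiv.py. It only does the bookkeeping, and the commented lines show how the pairs might be passed to the calls used earlier in this thread. It also assumes the result ordering stays stable between calls, which I have not verified.

```python
from typing import Iterator, Tuple


def windows(first: int, last: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, max_results) pairs covering results first..last-1.

    Hypothetical helper: each pair describes one window of at most `size`
    results, because max_results is an absolute cap rather than a page size.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(first, last, size):
        yield start, min(start + size, last)


# Sketch of how the pairs could be used (requires the arxiv package and network access):
#
#   client = arxiv.Client(page_size=100)
#   for offset, max_results in windows(0, 10000, 100):
#       search = arxiv.Search(query="quantum", max_results=max_results)
#       batch = list(client.results(search, offset=offset))

print(list(windows(0, 250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Each window then costs one request when size equals page_size, at the price of creating one Search per window.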