Closed janklos closed 12 months ago
Hi @janklos — thanks for your question! Here are a few clarifications.
In an earlier version of this wrapper, this feature has been possible with the "start" parameter.
In recent versions, you can control the initial start parameter by passing a non-zero offset value to results (docs). That parameter determines the initial start for the first API call.
An example — this code skips the first 213 results:
import arxiv
s = arxiv.Search(query="quantum", max_results=1000)
results = arxiv.Client().results(s, offset=213)
How can I query the "other" publications which are not among the first max_results?
I'm not sure I'm interpreting this correctly. If you simply need more results, you can raise max_results, e.g. arxiv.Search(query="quantum", max_results=30000).
Unfortunately, this API doesn't support querying by dates. That's a frequently requested feature.
Please let me know if this doesn't answer your question. Unfortunately, I think you may be running into a limitation in the underlying API. Consider OAI-PMH!
Hi @lukasschwab, thanks for your quick reply!
I have the impression that using offset is not the same as using start in the earlier version, so the overall search is still limited to max_results and takes as long as querying max_results results.
From a few tests, it seems that offset only skips the first results, independent of the query, instead of offsetting the search. Hence, one might as well set it to zero without losing information or gaining time.
For my specific use case of searching for publications from e.g. 2014, I would have to set a suitably large max_results, e.g. 10,000, and wait until the query is finished. The problem with this approach is that these queries scale very unfavorably in time (independent of offset) and that I need to guess a suitably large max_results value.
In the past, using the earlier versions, I iteratively queried e.g. 100 results at a time in reasonable time and adjusted the start parameter according to my needs.
If this use of offset is intended, I guess I have to follow your suggestion and consider OAI-PMH.
Hi @janklos, thanks for clarifying! The newer client versions still initialize start as you'd expect. It sounds like you were previously paginating manually (advancing the start parameter); now that's handled internally, up to max_results.
Unfortunately, in this process the meaning of max_results has diverged somewhat from its meaning in the API documentation. The max_results setting on a Search controls the total number of results the resulting generator can return, not the page size.
Turning on detailed logging may make the actual behavior clearer:
>>> import arxiv
>>> import logging
>>> logging.basicConfig(level=logging.INFO)
>>>
>>> client = arxiv.Client(page_size=10)
>>> search = arxiv.Search(query="quantum", max_results=100)
>>> generator = client.results(search, offset=13)
>>>
>>> results = list(generator)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=13&max_results=10
INFO:arxiv:Got first page: 10 of 385452 total results
INFO:arxiv:Sleeping: 2.934796 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=23&max_results=10
INFO:arxiv:Sleeping: 2.983968 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=33&max_results=10
INFO:arxiv:Sleeping: 2.986775 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=43&max_results=10
INFO:arxiv:Sleeping: 2.984653 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=53&max_results=10
INFO:arxiv:Sleeping: 2.986806 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=63&max_results=10
INFO:arxiv:Sleeping: 2.984263 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=73&max_results=10
INFO:arxiv:Sleeping: 2.984475 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=83&max_results=10
INFO:arxiv:Sleeping: 2.985515 seconds
INFO:arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=quantum&id_list=&sortBy=relevance&sortOrder=descending&start=93&max_results=10
>>>
>>> len(results)
87
The generator starts at the specified offset (13) and fetches pages of results until it reaches max_results (100), so it yields 100 - 13 = 87 results.
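The arithmetic behind that log can be reproduced without touching the network. The helpers below are hypothetical and not part of the arxiv package. They assume the client requests pages at start = offset, offset + page_size, and so on while start is below max_results (which matches the nine requests logged above), and that the query has at least max_results matches.

```python
from typing import List


def page_starts(offset: int, page_size: int, max_results: int) -> List[int]:
    """Return the `start` values the client would request, one per page.

    Assumption: pages are requested while start < max_results.
    """
    return list(range(offset, max_results, page_size))


def yielded_count(offset: int, max_results: int) -> int:
    """Return how many results the generator yields, assuming enough matches exist."""
    return max(0, max_results - offset)


starts = page_starts(offset=13, page_size=10, max_results=100)
print(starts)                   # [13, 23, 33, 43, 53, 63, 73, 83, 93]
print(len(starts))              # 9 requests, as in the log
print(yielded_count(13, 100))   # 87
```

The last page (start=93) still asks the API for 10 entries, but only 7 of them fall below max_results, which is why the total is 87 rather than 90.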
This may suit your 2014 example; e.g.
import arxiv
client = arxiv.Client(page_size=10)
search = arxiv.Search(query="test", max_results=10000)
generator = client.results(search, offset=9900) # Makes 10 requests, yields 100 results.
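If you prefer the old workflow of fetching one window at a time, the same two parameters can be wrapped in a small helper. This is only a sketch: window_params and fetch_window are hypothetical names, not library functions, and it assumes max_results acts as an upper bound on the result index, as the log above suggests.

```python
from typing import Dict


def window_params(start: int, count: int) -> Dict[str, int]:
    """Translate an old-style (start, count) window into offset/max_results."""
    if start < 0 or count < 0:
        raise ValueError("start and count must be non-negative")
    return {"offset": start, "max_results": start + count}


def fetch_window(query: str, start: int, count: int, page_size: int = 100):
    """Fetch results [start, start + count) for a query.

    Needs the arxiv package and network access; imported lazily so that
    window_params stays usable without either.
    """
    import arxiv

    params = window_params(start, count)
    client = arxiv.Client(page_size=page_size)
    search = arxiv.Search(query=query, max_results=params["max_results"])
    return list(client.results(search, offset=params["offset"]))


print(window_params(start=9900, count=100))
# {'offset': 9900, 'max_results': 10000}
```

Under those assumptions, fetch_window("quantum", 200, 100) would request only the page covering results 200 through 299, and advancing start between calls reproduces manual pagination.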
Hi,
I'd like to query all publications for a specific keyword, e.g. "quantum", for a specific year, which can easily exceed max_results.
How can I query the "other" publications which are not among the first max_results?
In an earlier version of this wrapper, this feature has been possible with the "start" parameter.
Best, Jan.