Closed by sidphbot 3 years ago
Huh, interesting. Confirmed I can reproduce this with a minimal case:
>>> list(arxiv.Search(id_list=['0112019']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)
And that this isn't caused by stripping the version indicators from the IDs:
>>> list(arxiv.Search(id_list=['0112019v1']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)
I'm pretty confident this is a bug in the underlying API. The client passes these IDs (which we know are valid: they're listed on arxiv.org) directly to the arXiv API. It doesn't do any preprocessing besides comma-separating them.
I reproduced this issue in the browser by generating the query URL:
>>> s = arxiv.Search(id_list=['0112019v1'])
>>> arxiv.Client()._format_url(s, 0, 10)
'http://export.arxiv.org/api/query?search_query=&id_list=0112019v1&sortBy=relevance&sortOrder=descending&start=0&max_results=10'
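For anyone who wants to reproduce the request without the client library, a URL of the same shape can be assembled with the standard library alone. This is only a sketch: `build_query_url` and `API_BASE` are hypothetical names, not part of the `arxiv` package, and the parameter order simply mirrors the URL printed above.

```python
from urllib.parse import urlencode

# Hypothetical helper, not part of the arxiv package.
API_BASE = "http://export.arxiv.org/api/query"


def build_query_url(id_list, start=0, max_results=10):
    """Build a query URL shaped like the one the client printed above."""
    params = {
        "search_query": "",
        # IDs are passed through untouched, only comma-separated.
        # Note: urlencode percent-encodes the comma (and any "/") in the value.
        "id_list": ",".join(id_list),
        "sortBy": "relevance",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return API_BASE + "?" + urlencode(params)


print(build_query_url(["0112019v1"]))
```

Pasting the printed URL into a browser should show the same 400 response described next.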
The API gives a 400 response, but with a non-empty feed body:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D0112019v1%26start%3D0%26max_results%3D10%26sortBy%3Drelevance%26sortOrder%3Ddescending" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=&id_list=0112019v1&start=0&max_results=10&sortBy=relevance&sortOrder=descending</title>
  <id>http://arxiv.org/api/ICCqNwWyrQkMAErZidA/EoTr7/o</id>
  <updated>2021-07-12T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
    <updated>2021-07-12T00:00:00-04:00</updated>
    <link href="http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1" rel="alternate" type="text/html"/>
    <author>
      <name>arXiv api core</name>
    </author>
  </entry>
</feed>
At least this confirms the issue: the API believes the ID (0112019v1) is of an incorrect format. My hunch is that this was an early ID format (0112019 and 0205137 are papers from 2001 and 2002, respectively) that the API just doesn't support.
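Since the error arrives as an ordinary Atom entry, it can be pulled out of the response body even though the status is 400. The sketch below assumes error entries can be recognised by an `<id>` containing `/api/errors#`, as in the feed above; `extract_api_errors` is a hypothetical helper and `SAMPLE_FEED` is a trimmed copy of that response.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the error feed shown above (XML declaration omitted).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
  </entry>
</feed>"""


def extract_api_errors(feed_xml):
    """Return the summaries of all entries that look like API errors."""
    root = ET.fromstring(feed_xml)
    errors = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="")
        # Assumption: error entries are identified by this URL fragment.
        if "/api/errors#" in entry_id:
            errors.append(entry.findtext(ATOM + "summary", default="").strip())
    return errors


print(extract_api_errors(SAMPLE_FEED))
```

Surfacing the summary text would make failures like this one easier to diagnose than a bare "HTTP 400".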
I'll shoot a message to the Google Group. Unfortunately, this doesn't seem like an issue a client library can fix ☹️
These IDs use an old (pre-March 2007) identifier format. The structure of that old identifier and the motivation for the 2007 change are described here:
All existing articles retain their original identifiers but newly announced articles have identifiers following the new scheme.
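To make the distinction concrete, here is a rough sketch that tells the identifier shapes apart. The regular expressions are approximations written from memory of the arXiv scheme (new-style `YYMM.NNNN` or `YYMM.NNNNN`; old-style `archive/YYMMNNN` with an optional two-letter subject class such as `math.GT`), not an official grammar, and `classify_arxiv_id` is a hypothetical helper.

```python
import re

# Approximate patterns; treat them as assumptions, not an official grammar.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")
BARE_OLD = re.compile(r"^\d{7}(v\d+)?$")


def classify_arxiv_id(arxiv_id):
    """Classify an ID as 'new', 'old', 'bare-old', or 'unknown'."""
    if NEW_STYLE.match(arxiv_id):
        return "new"
    if OLD_STYLE.match(arxiv_id):
        return "old"
    if BARE_OLD.match(arxiv_id):
        # Seven digits with no archive prefix: the form the API rejected.
        return "bare-old"
    return "unknown"


for candidate in ["0112019", "quant-ph/0112019v1", "math.GT/0309136", "2107.05580"]:
    print(candidate, classify_arxiv_id(candidate))
```

A check like this could flag bare old-style numbers before they are sent to the API.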
Aha! It looks like old-form IDs can be requested; they just need to be fully-qualified with the archive and (where applicable) subject class.
The fully-qualified ID for 0112019 is quant-ph/0112019. Accordingly, the following code works:
>>> import arxiv
>>> list(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
[arxiv.Result(entry_id='http://arxiv.org/abs/quant-ph/0112019v1', updated=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), published=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), title='Classical entanglement', authors=[arxiv.Result.Author('Douglas G. Danforth')], summary='Classical systems can be entangled. Entanglement is defined by coincidence\ncorrelations. Quantum entanglement experiments can be mimicked by a mechanical\nsystem with a single conserved variable and 77.8% conditional efficiency.\nExperiments are replicated for four particle entanglement swapping and GHZ\nentanglement.', comment=None, journal_ref=None, doi=None, primary_category='quant-ph', categories=['quant-ph'], links=[arxiv.Result.Link('http://arxiv.org/abs/quant-ph/0112019v1', title=None, rel='alternate', content_type=None), arxiv.Result.Link('http://arxiv.org/pdf/quant-ph/0112019v1', title='pdf', rel='related', content_type=None)])]
But the short ID reported by this client library is incorrect:
>>> r = next(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
>>> r.entry_id
'http://arxiv.org/abs/quant-ph/0112019v1'
Instead of just taking the last path element here, I should be taking the full contents of the path following http://arxiv.org/abs/.
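A minimal sketch of that idea, assuming `entry_id` always contains `arxiv.org/abs/`. The function name `short_id_from_entry_id` is hypothetical and this is not necessarily how the patch implements it.

```python
def short_id_from_entry_id(entry_id):
    """Return everything after 'arxiv.org/abs/', keeping any archive prefix."""
    marker = "arxiv.org/abs/"
    _, found, tail = entry_id.partition(marker)
    if not found:
        raise ValueError("not an arXiv abstract URL: " + repr(entry_id))
    return tail


old_style = "http://arxiv.org/abs/quant-ph/0112019v1"
print(old_style.split("/")[-1])           # last path element only: archive is lost
print(short_id_from_entry_id(old_style))  # archive is preserved
```

For new-style IDs, which contain no slash, both approaches give the same result.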
@sidphbot if you're working from hardcoded IDs, adding the archives should solve this issue for you.
If you're re-querying incorrect IDs returned by this client library, I'll have a patch out shortly.
Hi, thank you for looking into it. Yes, unfortunately I am re-querying after stripping the ID from the result.entry_id field. Thanks for the information; I will try parsing the ID as you mentioned. I will also gladly wait for the patch 😇
@sidphbot patch is included in 1.4.0.
Error:
Code for parsing the ID from the arxiv result object:
id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]
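For old-style papers this one-liner keeps only the last path segment, so the archive prefix is dropped. A possible variant that keeps the archive and strips only a trailing version suffix is sketched below; `versionless_id` is a hypothetical name, and it assumes `entry_id` has the form `http://arxiv.org/abs/<id>`.

```python
import re
from urllib.parse import urlparse


def versionless_id(entry_id):
    """Return the ID after '/abs/' with any trailing 'vN' removed."""
    path = urlparse(entry_id).path  # e.g. '/abs/quant-ph/0112019v1'
    prefix = "/abs/"
    if not path.startswith(prefix):
        raise ValueError("unexpected entry_id: " + repr(entry_id))
    # Anchor the version pattern at the end so a 'v' inside an archive
    # name (for example 'solv-int') is left alone.
    return re.sub(r"v\d+$", "", path[len(prefix):])


print(versionless_id("http://arxiv.org/abs/quant-ph/0112019v1"))
print(versionless_id("http://arxiv.org/abs/2107.05580v3"))
```

The result can be passed back to `id_list` for both old-style and new-style papers.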
Code to reproduce:
Invalid IDs include '0112019', '0205137', etc. The respective PDF URLs are still accessible, for example: https://arxiv.org/pdf/quant-ph/0112019.pdf
The same error is referenced in another open issue, but from the perspective of huge ID arrays (issue #15).
Apologies if I simply lack sufficient knowledge about identifier naming conventions, but it should download from all research fields, right?