lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive #74

Closed sidphbot closed 3 years ago

sidphbot commented 3 years ago

Error:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in results(self, search)
    552             ))
    553             page_url = self._format_url(search, offset, page_size)
--> 554             feed = self._parse_feed(page_url, first_page)
    555             if first_page:
    556                 # NOTE: this is an ugly fix for a known bug. The totalresults

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in _parse_feed(self, url, first_page)
    635         # Feed was never returned in self.num_retries tries. Raise the last
    636         # exception encountered.
--> 637         raise err
    638 
    639 

HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

Code for parsing the ID from an arxiv result object: `id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]`

Code to reproduce:

ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109', '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630', '0503148', '1111.6170', '1612.04479', '0307110', '0306127', '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']

papers = arxiv.Search(id_list=ids).get()

The IDs reported as invalid are '0112019', '0205137', etc.

The respective PDF URLs are still accessible, for example: https://arxiv.org/pdf/quant-ph/0112019.pdf

The same error is referenced in another open issue (#15), but from the perspective of huge ID arrays.

Apologies if I simply lack sufficient knowledge about identifier naming conventions, but it should download from all research fields, right?

lukasschwab commented 3 years ago

Huh, interesting. Confirmed I can reproduce this with a minimal case:

>>> list(arxiv.Search(id_list=['0112019']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

And that this isn't caused by stripping the version indicators from the IDs:

>>> list(arxiv.Search(id_list=['0112019v1']).results())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 554, in results
    feed = self._parse_feed(page_url, first_page)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 637, in _parse_feed
    raise err
arxiv.arxiv.HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

I'm pretty confident this is a bug in the underlying API. The client passes these IDs (which we know are valid: they're listed on arxiv.org) directly to the arXiv API. It doesn't do any preprocessing besides comma-separating them.

I reproduced this issue in the browser by generating the query URL:

>>> s = arxiv.Search(id_list=['0112019v1'])
>>> arxiv.Client()._format_url(s, 0, 10)
'http://export.arxiv.org/api/query?search_query=&id_list=0112019v1&sortBy=relevance&sortOrder=descending&start=0&max_results=10'

The API gives a 400 response, but with a non-empty feed body:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D0112019v1%26start%3D0%26max_results%3D10%26sortBy%3Drelevance%26sortOrder%3Ddescending" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=&amp;id_list=0112019v1&amp;start=0&amp;max_results=10&amp;sortBy=relevance&amp;sortOrder=descending</title>
  <id>http://arxiv.org/api/ICCqNwWyrQkMAErZidA/EoTr7/o</id>
  <updated>2021-07-12T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
    <updated>2021-07-12T00:00:00-04:00</updated>
    <link href="http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1" rel="alternate" type="text/html"/>
    <author>
      <name>arXiv api core</name>
    </author>
  </entry>
</feed>

At least this confirms the issue: the API believes the ID (0112019v1) is of an incorrect format. My hunch is that this was an early ID format (0112019 and 0205137 are papers from 2001 and 2002, respectively) that the API just doesn't support.
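Since the 400 response still carries a parseable Atom feed, the error text can be pulled out programmatically. The following is a minimal sketch using only the standard library, not how the client library handles errors; `feed_errors` is a hypothetical helper, and `FEED` is an abbreviated copy of the response body above.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by the feed, in ElementTree's "{uri}tag" form.
ATOM = "{http://www.w3.org/2005/Atom}"

# Abbreviated copy of the 400 response body shown above.
FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/api/errors#incorrect_id_format_for_0112019v1</id>
    <title>Error</title>
    <summary>incorrect id format for 0112019v1</summary>
  </entry>
</feed>"""


def feed_errors(feed_xml):
    """Hypothetical helper: return the summary of every entry titled "Error"."""
    # Parse from bytes so the XML encoding declaration is honored.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    errors = []
    for entry in root.iter(ATOM + "entry"):
        if entry.findtext(ATOM + "title") == "Error":
            errors.append(entry.findtext(ATOM + "summary"))
    return errors


print(feed_errors(FEED))
```

Surfacing this summary alongside the HTTP status would make the "incorrect id format" cause visible without opening the query URL in a browser.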

I'll shoot a message to the Google Group. Unfortunately, this doesn't seem like an issue a client library can fix ☹️

lukasschwab commented 3 years ago

These IDs use an old (pre-March 2007) identifier format. The structure of that old identifier and the motivation for the 2007 change are described here:

All existing articles retain their original identifiers but newly announced articles have identifiers following the new scheme.

lukasschwab commented 3 years ago

Aha! It looks like old-form IDs can be requested; they just need to be fully-qualified with the archive and (where applicable) subject class.

Explanation

The old-form arXiv ID is a combination of a subject component, a date component, and a counter component.

![Diagram breaking down the old-form arXiv ID into its components](https://arxiv.org/icons/arxiv_identifier.png)

`0112019` is the `019`th paper submitted in the `12`th month of 20`01`... but, because the counts are archive-specific, the numeric component isn't unique. There is a `0112019` in quantum physics, but there may also be a `0112019` in astrophysics and a `0112019` in math.

This old format only uniquely identifies a paper if we specify *which archive's count* it refers to. In this case, we want `quant-ph`.
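To make the three components concrete, here is a rough sketch that splits a fully-qualified old-form ID with a regular expression. The pattern is my own approximation of the scheme described above (lowercase archive, optional two-letter subject class, `YYMM`, three-digit counter, optional version), not something the arXiv API or this library exposes; `parse_old_id` is a hypothetical name.

```python
import re

# Assumed shape: archive[.SUBJECT_CLASS]/YYMMNNN[vN]
OLD_ID = re.compile(
    r"^(?P<archive>[a-z-]+)(?:\.(?P<subject_class>[A-Z]{2}))?/"
    r"(?P<yy>\d{2})(?P<mm>\d{2})(?P<number>\d{3})"
    r"(?:v(?P<version>\d+))?$"
)


def parse_old_id(identifier):
    """Hypothetical helper: return the ID's components, or None if unqualified."""
    match = OLD_ID.match(identifier)
    return match.groupdict() if match else None


print(parse_old_id("quant-ph/0112019v1"))
print(parse_old_id("0112019"))  # No archive, so it cannot be resolved.
```

A bare `0112019` fails to match because the archive is missing, which mirrors why the API rejects it.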

The fully-qualified ID for 0112019 is quant-ph/0112019. Accordingly, the following code works:

>>> import arxiv
>>> list(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
[arxiv.Result(entry_id='http://arxiv.org/abs/quant-ph/0112019v1', updated=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), published=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), title='Classical entanglement', authors=[arxiv.Result.Author('Douglas G. Danforth')], summary='Classical systems can be entangled. Entanglement is defined by coincidence\ncorrelations. Quantum entanglement experiments can be mimicked by a mechanical\nsystem with a single conserved variable and 77.8% conditional efficiency.\nExperiments are replicated for four particle entanglement swapping and GHZ\nentanglement.', comment=None, journal_ref=None, doi=None, primary_category='quant-ph', categories=['quant-ph'], links=[arxiv.Result.Link('http://arxiv.org/abs/quant-ph/0112019v1', title=None, rel='alternate', content_type=None), arxiv.Result.Link('http://arxiv.org/pdf/quant-ph/0112019v1', title='pdf', rel='related', content_type=None)])]

But the short ID reported by this client library is incorrect:

>>> r = next(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
>>> r.entry_id
'http://arxiv.org/abs/quant-ph/0112019v1'

Instead of just taking the last path element here, I should be taking the full contents of the path following http://arxiv.org/abs/:

https://github.com/lukasschwab/arxiv.py/blob/ea93efa9f369da995f657856447f4ad998f9076f/arxiv/arxiv.py#L169-L176
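The difference between the two approaches can be sketched as follows. These are standalone illustrative functions with hypothetical names, not the library's actual code or the eventual patch. The `strip_version` helper is an extra assumption of mine: splitting on `'v'` would also cut into archive names that contain that letter, so it removes only a trailing `vN` suffix.

```python
import re


def short_id_last_segment(entry_id):
    """Buggy approach: keep only the last path element."""
    return entry_id.split("/")[-1]


def short_id_after_abs(entry_id):
    """Sketched fix: keep everything after 'arxiv.org/abs/'."""
    return entry_id.split("arxiv.org/abs/")[-1]


def strip_version(short_id):
    """Drop a trailing version suffix such as 'v1', and nothing else."""
    return re.sub(r"v\d+$", "", short_id)


old_style = "http://arxiv.org/abs/quant-ph/0112019v1"
new_style = "http://arxiv.org/abs/1911.10854v1"

print(short_id_last_segment(old_style))  # archive is lost
print(short_id_after_abs(old_style))     # archive is kept
print(short_id_after_abs(new_style))     # new-style IDs are unaffected
print(strip_version(short_id_after_abs(old_style)))
```

For new-style IDs both functions agree, so only pre-March 2007 identifiers change behavior.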

@sidphbot if you're working from hardcoded IDs, adding the archives should solve this issue for you.

If you're re-querying incorrect IDs returned by this client library, I'll have a patch out shortly.

sidphbot commented 3 years ago

Hi, thank you for looking into it. Yes, unfortunately I am re-querying after stripping the ID from the `result.entry_id` field. Thanks for the information; I will try parsing the ID as you mentioned. Also, I will gladly wait for the patch 😇

lukasschwab commented 3 years ago

@sidphbot patch is included in 1.4.0.