lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License

AttributeError: nonexistent IDs in `id_list`s yield invalid entries #80

Closed lukasschwab closed 3 years ago

lukasschwab commented 3 years ago

Description


When a specified ID doesn't correspond to an arXiv paper, the results feed includes an entry element missing expected fields (id).

The response status is 200, so the request itself does not fail. The entry parsed by feedparser simply has no id, and the error handling in this package tries to access that missing field, yielding a raw AttributeError.

Steps to reproduce


Example API feed: http://export.arxiv.org/api/query?id_list=2208.05394

>>> import arxiv
>>> pub = next(arxiv.Search(id_list=["2208.05394"]).get())
Traceback (most recent call last):
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 156, in __getattr__
    return self.__getitem__(key)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 113, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 586, in results
    yield Result._from_feed_entry(entry)
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/arxiv/arxiv.py", line 122, in _from_feed_entry
    entry.id
  File "/Users/lukas/.pyenv/versions/3.7.9/lib/python3.7/site-packages/feedparser/util.py", line 158, in __getattr__
    raise AttributeError("object has no attribute '%s'" % key)
AttributeError: object has no attribute 'id'

Expected behavior


This package's error handling should surface a clear error that callers can catch and handle.

Versions

lukasschwab commented 3 years ago

A design problem: the feeds for id_list-only queries are ordinal matches for the IDs in the id_list. If you want to see if the nth ID exists in arXiv, check if the nth entry in the feed is well-formed or empty. See, for example, this feed.
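
To make the ordinal relationship concrete, here is a minimal sketch. It uses plain dicts in place of parsed feed entries; the ID 1234.56789, the entry contents, and the helper name existence_by_position are all made up for illustration and are not part of this package.

```python
# Hypothetical stand-ins for parsed feed entries: plain dicts, not feedparser objects.
id_list = ["1234.56789", "2208.05394"]
entries = [
    {"id": "http://arxiv.org/abs/1234.56789v1", "title": "Some paper"},
    {"title": "placeholder"},  # assumed shape of a partial entry: no "id" key
]


def existence_by_position(id_list, entries):
    """Map each requested ID to whether the entry at the same position has an 'id'."""
    return {requested: "id" in entry for requested, entry in zip(id_list, entries)}


exists = existence_by_position(id_list, entries)
print(exists)
```

This only works as long as the feed keeps one entry per requested ID, in order.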

Returning None from the generator would preserve this relationship, but forces clients to check whether entries are None when processing them.

Skipping the partial entries breaks the ordinal relationship. There's a work-around: you can still check existence by looking up in the aggregate results.

Since this usage (testing ID existence) seems less likely, I'm inclined to require some dependents to do the latter rather than requiring all projects to do the former.
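
A rough sketch of the skipping option together with the work-around, again with plain dicts standing in for feed entries. The names skip_partial and short_id are hypothetical, the ID 1234.56789 is made up, and short_id assumes new-style arXiv IDs with no slash in them.

```python
import re

# Hypothetical stand-ins for parsed feed entries.
entries = [
    {"id": "http://arxiv.org/abs/1234.56789v1"},
    {"title": "placeholder"},  # assumed partial entry: no "id" key
]


def skip_partial(entries):
    """Yield only entries that carry an 'id'; partial entries are dropped."""
    for entry in entries:
        if "id" in entry:
            yield entry


def short_id(entry_id):
    """Strip the URL prefix and version suffix (assumes new-style IDs)."""
    return re.sub(r"v\d+$", "", entry_id.rsplit("/", 1)[-1])


requested = ["1234.56789", "2208.05394"]
results = list(skip_partial(entries))

# Work-around: test existence by looking each ID up in the aggregate results.
found = {short_id(entry["id"]) for entry in results}
missing = [i for i in requested if i not in found]
print(missing)
```

The position in results no longer tells you anything, but the lookup recovers the same information.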

If this use case turns out to be common, we can parameterize an invalid-entry handler in the Client options, e.g. lambda entry: None, to override the skipping.
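
One possible shape for such a handler, as a sketch only. The on_invalid parameter and this standalone results function are hypothetical; nothing like them exists in the package today, and the real option would presumably live on the Client.

```python
# Hypothetical stand-ins for parsed feed entries.
entries = [
    {"id": "http://arxiv.org/abs/1234.56789v1"},  # made-up ID
    {"title": "placeholder"},  # assumed partial entry: no "id" key
]


def results(entries, on_invalid=None):
    """Yield well-formed entries.

    Partial entries are skipped by default. If on_invalid is given, its
    return value is yielded in place of each partial entry instead.
    """
    for entry in entries:
        if "id" in entry:
            yield entry
        elif on_invalid is not None:
            yield on_invalid(entry)


skipped = list(results(entries))
placeholders = list(results(entries, on_invalid=lambda entry: None))
print(len(skipped), len(placeholders))
```

With lambda entry: None the output keeps one item per requested ID, so the ordinal relationship survives for clients that opt in.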

lukasschwab commented 3 years ago

Another risk with skipping partial results: doing so may confuse a dependent's length-checking pagination logic.
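
A toy illustration of how that could go wrong, assuming a dependent that treats a short page as the last page. FEED, PAGE_SIZE, page, and collect_all are all hypothetical; this is not how this package or any particular dependent paginates.

```python
# Hypothetical feed of five entries; the second one is partial (no "id").
FEED = [{"id": "a"}, {}, {"id": "c"}, {"id": "d"}, {"id": "e"}]
PAGE_SIZE = 2


def page(start):
    """Return one page of results with partial entries skipped."""
    return [entry for entry in FEED[start:start + PAGE_SIZE] if "id" in entry]


def collect_all():
    """Naive dependent: keeps fetching until a page comes back short."""
    collected, start = [], 0
    while True:
        batch = page(start)
        collected.extend(batch)
        if len(batch) < PAGE_SIZE:  # assumes a short page means "no more results"
            break
        start += PAGE_SIZE
    return collected


collected = collect_all()
valid_total = sum("id" in entry for entry in FEED)
print(len(collected), "of", valid_total)
```

Here the first page comes back short because of the skipped entry, so the loop stops long before the feed is exhausted.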

lukasschwab commented 3 years ago

Final consideration:

> Skipping the partial entries breaks the ordinal relationship. There's a work-around: you can still check existence by looking up in the aggregate results.
>
> Since this usage (testing ID existence) seems less likely, I'm inclined to require some dependents to do the latter rather than requiring all projects to do the former.

No dependent of this package relies on the ordinal relationship, because any request that would be impacted by this change currently fails. Skipping the results is the least disruptive option.