lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License
Investigating: arXiv API flakiness #129

Closed lukasschwab closed 11 months ago

lukasschwab commented 11 months ago

Description

The arXiv API seems to be degraded. I expect to see more bug reports about this until the underlying issue is resolved.

Behavior identified in #43 seems to have intensified or changed in character (e.g. increased clustering, such that retries are more likely to re-fail, perhaps because of cached bad responses).

Why can't you fix the API? I'm not affiliated with arXiv; I maintain a wrapper library for an API I don't administer. I've written to the arxiv-api Google Group about this issue.

Why aren't you merging bug fixes? Some of the proposed changes here (e.g. consolidating on HTTPS, pinning a specific feedparser version) are probably good changes regardless of the API's stability. I'm hesitant to rush merging and releasing changes without having a strong sense, through integration tests, that they don't damage this library's behavior. That judgment is subject to change, especially if this issue persists.

Steps to reproduce

Versions

Additional context

PRs directly addressing the instability:

lukasschwab commented 11 months ago

Good example of flakiness between identical versions/protocols: https://github.com/lukasschwab/arxiv.py/pull/132#issuecomment-1763650934

liyucheng09 commented 11 months ago

Good diagnosis for this issue. I guess there is not too much we can do unless they fix the backend.

liyucheng09 commented 11 months ago

BTW I found that arXiv treats requests from programmatic clients and real browsers differently. I suspect this flakiness is on purpose.

lukasschwab commented 11 months ago

@liyucheng09 can you share any details on that investigation? In #127 I tried tweaking the user-agent.

liyucheng09 commented 11 months ago

I made about 300 attempts per hour today, more than 3000 in total. 0 out of 3000 succeeded. After passing a user-agent to feedparser, 28 out of 100 succeeded. I suppose we could safely say arXiv is declining requests from programmatic clients.
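A measurement like the one above can be made repeatable with a small tally helper. This is only a sketch: `success_rate` and its `fetch` callable are hypothetical names, not part of arxiv.py or feedparser, and "success" is assumed to mean "the response contained at least one entry".

```python
from typing import Callable


def success_rate(fetch: Callable[[], int], attempts: int) -> float:
    """Return the fraction of attempts in which fetch() reported >= 1 entry.

    `fetch` is a hypothetical callable that performs one API request and
    returns the number of entries in the parsed response.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for _ in range(attempts):
        try:
            if fetch() > 0:
                successes += 1
        except OSError:
            # Connection resets and similar network errors count as failures.
            pass
    return successes / attempts
```

Running it once with the default user-agent and once with a custom one gives two comparable rates. When pointing it at the real API, space the requests out rather than issuing them in a tight loop.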

Ar4ikov commented 11 months ago

Hello! feedparser, which the arxiv library depends on, contains the following. I really can't describe my emotions when I saw it for the first time (feedparser/__init__.py):

USER_AGENT = "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__

Does the developer find this funny?
Instead of using an application that just works, I had to cp -r /path/to/site-packages/feedparser /path/to/my-project-dir/, change USER_AGENT to my real one, and finally the arXiv API works 100 times out of 100.

It would be much, MUCH better if feedparser used something like this:

from os import environ

# <...>
# Thanks for the joke: I threw away two days getting my project (langchain with ArxivLoader) to run.
USER_AGENT = environ.get('PYTHON_FEEDPARSER_USER_AGENT', "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__)

lukasschwab commented 11 months ago

@Ar4ikov I believe all currently-released versions of feedparser support specifying the User-Agent header through a named parameter (agent) to feedparser.parse, but — to your point — this package neither overrides the default nor exposes a way to set it.

I think the most robust change is to make the HTTP calls from arxiv (e.g. with requests), then pass the body to feedparser for parsing.
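A minimal sketch of that split, fetching the body separately and handing it to feedparser only for parsing. It uses the standard library's `urllib.request` in place of requests so it has one fewer dependency. The function names, the endpoint constant, and the user-agent string are all assumptions for illustration, not this package's implementation.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for the arXiv query API; adjust if yours differs.
ARXIV_QUERY_URL = "https://export.arxiv.org/api/query"


def build_request(search_query, max_results=10,
                  user_agent="my-app/0.1 (mailto:me@example.com)"):
    """Build a GET request carrying an explicit User-Agent header.

    The default user_agent is a placeholder; supply your own.
    """
    params = urllib.parse.urlencode({
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
    })
    return urllib.request.Request(
        ARXIV_QUERY_URL + "?" + params,
        headers={"User-Agent": user_agent},
    )


def fetch_feed(request, timeout=30):
    """Perform the HTTP call ourselves, then let feedparser parse the body."""
    import feedparser  # third-party; imported here so build_request has no dependency on it

    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    # feedparser.parse accepts raw feed content as well as URLs.
    return feedparser.parse(body)
```

Keeping the HTTP step separate means status codes, headers, and timeouts can be inspected or retried before any parsing happens, which is harder when feedparser makes the request itself.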

Nonetheless, my testing hasn't shown that updating the user agent makes the tests pass 100% of the time. Still searching, but I'll investigate this angle more.

Update: I published the major version release.

If you find any issues with the new version unrelated to the API instability, please open separate issues for those! I rolled this release in a hurry.

lukasschwab commented 11 months ago

The API seems much more stable now than it was over the weekend. CI is consistently succeeding locally.

I'm going to close this issue for the time being. I'll reopen it in the future if I see similar instability (increased rate of unexpectedly empty first pages, ConnectionReset errors).
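For callers who hit these symptoms in the meantime, one workaround is to retry when the first page comes back empty or the connection is reset. The sketch below assumes a hypothetical `fetch_page` callable that returns a list of entries; the retry count and the three-second delay are illustrative defaults, not values taken from this library.

```python
import time


def fetch_first_page_with_retries(fetch_page, max_retries=3,
                                  delay_seconds=3.0, sleep=time.sleep):
    """Call fetch_page() until it returns a non-empty list of entries.

    Retries on an empty page or a ConnectionResetError, sleeping between
    attempts. Raises RuntimeError once all attempts are used up.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        if attempt > 0:
            sleep(delay_seconds)  # back off before every retry
        try:
            entries = fetch_page()
        except ConnectionResetError as err:
            last_error = err
            continue
        if entries:
            return entries
    raise RuntimeError(
        "no non-empty page after %d attempts" % (max_retries + 1)
    ) from last_error
```

An empty first page can also be a legitimate result for a query that matches nothing, so this only makes sense for queries known to have results. Keep `max_retries` small and the delay generous so that retries do not add load to the API.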

jaypantone commented 6 months ago

I know this is closed, but I just wanted to add that over the last week or two I have started to experience this issue. The API calls occasionally return empty results erroneously.

lukasschwab commented 6 months ago

@jaypantone yeah, lots of inbound issues about this. I don't work for arXiv, so I can't effect a change there directly.

Don't overload them with requests, but you might consider describing your issue on the arXiv mailing list:

I've pinned this issue in the hopes that more people find it rather than creating new ones.