Closed lukasschwab closed 11 months ago
Good example of flakiness between identical versions/protocols: https://github.com/lukasschwab/arxiv.py/pull/132#issuecomment-1763650934
Good diagnosis for this issue. I guess there is not too much we can do unless they fix the backend.
BTW I found that arXiv treats requests from programmatic clients differently from requests from real browsers. I suspect this flakiness is on purpose.
@liyucheng09 can you share any details on that investigation? In #127 I tried tweaking the user-agent.
I made about 300 attempts per hour today, more than 3000 in total, and 0 out of 3000 succeeded. After passing a user-agent to feedparser, 28 out of 100 succeeded. I think we can safely say arXiv is declining requests from programmatic clients.
Hello! feedparser, which the arxiv lib needs in order to work, contains this... I really can't describe my emotions when I saw it for the first time (feedparser/__init__.py):
USER_AGENT = "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__
Does the developer find this funny?
Instead of using a normally working application, I need to cp -r /path/to/site-packages/feedparser /path/to/my-project-dir/, change USER_AGENT to my real one, and finally the arXiv API works 100 times out of 100.
It would be much, MUCH better if feedparser used something like this:
from os import environ
# <...>
USER_AGENT = environ.get('PYTHON_FEEDPARSER_USER_AGENT', "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__) # thanks for the joke: I threw away 2 days getting my project, which uses langchain and ArXiVLoader, to run
@Ar4ikov I believe all currently-released versions of feedparser support specifying the User-Agent header through a named parameter (agent) to feedparser.parse, but, to your point, this package neither overrides the default nor exposes a way to set it.
I think the most robust change is to make the HTTP calls from arxiv (e.g. with requests), then pass the body to feedparser for parsing.
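A rough sketch of that split, to make the idea concrete. This is not the code that shipped in the release. The function name fetch_feed and its parameters are hypothetical, and the http_get and parse hooks are an assumption added only so the function can be exercised without network access. By default it uses requests.get and feedparser.parse.

```python
def fetch_feed(url, user_agent, *, timeout=10.0, http_get=None, parse=None):
    """Fetch a feed over HTTP ourselves, then hand the raw body to the parser.

    http_get and parse are hypothetical injection points (assumptions for
    this sketch); they default to requests.get and feedparser.parse.
    """
    if http_get is None:
        import requests  # third-party; imported lazily so fakes can be injected
        http_get = requests.get
    if parse is None:
        import feedparser  # third-party
        parse = feedparser.parse

    # The caller controls the User-Agent instead of relying on the parser's default.
    response = http_get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Surface HTTP-level failures explicitly rather than parsing an error page.
    response.raise_for_status()
    # feedparser.parse accepts the raw document as well as a URL.
    return parse(response.content)
```

Doing the HTTP call separately means status codes, headers, and connection errors are visible to the wrapper before any parsing happens, which should make the flaky responses easier to diagnose.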
Nonetheless, my testing hasn't shown that updating the user agent makes the tests pass 100% of the time. Still searching, but I'll investigate this angle more.
Update: I published the major version release.
If you find any issues with the new version unrelated to the API instability, please open separate issues for those! I rolled this release in a hurry.
The API seems much more stable now than it was over the weekend. CI is consistently succeeding locally.
I'm going to close this issue for the time being. I'll reopen it in the future if I see similar instability (increased rate of unexpectedly empty first pages, ConnectionReset errors).
I know this is closed, but I just wanted to add that over the last week or two I have started to experience this issue. The API calls occasionally return empty results erroneously.
@jaypantone yeah, lots of inbound issues about this. I don't work for arXiv, so I can't effect a change there directly.
Don't overload them with requests, but you might consider describing your issue on the arXiv mailing list:
I've pinned this issue in the hopes that more people find it rather than creating new ones.
Description
The arXiv API seems to be degraded. I expect to see more bug reports about this until the underlying issue is resolved.
Behavior identified in #43 seems to have intensified or changed in character (e.g. increased clustering, such that retries are more likely to re-fail, perhaps because of cached bad responses).
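For anyone working around the empty pages on the client side, here is a minimal retry sketch with increasing delays. It is an illustration under stated assumptions, not this library's retry logic: fetch_page is a hypothetical zero-argument callable that returns a list of entries, and the default attempt count and delay are arbitrary. If bad responses really are cached upstream, short retries may keep failing, which is why the delay grows between attempts.

```python
import time


def fetch_with_retries(fetch_page, *, max_attempts=3, base_delay=3.0, sleep=time.sleep):
    """Call fetch_page() until it returns a non-empty list or attempts run out.

    fetch_page is a hypothetical callable (an assumption of this sketch).
    Returns the first non-empty result, or an empty list if every attempt
    came back empty.
    """
    for attempt in range(max_attempts):
        entries = fetch_page()
        if entries:
            return entries
        # Wait longer after each empty page; skip the wait after the last attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return []
```

Note that this cannot tell an erroneously empty page from a query that genuinely has no results, so a legitimately empty result costs the full set of retries.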
Why can't you fix the API? : I'm not affiliated with arXiv — I maintain a wrapper library for an API I don't administer. I've written the arxiv-api Google Group about this issue.
Why aren't you merging bug fixes? : Some of the proposed changes here (e.g. consolidating on HTTPS, pinning a specific feedparser version, etc.) are probably good changes regardless of the API's stability. I'm hesitant to rush merging and releasing changes without having a strong sense, through integration tests, that they don't damage this library's behavior. That judgment is subject to change, esp. if this issue persists.
Steps to reproduce
Versions
python version: independent.
arxiv.py version: 1.*.*.
Additional context
PRs directly addressing the instability:
#127
#128