pybliometrics-dev / pybliometrics

Python-based API-Wrapper to access Scopus
https://pybliometrics.readthedocs.io/en/stable/

Feature Request: Retry on Scopus 500 and possibly 502 instead of error #249

Closed mark-todd closed 10 months ago

mark-todd commented 2 years ago

pybliometrics version: 3.2.0

Affected classes: Base, Retrieval, Search (All superclasses)

Occasionally Scopus' internal servers raise an "Internal Server Exception" (status code 500). This occurs seemingly at random, and pybliometrics currently handles it quite well by raising its "Scopus500Error". However, this can be frustrating when performing a large search (on the order of 1,000 requests), since the error resets all progress and forces you to start the requests again (as nothing has been cached). Worse still, this means the weekly allowance of 20,000 results is depleted very rapidly. I understand that, due to the cursor system, performing large searches in sections is impossible (https://dev.elsevier.com/api_key_settings.html - 5,000 item total result limit without 'cursor' pagination).

Would it be possible, when using cursors, to retry this particular GET request after some delay when this error is received? I appreciate this somewhat breaks the idea that pybliometrics only performs the number of requests that "get_no_requests" returned, but it seems like the best solution I can see. The odds are it was only one bad request in a series of 1,000 good ones, so I don't think it would increase the number of requests performed very much.

I'd be interested to hear any thoughts on this, or potentially other solutions.

I haven't received a 502 error, but it seems very similar, so I imagine the same logic would apply.
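To make the proposal concrete, a retry with exponential backoff could look roughly like the sketch below. This is a minimal illustration, not pybliometrics code: the `Scopus500Error` stand-in and the `retry_on_5xx` helper are assumptions introduced here for demonstration.

```python
import time


class Scopus500Error(Exception):
    """Stand-in for pybliometrics' Scopus500Error (hypothetical, for illustration)."""


def retry_on_5xx(func, max_retries=3, base_delay=2.0):
    """Call func(), retrying with exponential backoff when a 5xx error is raised.

    Waits base_delay * 2**attempt seconds between attempts (2s, 4s, 8s, ...),
    and re-raises the error once max_retries is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Scopus500Error:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Wrapping only the single failing GET like this would mean one bad request in a run of 1,000 costs a few seconds of waiting instead of the whole cursor-based search.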

Michael-E-Rose commented 2 years ago

I got to talk to one of Scopus' developers. The 5xx errors are totally out of their control, even beyond the scope of the particular API. It's unclear what triggers them. Most importantly, it's unclear how long one should wait in case of a 5xx error, or how often one should wait if it fails repeatedly.

As some changes to the API are under way, Scopus might want to deal with 5xx errors.

In the meantime I'd advise performing queries with smaller result sets. I always try to perform multiple very granular queries. This also increases the odds that one can reuse them.
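One common way to make queries granular is to slice a broad query by publication year, so that each slice stays well under the result limit and a 5xx error only costs one slice. A minimal sketch (the helper name is an assumption; `PUBYEAR IS` is standard Scopus Search syntax):

```python
def split_by_year(base_query, first_year, last_year):
    """Split a broad Scopus query into one sub-query per publication year.

    Each sub-query can then be run separately (e.g. via ScopusSearch),
    so a server error only forces re-running that one slice.
    """
    return [f"{base_query} AND PUBYEAR IS {year}"
            for year in range(first_year, last_year + 1)]
```

Each resulting string can be passed to a search class on its own; cached slices that already succeeded need not be repeated.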

mark-todd commented 2 years ago

Interesting - thanks for checking this out! Is this because the 5xx errors are raised at a level below the Scopus API? I suppose one thing that wouldn't be out of Scopus' control would be allowing more than 5,000 results in one Scopus Search query. Then the start and end point in the results output could be specified, and in case of a 5xx error only that one section would need to be re-retrieved. Or is this another quirk of third-party technology Scopus uses, rather than something they build themselves? It would be interesting to know what technology they use - it's possible this is an issue for other projects.

Michael-E-Rose commented 2 years ago

I wish I knew more, but the Scopus documentation isn't very transparent on that.

Two or three years ago they introduced pagination in the Scopus Search API, which essentially allowed for queries of unlimited length. Ever since, I have been hoping they would extend this to the other search classes.

Michael-E-Rose commented 10 months ago

In 14208248e33bc8b8c9a23a32830bdaecc85428ee we introduced requests.Session() to handle automatic re-connections upon specific errors. The 5xx errors are such an example. I guess this solves the issue. In the configuration file, you can set how many times pybliometrics should attempt to re-connect.
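For readers curious about the underlying mechanism: `requests` supports automatic retries on specific status codes via `urllib3`'s `Retry` mounted on a session adapter. The sketch below shows that general pattern (the function name and parameter values are illustrative, not pybliometrics' actual implementation or config keys):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(retries=5, backoff=0.5):
    """Build a requests.Session that transparently retries GETs on 5xx responses."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                    # sleeps grow between attempts
        status_forcelist=[500, 502, 503, 504],     # retry on these status codes
        allowed_methods=["GET"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

With such a session, a transient 500 or 502 from the server is retried inside the library, so the caller never sees the error unless all attempts fail.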