pybliometrics-dev / pybliometrics

Python-based API-Wrapper to access Scopus
https://pybliometrics.readthedocs.io/en/stable/

IndexError in my script, and Scopus429Error or no error in the console #241

Closed raffaem closed 2 years ago

raffaem commented 2 years ago

In my script I get an IndexError exception when I try to download an author using the AuthorRetrieval interface.

The code is:

AuthorRetrieval("57200753910", refresh=True)

The traceback is:

2022-01-21 00:50:51,495 Exception in downloading author 57200753910.
2022-01-21 00:50:51,498 Exception
2022-01-21 00:50:51,499 Type=<class 'IndexError'>
2022-01-21 00:50:51,501 Class=list index out of range
2022-01-21 00:50:51,503 Traceback:
2022-01-21 00:50:51,507 File "C:\Users\RDPCLI~1\AppData\Local\Temp/ipykernel_948/202054747.py", line 6, in download_auth_id
    auth = AuthorRetrieval(author_id, refresh=True)
2022-01-21 00:50:51,508 File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\author_retrieval.py", line 246, in __init__
    Retrieval.__init__(self, identifier=self._id, api='AuthorRetrieval')
2022-01-21 00:50:51,511 File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\superclasses\retrieval.py", line 48, in __init__
    Base.__init__(self, params=params, url=url, api=api)
2022-01-21 00:50:51,512 File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\superclasses\base.py", line 59, in __init__
    resp = get_content(url, api, params, *args, **kwds)
2022-01-21 00:50:51,514 File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\utils\get_content.py", line 58, in get_content
    header = {'X-ELS-APIKey': KEYS[0],
2022-01-21 00:50:52,186 Waiting 30 secs then retrying

But when I try to reproduce it on the command line, I get a different error, a Scopus429Error:

$ python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pybliometrics
>>> print(pybliometrics.__version__)
3.2.1.dev1
>>> from pybliometrics.scopus import AuthorRetrieval
>>> a = AuthorRetrieval("57200753910", refresh=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\author_retrieval.py", line 246, in __init__
    Retrieval.__init__(self, identifier=self._id, api='AuthorRetrieval')
  File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\superclasses\retrieval.py", line 48, in __init__
    Base.__init__(self, params=params, url=url, api=api)
  File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\superclasses\base.py", line 59, in __init__
    resp = get_content(url, api, params, *args, **kwds)
  File "C:\Users\rdpclient\AppData\Roaming\Python\Python310\site-packages\pybliometrics\scopus\utils\get_content.py", line 97, in get_content
    raise errors[resp.status_code](reason)
pybliometrics.scopus.exception.Scopus429Error

It happened before with another Scopus ID, but I was not able to reproduce it on the command line. I restarted the script, it worked, and it moved on.

The problem is, I don't want the script to stop.
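One way to keep a long-running script from stopping is to wrap each download, catch the exceptions seen above, and skip the item after a few attempts. A minimal sketch, assuming a hypothetical helper download_or_skip; the retry count and the 30-second wait are arbitrary choices, and fetch stands in for a call such as lambda: AuthorRetrieval(author_id, refresh=True):

```python
import logging
import time


def download_or_skip(fetch, retries=3, wait_secs=30,
                     retry_on=(IndexError,), sleep=time.sleep):
    """Call fetch(); on a listed exception wait and retry, then give up with None.

    Hypothetical helper, not part of pybliometrics. In the real script,
    retry_on would likely be (IndexError, Scopus429Error).
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on as exc:
            logging.warning("Attempt %d failed with %s",
                            attempt, type(exc).__name__)
            if attempt < retries:
                sleep(wait_secs)  # back off before the next attempt
    return None  # caller decides what to do with a skipped item
```

If the failures come from exhausted API keys, retrying inside the same process is unlikely to succeed; the wrapper only ensures the loop moves on instead of crashing.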

Michael-E-Rose commented 2 years ago

Can you test this with a different Python version? I don't have Python 3.10 at hand right now.

raffaem commented 2 years ago

Do you think it's due to the Python version?

I think it's due to KEYS being empty.

The script had run for some time and it probably exhausted all the keys?

But it should give a different error.

Anyway, "index out of range" doesn't look specific to Python 3.10.

I don't have an older Python version installed at the moment.

Michael-E-Rose commented 2 years ago

Yes, Py3.10 might be the reason, because this package is not (yet) released for Py3.10: https://img.shields.io/pypi/pyversions/pybliometrics.svg

It might be that your download broke. Besides, what works in Python 3.7 might not work in Python 3.10. Some things do change from version to version.

raffaem commented 2 years ago

Uhm, it's happening again.

Why does the get_content function in utils/get_content.py have a catch for IndexError at line 77:

try:
    KEYS.pop(0)  # Remove current key
    shuffle(KEYS)
    header['X-ELS-APIKey'] = KEYS[0].strip()
    resp = requests.get(url, headers=header, proxies=proxies,
                        params=params)
except IndexError:  # All keys depleted
    break

while at line 58 it just assumes that KEYS is non-empty:

header = {'X-ELS-APIKey': KEYS[0],
          'Accept': 'application/json',
          'User-Agent': user_agent}

?

I will try with 3.9, although I would rather have avoided maintaining different Python versions on the same system.

raffaem commented 2 years ago

Michael, here is the code to reproduce the bug:

import requests
import io
from pybliometrics.scopus.exception import Scopus429Error

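def myget(s, headers, proxies, params):
    # Stand-in for requests.get that always answers 429 (quota exceeded)
    print("Called myget!")
    resp = requests.models.Response()
    # The body must be valid JSON, so use double quotes inside the payload
    resp.raw = io.BytesIO('{"service-error": {"status": {"statusText": "boh"}}}'.encode("utf8"))
    resp.encoding = "utf-8"
    resp.status_code = 429
    return resp

# Monkeypatch requests.get so every request receives the fake 429 response
requests.get = myget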

from pybliometrics.scopus import AuthorRetrieval

for i in range(2):
    try:
        auth = AuthorRetrieval("1234")
    except Scopus429Error:
        print("Detected 429 error. Continue")
        continue

Michael-E-Rose commented 2 years ago

Actually that's how it's supposed to work. resp = requests.get(url, headers=header, proxies=proxies, params=params) raises an IndexError when all keys are depleted, which subsequently breaks the while-loop. Then in https://github.com/pybliometrics-dev/pybliometrics/blob/master/pybliometrics/scopus/utils/get_content.py#L89 it raises the Scopus429Error. That's intentional, because it tells the user (you) that all keys are depleted.

The only problem I see is your very first code snippet, where you only got the IndexError. In your second code snippet (same message) everything is fine. The only explanation I have for this is that your keys were empty to begin with. Otherwise the traceback would include this line https://github.com/pybliometrics-dev/pybliometrics/blob/master/pybliometrics/scopus/utils/get_content.py#L81, which it doesn't.

raffaem commented 2 years ago

> Actually that's how it's supposed to work. resp = requests.get(url, headers=header, proxies=proxies, params=params) raises an IndexError when all keys are depleted,

I was under the impression that requests.get just returns a response code of 429 when all keys are depleted, without raising any exception.

> which subsequently breaks the while-loop. Then in https://github.com/pybliometrics-dev/pybliometrics/blob/master/pybliometrics/scopus/utils/get_content.py#L89 it raises the Scopus429Error. That's intentional, because it tells the user (you) that all keys are depleted.

I understand that

> The only problem I see is your very first code snippet, where you only got the IndexError.

I get the Scopus429Error the first time, and the IndexError the second time.

The problem is that, when the last key is depleted, the line header['X-ELS-APIKey'] = KEYS[0].strip() raises the IndexError, which gets caught by the except IndexError clause, and the error is then raised as a Scopus429Error.

But if at this point you call pybliometrics again, it will expect some elements in KEYS, which aren't there since all of them have been popped.
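The sequence described here can be modelled without touching the network. The sketch below is a simplified stand-in for get_content, not the library's actual code: the key names are made up, every request is assumed to answer 429, and Scopus429Error is a local placeholder class.

```python
class Scopus429Error(Exception):
    """Stand-in for pybliometrics.scopus.exception.Scopus429Error."""


KEYS = ["key-a", "key-b"]  # made-up keys; module-level state shared by all calls


def fake_get_content():
    """Simplified model of get_content in which every request answers 429."""
    header = {"X-ELS-APIKey": KEYS[0]}  # unguarded: IndexError if KEYS is empty
    status = 429
    while status == 429:
        try:
            KEYS.pop(0)  # remove the depleted key
            header["X-ELS-APIKey"] = KEYS[0]  # IndexError once no key is left
            status = 429  # the next key turns out to be depleted as well
        except IndexError:  # all keys depleted
            break
    raise Scopus429Error("quota exceeded")


outcomes = []
for _ in range(2):
    try:
        fake_get_content()
    except (Scopus429Error, IndexError) as exc:
        outcomes.append(type(exc).__name__)

print(outcomes)  # ['Scopus429Error', 'IndexError']
```

The first call empties KEYS and reports the quota error; the second call fails on the very first KEYS[0] lookup, before any request is attempted.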

Michael-E-Rose commented 2 years ago

Ah-ha! So when there are no keys left, and one continues to make calls, then the IndexError kicks in.

If this is correct, then a Scopus429Error would be more informative. However, the result is the same: People have to restart pybliometrics.

raffaem commented 2 years ago

> Ah-ha! So when there are no keys left, and one continues to make calls, then the IndexError kicks in.

Yes, that was what I was trying to say with this code.

> If this is correct, then a Scopus429Error would be more informative.

I thought so too, and proposed a PR to implement that.

> However, the result is the same: People have to restart pybliometrics.

First, they would have to get new API keys. Then, they would have to restart pybliometrics.

Right?

Or do the API keys get replenished after a reasonable amount of time?

I think I read somewhere that they get replenished after one week, if I'm not mistaken.

I was planning to code a scraper that would get new API keys automatically.

But if it is sufficient to just wait a reasonable amount of time and then restart pybliometrics, that won't be necessary.

Michael-E-Rose commented 2 years ago

Then I totally misunderstood you.

But yes, keys reset one week after first usage: https://pybliometrics.readthedocs.io/en/stable/access.html#api-key-quotas-and-429-error

In practice, one rarely hits the quota because with 10 keys you get 200k calls in the Scopus Search API per week.

raffaem commented 2 years ago

> If this is correct, then a Scopus429Error would be more informative. However, the result is the same: People have to restart pybliometrics.

Should I reopen the PR that implements this, if you agree?

Michael-E-Rose commented 2 years ago

Well, if you still need this "problem" solved, can you come up with a fix based on try-except? An if-statement will affect everybody negatively in terms of computational burden, when in reality it affects just 1% of users.

raffaem commented 2 years ago

> Well, if you still need this "problem" solved, can you come up with a fix based on try-except? An if-statement will affect everybody negatively in terms of computational burden, when in reality it affects just 1% of users.

Submitted.
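The submitted change is not shown in this thread. As an illustration only, a try-except based guard around the initial key lookup might look like the sketch below; build_header, its user_agent default, and the error message are hypothetical, and Scopus429Error is a local placeholder for the library's exception.

```python
class Scopus429Error(Exception):
    """Stand-in for pybliometrics.scopus.exception.Scopus429Error."""


def build_header(keys, user_agent="hypothetical-agent"):
    """Build the request header, mapping an empty key list to Scopus429Error."""
    try:
        api_key = keys[0]
    except IndexError:  # only reached when no key is left
        raise Scopus429Error("All API keys are depleted") from None
    return {"X-ELS-APIKey": api_key,
            "Accept": "application/json",
            "User-Agent": user_agent}
```

With a key present, build_header(["abc"]) returns the header as before; with an empty list it raises Scopus429Error instead of IndexError. Using try-except rather than an if-statement keeps the common path free of an explicit check, which is the trade-off requested above.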