pybliometrics-dev / pybliometrics

Python-based API-Wrapper to access Scopus
https://pybliometrics.readthedocs.io/en/stable/

[BUG] Some papers with no citations will raise exception TypeError when calling `cc` on CitationOverview result #205

Closed raffaem closed 2 years ago

raffaem commented 3 years ago

The following MWE:

from pybliometrics.scopus import CitationOverview
res = CitationOverview(identifier=["85098969104"], start=2020)
print(res.cc)

will throw the following exception:

$ python3 bug.py
Traceback (most recent call last):
  File "bug.py", line 3, in <module>
    print(res.cc)
  File "/home/raffaele/.local/lib/python3.9/site-packages/pybliometrics/scopus/abstract_citation.py", line 44, in cc
    cites = [int(d['$']) for d in doc['cc']]
  File "/home/raffaele/.local/lib/python3.9/site-packages/pybliometrics/scopus/abstract_citation.py", line 44, in <listcomp>
    cites = [int(d['$']) for d in doc['cc']]
TypeError: string indices must be integers
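
For readers unfamiliar with this message: the comprehension in the traceback expects `doc['cc']` to be a list of dicts. The sketch below shows one way the same TypeError can arise, namely when `cc` holds a single dict (or a plain string) instead of a list. Both payloads are hypothetical and only illustrate the mechanism; they are not taken from the actual cache file.

```python
# Hypothetical payloads: the shapes are assumptions for illustration,
# not captured from a real cache file.
well_formed = {"cc": [{"$": "0"}, {"$": "0"}]}  # list of dicts, one per year
malformed = {"cc": {"$": "0"}}                   # a single dict instead of a list


def parse_cc(doc):
    """Same comprehension as the one shown in the traceback."""
    return [int(d["$"]) for d in doc["cc"]]


print(parse_cc(well_formed))

try:
    parse_cc(malformed)
except TypeError as exc:
    # Iterating over a dict yields its keys (here the string "$"),
    # and indexing that string with "$" raises the TypeError.
    print("TypeError:", exc)
```

If the cached response has this kind of shape, the error comes from the stored data rather than from the identifier itself.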
raffaem commented 3 years ago

The bug also happens with papers with one citation.

MWE:

from pybliometrics.scopus import CitationOverview
res = CitationOverview(identifier=["85100910856"], start=2020)
print(res.grandTotal)
print(res.cc)
raffaem commented 3 years ago

It's not true that all papers with no citations will throw this error either.

This paper has no citations and yet it doesn't throw the error:

>>> res = CitationOverview(identifier=["28844466437"], start=2005)
>>> print(res.grandTotal)
0
>>> res.cc
[[(2005, 0), (2006, 0), (2007, 0), (2008, 0), (2009, 0), (2010, 0), (2011, 0), (2012, 0), (2013, 0), (2014, 0), (2015, 0), (2016, 0), (2017, 0), (2018, 0), (2019, 0), (2020, 0), (2021, 0)]]
Michael-E-Rose commented 3 years ago

The first two examples work fine for me. Which pybliometrics version are you using, and which OS are you running?

raffaem commented 2 years ago

I confirm the exception on the MWE of the first post:

$ python3 205_1.py 
Traceback (most recent call last):
  File "205_1.py", line 5, in <module>
    print(res.cc)
  File "/home/raffaele/.local/lib/python3.9/site-packages/pybliometrics/scopus/abstract_citation.py", line 44, in cc
    cites = [int(d['$']) for d in doc['cc']]
  File "/home/raffaele/.local/lib/python3.9/site-packages/pybliometrics/scopus/abstract_citation.py", line 44, in <listcomp>
    cites = [int(d['$']) for d in doc['cc']]
TypeError: string indices must be integers

This is my pybliometrics version:

$ python3
Python 3.9.7 (default, Aug 30 2021, 00:00:00) 
[GCC 11.2.1 20210728 (Red Hat 11.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pybliometrics
>>> print(pybliometrics.__version__)
3.0.1

and I am running on Fedora 34 Workstation.

Michael-E-Rose commented 2 years ago

Still no error on my side.

Could you try to locate the cache file for case 1 and post its content here, please?

raffaem commented 2 years ago

> Still no error on my side.
>
> Could you try to locate the cache file for case 1 and post its content here, please?

How can I find this file?

In ~/.scopus/citation_overview/STANDARD I have several files, all named with what seem to be random alphanumeric strings.

Michael-E-Rose commented 2 years ago

Yes, these are hashed versions of the EIDs and the years used for the retrieval. This prevents many problems with filenames. The "Notes" section of the class's docs tells you more.

Do the following to obtain the file name:

from hashlib import md5
from pybliometrics.scopus import CitationOverview

identifier = ["28844466437"]
citation = None

co = CitationOverview(identifier=identifier, start=2005, citation=citation)

stem = md5("_".join(identifier).encode('utf8')).hexdigest()
if citation:
    stem += "-" + citation
print(stem)

so the filename is 65637bbaf0de11228e62380ee583e744.
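
To go from the stem to the file itself, a small helper can join it with the cache directory. This is only a sketch: the function name `cache_file` and the `view` parameter are hypothetical, and the directory layout is assumed from the `~/.scopus/citation_overview/STANDARD` path mentioned earlier in this thread, so it may differ on other setups.

```python
from hashlib import md5
from pathlib import Path


def cache_file(identifier, citation=None, view="STANDARD"):
    """Build the presumed cache path for a CitationOverview query.

    The stem follows the snippet above; the directory layout is an
    assumption based on the path reported earlier in this thread.
    """
    stem = md5("_".join(identifier).encode("utf8")).hexdigest()
    if citation:
        stem += "-" + citation
    return Path.home() / ".scopus" / "citation_overview" / view / stem


path = cache_file(["85098969104"])
print(path)
if path.exists():
    # Show the beginning of the cached response for a quick look.
    print(path.read_text()[:300])
else:
    print("No cached file at this location")
```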

raffaem commented 2 years ago

Here is the content of that file:

{"abstract-citations-response":{"h-index":"0","identifier-legend":{"identifier":[{"@_fa":"true","dc:identifier":"SCOPUS_ID:28844466437","prism:doi":"10.1103/PhysRevE.72.059902","pii":null,"scopus_id":"28844466437"}]},"citeInfoMatrix":{"citeInfoMatrixXML":{"citationMatrix":{"citeInfo":[{"@_fa":"true","dc:identifier":"SCOPUS_ID:28844466437","prism:url":"https://api.elsevier.com/content/abstract/scopus_id/28844466437","dc:title":"Erratum: Reexamination of the Helfrich-Hurault effect in smectic-A liquid crystals (Physical Review E (2005) 72 (041708))","author":[{"@_fa":"true","initials":"G.","index-name":"Bevilacqua G.","surname":"Bevilacqua","authid":"57190392218","author-url":"https://api.elsevier.com/content/author/author_id/57190392218"},{"@_fa":"true","initials":"G.","index-name":"Napoli G.","surname":"Napoli","authid":"10042600200","author-url":"https://api.elsevier.com/content/author/author_id/10042600200"}],"citationType":{"@code":"er","$":"Erratum"},"sort-year":"2005","prism:publicationName":"Physical Review E - Statistical, Nonlinear, and Soft Matter Physics","prism:volume":"72","prism:issueIdentifier":"5","prism:issn":"1539-3755","pcc":"0","cc":[{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"}],"lcc":"0","rangeCount":"0","rowTotal":"0"}]}}},"citeColumnTotalXML":{"citeCountHeader":{"prevColumnHeading":"previous","columnHeading":[{"$":"2005"},{"$":"2006"},{"$":"2007"},{"$":"2008"},{"$":"2009"},{"$":"2010"},{"$":"2011"},{"$":"2012"},{"$":"2013"},{"$":"2014"},{"$":"2015"},{"$":"2016"},{"$":"2017"},{"$":"2018"},{"$":"2019"},{"$":"2020"},{"$":"2021"}],"laterColumnHeading":"later","prevColumnTotal":"0","columnTotal":[{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"}],"laterColumnTotal":"0","rangeColumnTotal":"0","grandTotal":"0"}}}}
Michael-E-Rose commented 2 years ago

For me it's exactly the same.

The relevant part is "cc":[{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"},{"$":"0"}], and that's all normal.
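
To make "normal" concrete: in a well-formed response, each entry of `cc` lines up with one entry of `columnHeading`, which is how the `(year, count)` tuples seen earlier can be built. The sketch below works on a trimmed, hypothetical response with the same nesting as the cache file above. The function `yearly_counts` is illustrative only and is not the library's implementation.

```python
def yearly_counts(response):
    """Return one list of (year, citations) tuples per document."""
    body = response["abstract-citations-response"]
    header = body["citeColumnTotalXML"]["citeCountHeader"]
    years = [int(d["$"]) for d in header["columnHeading"]]
    docs = body["citeInfoMatrix"]["citeInfoMatrixXML"]["citationMatrix"]["citeInfo"]
    return [list(zip(years, (int(d["$"]) for d in doc["cc"]))) for doc in docs]


# Trimmed, hypothetical response: only the keys used above are kept.
sample = {
    "abstract-citations-response": {
        "citeInfoMatrix": {"citeInfoMatrixXML": {"citationMatrix": {"citeInfo": [
            {"cc": [{"$": "0"}, {"$": "0"}, {"$": "0"}]},
        ]}}},
        "citeColumnTotalXML": {"citeCountHeader": {
            "columnHeading": [{"$": "2005"}, {"$": "2006"}, {"$": "2007"}],
        }},
    }
}

print(yearly_counts(sample))  # [[(2005, 0), (2006, 0), (2007, 0)]]
```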

Could you please execute the following code and report back what's happening?

from pybliometrics.scopus import CitationOverview

co = CitationOverview(identifier=["28844466437"], start=2005)
print(co._citeInfoMatrix)
print(co._citeInfoMatrix[0]["cc"])
print([int(d['$']) for d in co._citeInfoMatrix[0]["cc"]])
print(co.cc)
raffaem commented 2 years ago

Code:

#!/usr/bin/env python3

from pybliometrics.scopus import CitationOverview

co = CitationOverview(identifier=["28844466437"], start=2005)
print(co._citeInfoMatrix)
print(co._citeInfoMatrix[0]["cc"])
print([int(d['$']) for d in co._citeInfoMatrix[0]["cc"]])
print(co.cc)

Result:

$ python3 205_more_info.py 
[{'@_fa': 'true', 'identifier': 'SCOPUS_ID:28844466437', 'url': 'https://api.elsevier.com/content/abstract/scopus_id/28844466437', 'title': 'Erratum: Reexamination of the Helfrich-Hurault effect in smectic-A liquid crystals (Physical Review E (2005) 72 (041708))', 'author': [{'@_fa': 'true', 'initials': 'G.', 'index-name': 'Bevilacqua G.', 'surname': 'Bevilacqua', 'authid': '57190392218', 'author-url': 'https://api.elsevier.com/content/author/author_id/57190392218'}, {'@_fa': 'true', 'initials': 'G.', 'index-name': 'Napoli G.', 'surname': 'Napoli', 'authid': '10042600200', 'author-url': 'https://api.elsevier.com/content/author/author_id/10042600200'}], 'citationType': {'@code': 'er', '$': 'Erratum'}, 'sort-year': '2005', 'publicationName': 'Physical Review E - Statistical, Nonlinear, and Soft Matter Physics', 'volume': '72', 'issueIdentifier': '5', 'issn': '1539-3755', 'pcc': '0', 'cc': [{'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}], 'lcc': '0', 'rangeCount': '0', 'rowTotal': '0'}]
[{'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[[(2005, 0), (2006, 0), (2007, 0), (2008, 0), (2009, 0), (2010, 0), (2011, 0), (2012, 0), (2013, 0), (2014, 0), (2015, 0), (2016, 0), (2017, 0), (2018, 0), (2019, 0), (2020, 0), (2021, 0)]]
Michael-E-Rose commented 2 years ago

Okay, so it all works.

raffaem commented 2 years ago

No, it doesn't work.

The exception is still there.

Code:

#!/usr/bin/env python3

# https://github.com/pybliometrics-dev/pybliometrics/issues/205

from pybliometrics.scopus import CitationOverview
res = CitationOverview(identifier=["85098969104"], start=2020)
print(res.cc)

Result:

$ python3 205_1.py 
Traceback (most recent call last):
  File "/run/media/raffaele/55ab61c4-83cf-4d9f-a5cd-7fcfdc14b4fb/Dropbox (DIG)/Paper_covidworking_and_productivity_RM/5_download_citations/pybliometrics_bugs/205_1.py", line 7, in <module>
    print(res.cc)
  File "/home/raffaele/.local/lib/python3.9/site-packages/pybliometrics/scopus/abstract_citation.py", line 44, in cc
    cites = [int(d['$']) for d in doc['cc']]
  File "/home/raffaele/.local/lib/python3.9/site-packages/pybliometrics/scopus/abstract_citation.py", line 44, in <listcomp>
    cites = [int(d['$']) for d in doc['cc']]
TypeError: string indices must be integers

Thanks

Michael-E-Rose commented 2 years ago

Could you then please print the output of the above snippet with the Scopus ID that's not working?

raffaem commented 2 years ago

The snippet is:

from pybliometrics.scopus import CitationOverview
res = CitationOverview(identifier=["85098969104"], start=2020)
print(res.cc)

The Scopus ID that is not working is 85098969104

Michael-E-Rose commented 2 years ago

What should I do with it? I need the output of the snippet that I posted, run with your ID that's not working.

from pybliometrics.scopus import CitationOverview

co = CitationOverview(identifier=["85098969104"], start=2005, refresh=True)
print(co._citeInfoMatrix)
print(co._citeInfoMatrix[0]["cc"])
print([int(d['$']) for d in co._citeInfoMatrix[0]["cc"]])
print(co.cc)
raffaem commented 2 years ago

Here is the output of the snippet:

$ python3 205_more_info_2.py 
[{'@_fa': 'true', 'identifier': 'SCOPUS_ID:85098969104', 'url': 'https://api.elsevier.com/content/abstract/scopus_id/85098969104', 'title': 'Correction to: The delamination of a growing elastic sheet with adhesion (Meccanica, (2017), 52, 14, (3481-3487), 10.1007/s11012-017-0618-0)', 'author': [{'@_fa': 'true', 'initials': 'G.', 'index-name': 'Napoli G.', 'surname': 'Napoli', 'authid': '10042600200', 'author-url': 'https://api.elsevier.com/content/author/author_id/10042600200'}, {'@_fa': 'true', 'initials': 'S.', 'index-name': 'Turzi S.', 'surname': 'Turzi', 'authid': '14631459100', 'author-url': 'https://api.elsevier.com/content/author/author_id/14631459100'}], 'citationType': {'@code': 'er', '$': 'Erratum'}, 'sort-year': '2021', 'publicationName': 'Meccanica', 'volume': '56', 'issueIdentifier': '1', 'startingPage': '253', 'issn': '0025-6455', 'pcc': '0', 'cc': [{'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}], 'lcc': '0', 'rangeCount': '0', 'rowTotal': '0'}]
[{'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}, {'$': '0'}]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[[(2005, 0), (2006, 0), (2007, 0), (2008, 0), (2009, 0), (2010, 0), (2011, 0), (2012, 0), (2013, 0), (2014, 0), (2015, 0), (2016, 0), (2017, 0), (2018, 0), (2019, 0), (2020, 0), (2021, 0)]]

Now it works BTW:

$ python3 205_1.py 
[[(2020, 0), (2021, 0)]]

Seems to be another bug fixed by refresh=True.

But why was the download damaged in the first place?

Michael-E-Rose commented 2 years ago

I have encountered all kinds of weird errors. Sometimes the download is interrupted; rarely, the API returns a faulty response, etc. Annoying, but rare, and it can be fixed by re-downloading.