scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Publication parser fill list index out of range #268

Closed FranciscoKnebel closed 3 years ago

FranciscoKnebel commented 3 years ago

Code executed:

from scholarly import scholarly

# query is a plain search string, e.g. '"digital twin" "cloud"' (see below)
search_query = scholarly.search_pubs(query, year_low=2020)
r = next(search_query)
result = scholarly.fill(r)

Python complains while executing line 345 of scholarly/publication_parser.py with the following message: IndexError: list index out of range

Probably due to undefined values producing an empty list? Either way, the search query breaks because of this.

Traceback below:

Traceback (most recent call last):
  File "scholar.py", line 104, in <module>
    main()
  File "scholar.py", line 58, in main
    result = scholarly.fill(r)
  File "/home/fpk/.local/lib/python3.8/site-packages/scholarly/_scholarly.py", line 207, in fill
    object = publication_parser.fill(object)
  File "/home/fpk/.local/lib/python3.8/site-packages/scholarly/publication_parser.py", line 345, in fill
    parsed_bib = remap_bib(bibtexparser.loads(bibtex,parser).entries[-1], _BIB_MAPPING, _BIB_DATATYPES)
IndexError: list index out of range
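
For context, the failing expression can be reproduced in isolation (a sketch using bibtexparser's public API, not scholarly itself): when the fetched text contains no BibTeX entries, .entries is an empty list and entries[-1] raises exactly this IndexError.

import bibtexparser
from bibtexparser.bparser import BibTexParser

# Sketch only: feeding non-BibTeX text (e.g. a blocked HTML page) to the parser
# yields zero entries, so indexing [-1] fails the same way as line 345 above.
parser = BibTexParser()
entries = bibtexparser.loads("<!DOCTYPE html> not a citation", parser).entries
print(entries)    # []
entries[-1]       # IndexError: list index out of range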

ipeirotis commented 3 years ago

Without knowing the query, it is not possible to replicate the issue.

FranciscoKnebel commented 3 years ago

> Without knowing the query, it is not possible to replicate the issue.

Of course. The query I am using is just a simple string, with two keywords.

query = '\"digital twin\" \"cloud\"'
FranciscoKnebel commented 3 years ago


On the 5th item, with query = '\"digital twin\"', it happened again. The bug does not seem to be deterministic, so it may take a couple of tries.

ipeirotis commented 3 years ago

Could not replicate.

The following code, executed on Colab:

!pip3 install -U scholarly
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.Luminati(usr="....", passwd="....", proxy_port="....")
scholarly.use_proxy(pg)

query = '\"digital twin\" \"cloud\"'
search_query = scholarly.search_pubs(query, year_low=2020)
r = next(search_query)
result = scholarly.fill(r)
scholarly.pprint(result)

Generated the following:

{'author_id': ['pRFdsgkAAAAJ', '', '', '7EC2OLgAAAAJ'],
 'bib': {'abstract': 'Battery management is critical to enhancing the safety, '
                     'reliability, and performance of the battery systems. '
                     'This paper presents a cloud battery management system '
                     'for battery systems to improve the computational power '
                     'and data storage capability by cloud',
         'author': 'Li, Weihan and Rentemeister, Monika and Badeda, Julia and '
                   'J{\\"o}st, Dominik and Schulte, Dominik and Sauer, Dirk '
                   'Uwe',
         'bib_id': 'li2020digital',
         'journal': 'Journal of Energy Storage',
         'pages': '101557',
         'pub_type': 'article',
         'pub_year': '2020',
         'publisher': 'Elsevier',
         'title': 'Digital twin for battery systems: Cloud battery management '
                  'system with online state-of-charge and state-of-health '
                  'estimation',
         'venue': 'Journal of Energy …',
         'volume': '30'},
 'citedby_url': '/scholar?cites=13311481987179348730&as_sdt=5,33&sciodt=0,33&hl=en',
 'eprint_url': 'https://www.sciencedirect.com/science/article/pii/S2352152X20308495',
 'filled': True,
 'gsrank': 1,
 'num_citations': 19,
 'pub_url': 'https://www.sciencedirect.com/science/article/pii/S2352152X20308495',
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3D%2522digital%2Btwin%2522%2B%2522cloud%2522%26hl%3Den%26as_sdt%3D0,33%26as_ylo%3D2020&citilm=1&json=&update_op=library_add&info=-gLWD1fiu7gJ&ei=7RFBYLiYJoi8ywTzlp64DA',
 'url_related_articles': '/scholar?q=related:-gLWD1fiu7gJ:scholar.google.com/&scioq=%22digital+twin%22+%22cloud%22&hl=en&as_sdt=0,33&as_ylo=2020',
 'url_scholarbib': '/scholar?q=info:-gLWD1fiu7gJ:scholar.google.com/&output=cite&scirp=0&hl=en'}
FranciscoKnebel commented 3 years ago

As I said, it's not deterministic. I'll try creating a test sample.

ipeirotis commented 3 years ago

If it is not deterministic, then it is very likely a proxy/network issue.

FranciscoKnebel commented 3 years ago

On your test sample, did you try running fill(next(...)) multiple times? The problem seems to happen after a couple of iterations.

ipeirotis commented 3 years ago

I listed the exact code that I used. I ran it multiple times, without issues.

FranciscoKnebel commented 3 years ago

This is the script I'm running, with credentials removed. I tested with the proxy suggested in https://github.com/scholarly-python-package/scholarly/issues/261, with the same results.

Basically, the idea is to keep an updated spreadsheet of the resulting publications. If a publication is already in the list, it is skipped. There are limits on the number of publications added and publications parsed per run. Scholar blocks requests quickly, so there is no way to test this without a proxy.

from scholarly import scholarly, ProxyGenerator
from datetime import datetime
from scraper_api import ScraperAPIClient

from models import Article, list_has_article
from sheets import get_articles_from_sheet, insert_article_in_sheet

MAX_ARTICLES_PER_RUN=10
MAX_PARSED_ARTICLES=50

# ProxyGenerator subclass that routes scholarly's traffic through ScraperAPI.
class ScraperAPI(ProxyGenerator):
  def __init__(self, api_key):
    assert api_key is not None

    self._api_key = api_key
    self._client = ScraperAPIClient(api_key)

    super(ScraperAPI, self).__init__()

    self._TIMEOUT = 120
    self._session = self._client
    self._session.proxies = {}

  def _new_session(self):
    self.got_403 = False
    return self._session

  def _close_session(self):
    pass  # no need to close the ScraperAPI client

def main():
  print("Starting scholar parser...")
  pg = ScraperAPI('...')
  scholarly.use_proxy(pg)
  scholarly.set_timeout(120)

  print("Proxy set")

  print("Getting saved articles...")
  sheet_articles = get_articles_from_sheet() # this returns a list of Articles.
  print("Saved articles received.")

  print("Defining search query...")
  search_query = scholarly.search_pubs('\"digital twin\" \"cloud\"', year_low=2020)

  loop_count = 0
  articles_found = 0
  print('Starting article loop')
  while articles_found < MAX_ARTICLES_PER_RUN and loop_count < MAX_PARSED_ARTICLES:
    print('Getting next result...')
    result = scholarly.fill(next(search_query))
    loop_count += 1

    entry = Article(
      result["bib"]["title"],
      result["bib"]["abstract"],
      result["num_citations"],
      result["bib"]["journal"],
      result["bib"]["pub_type"],
      result["pub_url"],
      datetime.now().strftime("%d/%m/%Y %H:%M:%S"),
      datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    )

    print('Article \"' + entry.title + '\".')

    if list_has_article(sheet_articles, entry):
      print('\tArticle already on list. Skipping...')
    else:
      articles_found += 1

      print('\tInserting article in spreadsheet')
      insert_article_in_sheet(article=entry)
      sheet_articles.append(entry)

  print("Parser closing.")

  print("\tArticles found:", articles_found)
  print("\tSheet articles:", len(sheet_articles))
  print("\tLoop count:", loop_count)

if __name__ == '__main__':
  main()

The error appeared in this test on the second fill(next(...)) call. [screenshot]

Running on Ubuntu on WSL2, Python 3.8.5, scholarly 1.1.0

I'll try testing with Luminati, instead of this other proxy service.

FranciscoKnebel commented 3 years ago

Update:

Luminati with Data Center proxies is heavily blocked, always returning status 400/403 to Scholar requests. ScraperAPI also gets some 400/403 responses, but works intermittently.

I ran the script with all logging levels enabled and got the output below, right before it failed with the reported message. I believe blocked requests like this need extra error handling. It looks like scholarly is trying to parse the bibtex from this page, which would be incorrect, since there are no publications on the blocked error page.

2021-03-04 17:35:20,964 - bibtexparser.bparser - DEBUG - Store comment in list of comments:
'<!DOCTYPE html> Error 403 (Forbidden)!!1

403. That’s an error.

Your client does not have permission to get URL /scholar.bib?q=info:-gLWD1fiu7gJ:scholar.google.com/&output=citation&scisdr=CgUtGL0rGAA:AAGBfm0AAAAAYEFG3vaUcfxw7PuK4qhqepUChS_90-KU&scisig=AAGBfm0AAAAAYEFG3m7uEzpv70pEp442XaZBSU6rFmNT&scisf=4&ct=citation&cd=-1&hl=en from this server. (Client IP address: HIDDEN)

Please see Google\'s Terms of Service posted at http://www.google.com/terms_of_service.html

That’s all we know.'
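
For reference, the debug output above comes from enabling all log levels with the standard library before running the script (plain Python logging, nothing scholarly-specific):

import logging

# Show DEBUG messages from all loggers, including bibtexparser.bparser.
logging.basicConfig(level=logging.DEBUG)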

FranciscoKnebel commented 3 years ago

[screenshot]

And then it fails exactly as I already reported (the error on that remap_bib line). I believe the bibtex value here is this invalid HTML: the HTTP request returns status 200, but the body is the blocked HTML page I quoted above. Is there a check that detects this page?

If not, the problem is that scholarly may be trying to parse an invalid page as bibtex. Something like the following is what I mean by a check, as sketched after this paragraph.
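
A rough sketch only (the helper name and messages are made up; this is not scholarly's actual code):

import bibtexparser

def parse_bibtex_guarded(bibtex):
    # Hypothetical guard: reject responses that do not look like BibTeX
    # (e.g. a blocked/CAPTCHA HTML page) before indexing into .entries,
    # which is what currently raises the IndexError.
    if not bibtex.lstrip().startswith("@"):
        raise RuntimeError("Response is not BibTeX; Scholar probably blocked the request")
    entries = bibtexparser.loads(bibtex).entries
    if not entries:
        raise RuntimeError("BibTeX response parsed to zero entries")
    return entries[-1]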

ipeirotis commented 3 years ago

Searching publications gets blocked quickly. Filling publications (which triggers the bibtex call) gets blocked even faster.

I am afraid I do not have any recommendations for avoiding that behavior from Google Scholar. Perhaps use a shorter timeout for the proxy response and add a longer delay between requests.
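
Roughly something like this on the caller side, as a sketch only (the 30-second timeout and the 2-5 second delay are arbitrary illustrative values, not tested recommendations):

import random
import time
from scholarly import scholarly

scholarly.set_timeout(30)  # shorter timeout for the proxy response

search_query = scholarly.search_pubs('"digital twin" "cloud"', year_low=2020)
for _ in range(10):
    result = scholarly.fill(next(search_query))
    print(result["bib"]["title"])
    time.sleep(random.uniform(2, 5))  # space out requests to reduce blocking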

rnatella commented 3 years ago

I also sporadically get an IndexError. It is a variant of the error in the opening post; my script uses citedby instead of bibtex. It is probably a proxy/network issue, and making requests at a slower rate seems to help. It would be good to have extra error handling in scholarly to prevent the error.
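
In the meantime, a caller-side workaround along these lines (sketched here with a made-up wrapper name and arbitrary retry/wait values) seems reasonable:

import time
from scholarly import scholarly

def fill_with_retry(pub, retries=3, wait=30):
    # Hypothetical wrapper: when a blocked response surfaces as IndexError,
    # back off and retry instead of crashing the whole run.
    for attempt in range(retries):
        try:
            return scholarly.fill(pub)
        except IndexError:
            if attempt == retries - 1:
                raise
            time.sleep(wait)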

MoritzImendoerffer commented 3 years ago

I did add a wait between requests, drawn from a uniform distribution between 1 and 4 seconds, which helps.