FranciscoKnebel closed this issue 3 years ago.
Without knowing the query, it is not possible to replicate the issue.
Of course. The query I am using is just a simple string, with two keywords.
query = '\"digital twin\" \"cloud\"'
On the 5th item with query = '\"digital twin\"'
it happened again.
The bug does not seem to be deterministic, so it may take a couple tries.
Could not replicate.
The following code, executed on Colab:
!pip3 install -U scholarly
from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.Luminati(usr="....", passwd="....", proxy_port="....")
scholarly.use_proxy(pg)
query = '\"digital twin\" \"cloud\"'
search_query = scholarly.search_pubs(query, year_low=2020)
r = next(search_query)
result = scholarly.fill(r)
scholarly.pprint(result)
Generated the following:
{'author_id': ['pRFdsgkAAAAJ', '', '', '7EC2OLgAAAAJ'],
'bib': {'abstract': 'Battery management is critical to enhancing the safety, '
'reliability, and performance of the battery systems. '
'This paper presents a cloud battery management system '
'for battery systems to improve the computational power '
'and data storage capability by cloud',
'author': 'Li, Weihan and Rentemeister, Monika and Badeda, Julia and '
'J{\\"o}st, Dominik and Schulte, Dominik and Sauer, Dirk '
'Uwe',
'bib_id': 'li2020digital',
'journal': 'Journal of Energy Storage',
'pages': '101557',
'pub_type': 'article',
'pub_year': '2020',
'publisher': 'Elsevier',
'title': 'Digital twin for battery systems: Cloud battery management '
'system with online state-of-charge and state-of-health '
'estimation',
'venue': 'Journal of Energy …',
'volume': '30'},
'citedby_url': '/scholar?cites=13311481987179348730&as_sdt=5,33&sciodt=0,33&hl=en',
'eprint_url': 'https://www.sciencedirect.com/science/article/pii/S2352152X20308495',
'filled': True,
'gsrank': 1,
'num_citations': 19,
'pub_url': 'https://www.sciencedirect.com/science/article/pii/S2352152X20308495',
'source': 'PUBLICATION_SEARCH_SNIPPET',
'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3D%2522digital%2Btwin%2522%2B%2522cloud%2522%26hl%3Den%26as_sdt%3D0,33%26as_ylo%3D2020&citilm=1&json=&update_op=library_add&info=-gLWD1fiu7gJ&ei=7RFBYLiYJoi8ywTzlp64DA',
'url_related_articles': '/scholar?q=related:-gLWD1fiu7gJ:scholar.google.com/&scioq=%22digital+twin%22+%22cloud%22&hl=en&as_sdt=0,33&as_ylo=2020',
'url_scholarbib': '/scholar?q=info:-gLWD1fiu7gJ:scholar.google.com/&output=cite&scirp=0&hl=en'}
As I said, it's not deterministic. I'll try creating a test sample.
If it is not deterministic, then it is very likely a proxy/network issue.
On your test sample, did you try running fill(next(...)) multiple times? The problem seems to appear after a couple of iterations.
I listed the exact code that I used. I ran it multiple times, without issues.
This is the script I'm running, with credentials removed. I tested with the proxy suggested in https://github.com/scholarly-python-package/scholarly/issues/261, with the same results.
Basically, the idea is to keep an updated spreadsheet of the resulting publications. If a publication is already in the list, it is skipped. There are limits on the number of publications added and the number parsed per run. Scholar blocks unproxied requests quickly, so there is no way to test this without a proxy.
from scholarly import scholarly, ProxyGenerator
from datetime import datetime
from scraper_api import ScraperAPIClient
from models import Article, list_has_article
from sheets import get_articles_from_sheet, insert_article_in_sheet

MAX_ARTICLES_PER_RUN = 10
MAX_PARSED_ARTICLES = 50

class ScraperAPI(ProxyGenerator):
    def __init__(self, api_key):
        assert api_key is not None
        self._api_key = api_key
        self._client = ScraperAPIClient(api_key)
        super(ScraperAPI, self).__init__()
        self._TIMEOUT = 120
        self._session = self._client
        self._session.proxies = {}

    def _new_session(self):
        self.got_403 = False
        return self._session

    def _close_session(self):
        pass  # no need to close the ScraperAPI client

def main():
    print("Starting scholar parser...")
    pg = ScraperAPI('...')
    scholarly.use_proxy(pg)
    scholarly.set_timeout(120)
    print("Proxy set")

    print("Getting saved articles...")
    sheet_articles = get_articles_from_sheet()  # returns a list of Articles
    print("Saved articles received.")

    print("Defining search query...")
    search_query = scholarly.search_pubs('"digital twin" "cloud"', year_low=2020)

    loop_count = 0
    articles_found = 0
    print('Starting article loop')
    while articles_found < MAX_ARTICLES_PER_RUN and loop_count < MAX_PARSED_ARTICLES:
        print('Getting next result...')
        result = scholarly.fill(next(search_query))
        loop_count += 1
        entry = Article(
            result["bib"]["title"],
            result["bib"]["abstract"],
            result["num_citations"],
            result["bib"]["journal"],
            result["bib"]["pub_type"],
            result["pub_url"],
            datetime.now().strftime("%d/%m/%Y %H:%M:%S"),
            datetime.now().strftime("%d/%m/%Y %H:%M:%S")
        )
        print('Article "' + entry.title + '".')
        if list_has_article(sheet_articles, entry):
            print('\tArticle already on list. Skipping...')
        else:
            articles_found += 1
            print('\tInserting article in spreadsheet')
            insert_article_in_sheet(article=entry)
            sheet_articles.append(entry)

    print("Parser closing.")
    print("\tArticles found:", articles_found)
    print("\tSheet articles:", len(sheet_articles))
    print("\tLoop count:", loop_count)

if __name__ == '__main__':
    main()
The error appeared in this test on the second fill(next(...)) call.
Running on Ubuntu on WSL2, Python 3.8.5, scholarly 1.1.0
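Since the failure is intermittent, one workaround is to wrap the fill(next(...)) call in a retry with backoff. A minimal sketch, assuming the failure surfaces as an exception; fetch_with_retry is a hypothetical helper, not part of scholarly:

```python
import random
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure, back off and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except StopIteration:
            raise  # result iterator exhausted: retrying cannot help
        except Exception:
            if attempt == max_attempts:
                raise
            # Back off a little longer on each attempt, with jitter,
            # to avoid hammering the proxy with identical requests.
            sleep(base_delay * attempt + random.uniform(0, 1))

# Hypothetical usage with the script above:
# result = fetch_with_retry(lambda: scholarly.fill(next(search_query)))
```

This only papers over the symptom; if the underlying response is a blocked page, retrying from the same proxy IP may keep failing.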
I'll try testing with Luminati, instead of this other proxy service.
Update:
Luminati with Data Center IPs is heavily blocked, always returning status 400/403 for Scholar requests. ScraperAPI also gets some 400/403 responses, but works some of the time.
I ran the script with all logging levels enabled and got the output below, right before it failed with the reported message. I believe blocked requests like this need extra error handling. It looks like scholarly tries to parse this page as bibtex, which is incorrect, since the blocked error page contains no publications.
2021-03-04 17:35:20,964 - bibtexparser.bparser - DEBUG - Store comment in list of comments: '<!DOCTYPE html>
Error 403 (Forbidden)!!1 403. That’s an error.
Your client does not have permission to get URL
/scholar.bib?q=info:-gLWD1fiu7gJ:scholar.google.com/&output=citation&scisdr=CgUtGL0rGAA:AAGBfm0AAAAAYEFG3vaUcfxw7PuK4qhqepUChS_90-KU&scisig=AAGBfm0AAAAAYEFG3m7uEzpv70pEp442XaZBSU6rFmNT&scisf=4&ct=citation&cd=-1&hl=en
from this server. (Client IP address: HIDDEN)
\nPlease see Google\'s Terms of Service posted at http://www.google.com/terms_of_service.html\n
\n \r\n That’s all we know.'
After that, it fails exactly as I already reported (the error on that remap_bib line). I believe the bibtex value here is this blocked-page HTML: the request completes with HTTP status 200, but the body is the blocked page shown above. Is there a check that detects this page?
If not, the problem is that scholarly ends up trying to parse an invalid page.
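A hedged sketch of the kind of guard that could run before the response body is handed to bibtexparser. The marker strings are taken from the blocked page logged above; looks_like_block_page is a hypothetical helper, not scholarly's actual code:

```python
def looks_like_block_page(text: str) -> bool:
    """Heuristic: does this response body look like Google's block page?

    The bibtex endpoint can return this HTML with an HTTP 200 status, and
    bibtexparser then fails further down. Marker strings are taken from
    the blocked page in the log above.
    """
    markers = (
        "Error 403 (Forbidden)",
        "does not have permission to get URL",
    )
    looks_html = text.lstrip().lower().startswith(("<!doctype html", "<html"))
    return looks_html and any(marker in text for marker in markers)
```

With a check like this, the library could raise a descriptive "blocked by Scholar" error instead of the confusing parse failure.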
Searching publications gets blocked quickly. Filling publications (which triggers the bibtex call) gets blocked even faster.
I am afraid that I do not have any recommendations for avoiding that behavior from Google Scholar. Perhaps use a shorter timeout for the proxy response, and add a longer delay between requests.
I also sporadically get an IndexError. It is a variant of the error in the opening post; my script uses citedby instead of bibtex. It is probably a proxy/network issue. Making requests at a slower rate seems to help. It would be good to have extra error handling in scholarly to prevent the error.
I added a delay between requests, drawn from a uniform distribution between 1 and 4 seconds, which helps.
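The delay described above can be sketched as a small helper; polite_pause is a hypothetical name, and the 1–4 second bounds are the values mentioned in this thread:

```python
import random
import time

def polite_pause(low=1.0, high=4.0, sleep=time.sleep):
    """Sleep for a random duration in [low, high] seconds between requests.

    A delay drawn from a uniform distribution makes the request pattern
    less regular, which seems to reduce how quickly Scholar blocks.
    """
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```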
Code executed:
Python complains while executing line 345 of scholarly/publication_parser.py with the following message:
IndexError: list index out of range
Probably due to undefined values producing an empty list? The search query breaks because of this.
Traceback below: