scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

search_pubs does not work while search_author works perfectly #222

Closed nicolasmauhe closed 3 years ago

nicolasmauhe commented 3 years ago

Using search_author works perfectly, while search_pubs raises "Exception: Cannot fetch the page from Google Scholar". See this example on Colab:

https://colab.research.google.com/drive/1EblDEYpQZMCFef0VBtmhTuHeVmda1w86?usp=sharing

If this comes from a Google ban (which, I understand, is to be suspected with this exception), how come search_author still works like a charm? Using Tor does not change the result, and scholar.log does not say anything... Thank you for any help with this!

MikhailYankelevich commented 3 years ago

Same here. It worked once for one search, but after that, nothing. The same thing happens when using search_pubs_custom_url, but not when searching for authors. I am using the Tor network here.

  File "/usr/local/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/anaconda3/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "googleScholar.py", line 24, in record_citations
    root_url = scholarly.search_pubs(search_string)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/scholarly/_scholarly.py", line 120, in search_pubs
    return self.__nav.search_publications(url)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/scholarly/_navigator.py", line 256, in search_publications
    return _SearchScholarIterator(self, url)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/scholarly/publication_parser.py", line 53, in __init__
    self._load_url(url)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/scholarly/publication_parser.py", line 58, in _load_url
    self._soup = self._nav._get_soup(url)
  File "/usr/local/anaconda3/lib/python3.7/site-packages/scholarly/_navigator.py", line 200, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
  File "/usr/local/anaconda3/lib/python3.7/site-packages/scholarly/_navigator.py", line 152, in _get_page
    raise Exception("Cannot fetch the page from Google Scholar.")
Exception: Cannot fetch the page from Google Scholar.
ipeirotis commented 3 years ago

This means that Google is blocking you and you need to use proxies. Notice that in the robots.txt at https://scholar.google.com/robots.txt the author search (/citations?user=) is explicitly allowed, while the publication search URLs are disallowed.

As a workaround, I added my Luminati proxy and it seems to work.

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.Luminati(usr="....", passwd="....", proxy_port="....")
scholarly.use_proxy(pg)
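
The robots.txt distinction mentioned above can be checked locally with Python's standard-library robots parser. This is a minimal sketch that feeds in a simplified excerpt of the rules (the live file at https://scholar.google.com/robots.txt is longer and may differ, so treat these lines as an illustration, not the authoritative policy):

```python
from urllib.robotparser import RobotFileParser

# Simplified excerpt modeled on scholar.google.com/robots.txt:
# author-profile pages are allowed, publication search is not.
rules = """\
User-agent: *
Allow: /citations?user=
Disallow: /citations?
Disallow: /scholar
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Author search URL (what search_author hits) is permitted...
print(rp.can_fetch("*", "https://scholar.google.com/citations?user=abc"))

# ...while a publication search URL (what search_pubs hits) is not.
print(rp.can_fetch("*", "https://scholar.google.com/scholar?q=deep+learning"))
```

This matches the observed behavior in the thread: author lookups keep working while publication searches get blocked quickly.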
darsh10 commented 3 years ago

Does this exact code work?

darsh10 commented 3 years ago

Exact code with the proxy generator?

MikhailYankelevich commented 3 years ago

@darsh10 Yes, and Tor works fine too, but Google blocks the requests no matter what you use after about 10 requests in 1 minute. At least this is the case for me.
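
(Building on that observation: if the limit really is about 10 requests per minute, a simple client-side throttle can help stay under it. The threshold below is an assumption taken from this thread, not a documented Google limit, and the class is a hypothetical helper, not part of scholarly's API.)

```python
import time
from collections import deque

class Throttle:
    """Track call timestamps and report how long to wait to stay
    under max_calls per window_seconds (a sliding window)."""

    def __init__(self, max_calls=10, window_seconds=60.0):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = deque()

    def delay_needed(self, now):
        # Forget calls that have fallen out of the sliding window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        # Wait until the oldest call ages out of the window.
        return self.window - (now - self.calls[0])

    def record(self, now):
        self.calls.append(now)

# Usage sketch before each scholarly call:
#   t = Throttle()
#   d = t.delay_needed(time.monotonic())
#   if d > 0:
#       time.sleep(d)
#   t.record(time.monotonic())
#   ... make the search_pubs request here ...
```

Even with this, a proxy may still be needed, since Google appears to block the publication-search endpoints aggressively regardless of pacing.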

nicolasmauhe commented 3 years ago

@ipeirotis Thank you, I'll look in this direction!

(For information, I just realized you can check whether you are banned by clicking "BibTeX" in the citation options (the " icon) for a Google Scholar article. If you are banned, you'll reach a 404 with Google's terms and conditions. Otherwise, you have access to everything in Scholar as usual, which made me doubt I was banned...)

darsh10 commented 3 years ago

Thanks for the response @MikhailYankelevich. Does that mean I can use this for as many links as I want, as long as the frequency is less than 10 a minute?

MikhailYankelevich commented 3 years ago

@darsh10 I'm not sure myself. Google seems to block everything I try with scholarly now. I use a random waiting time, so it's not because of repetition 🤷‍♂️