Open TingxunShi opened 1 year ago
Can you try with scholarly.use_proxy(pg, pg)
and see if that runs successfully?
It reports scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar. However, the proxy itself seems to work. The modified code snippet is shown below:
import requests
from scholarly import scholarly, ProxyGenerator
proxies = {
"http": "socks5://localhost:1208",
"https": "socks5://localhost:1208"
}
url = 'https://api.ipify.org'
response = requests.get(url, proxies=proxies)
print(response.text) # code 200, returns a US IP address
pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # Print True here
scholarly.use_proxy(pg, pg)
search_query = scholarly.search_pubs('Paper title here')
pub = next(search_query)
print(pub.bib['cites'])
A working proxy, with success = True, only means the proxy is able to receive responses. Google Scholar might still identify the request as automated and block it, in which case you'll need a more robust proxy.
I had considered the case you suggested, so I visited Google Scholar in a web browser through the same proxy, and it worked. However, I will also follow your suggestion and try a more robust proxy.
I have figured out the reason: I am behind a SOCKS proxy, but in _proxy_generator.py, if the proxy string doesn't start with "http", an "http://" prefix is added, so the configuration became "http": "http://socks5://localhost:1208". I removed that logic and now the response code is 200. However, this triggered another bug involving captcha handling:
Traceback (most recent call last):
File "lib\site-packages\scholarly\_navigator.py", line 132, in _get_page
session = pm._handle_captcha2(pagerequest)
File "lib\site-packages\scholarly\_proxy_generator.py", line 404, in _handle_captcha2
cur_host = urlparse(self._get_webdriver().current_url).hostname
AttributeError: 'NoneType' object has no attribute 'current_url'
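To illustrate the prefixing problem described above, here is a minimal sketch. It is not the actual code from _proxy_generator.py; both function names are hypothetical, and the first one only mimics the behaviour as reported in this thread. The second shows one possible alternative that adds a default scheme only when the proxy string has none.

```python
def naive_prefix(proxy: str) -> str:
    """Hypothetical: mimic the reported behaviour (prepend "http://"
    to anything that does not already start with "http")."""
    return proxy if proxy.startswith("http") else "http://" + proxy


def scheme_aware_prefix(proxy: str) -> str:
    """Hypothetical alternative: add "http://" only when the string
    carries no scheme at all, so socks5:// URLs pass through intact."""
    return proxy if "://" in proxy else "http://" + proxy


# The SOCKS URL from this issue gets mangled by the naive check...
print(naive_prefix("socks5://localhost:1208"))
# ...but is left untouched by the scheme-aware check.
print(scheme_aware_prefix("socks5://localhost:1208"))
# A bare host:port still receives the default scheme.
print(scheme_aware_prefix("localhost:1208"))
```

With the scheme-aware check, a bare host:port still defaults to HTTP, while any explicit scheme such as socks5:// is preserved.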
The error above regarding the captcha failure is definitely a legitimate bug, and I'm fixing it right now. Thank you for reporting this.
Describe the bug: scholarly does not work even though I set up a proxy and SingleProxy returns True. Code snippet is as below
error reported as: