scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/

SingleProxy returns True but failed to query #498

Open TingxunShi opened 1 year ago

TingxunShi commented 1 year ago

Describe the bug: scholarly could not complete a query even though I set up a proxy and SingleProxy returned True. The code snippet is below:

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # True here
scholarly.use_proxy(pg)
search_query = scholarly.search_pubs('A paper title')
pub = next(search_query)
print(pub.bib['cites'])

The error reported is:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "myenv\lib\site-packages\urllib3\connection.py", line 309, in connect
    conn = self._new_conn()
  File "myenv\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "myenv\lib\site-packages\urllib3\util\retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\fp\fp.py", line 32, in get_proxy_list
    page = requests.get(self.__website(repeat))
  File "myenv\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "myenv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "myenv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scratch.py", line 10, in <module>
    scholarly.use_proxy(pg)
  File "myenv\lib\site-packages\scholarly\_scholarly.py", line 78, in use_proxy
    self.__nav.use_proxy(proxy_generator, secondary_proxy_generator)
  File "myenv\lib\site-packages\scholarly\_navigator.py", line 68, in use_proxy
    proxy_works = self.pm2.FreeProxies()
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 550, in FreeProxies
    proxy = self._proxy_gen(None)  # prime the generator
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 509, in _fp_coroutine
    all_proxies = freeproxy.get_proxy_list(repeat=False)  # free-proxy >= 1.1.0
  File "myenv\lib\site-packages\fp\fp.py", line 35, in get_proxy_list
    raise FreeProxyException(
fp.errors.FreeProxyException: Request to https://www.sslproxies.org failed

arunkannawadi commented 1 year ago

Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?
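
Something along these lines, reusing the SOCKS endpoint from your report (the paper title is a placeholder); the second argument is the secondary ProxyGenerator used as a fallback:

from scholarly import scholarly, ProxyGenerator

# Same SOCKS endpoint as in the report, used for both the primary and
# the secondary proxy generator (second argument of use_proxy).
pg = ProxyGenerator()
pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
scholarly.use_proxy(pg, pg)

search_query = scholarly.search_pubs('A paper title')
print(next(search_query))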

TingxunShi commented 1 year ago

Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

It reports scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar. However, the proxy itself seems to work. The modified code snippet is shown below:

import requests
from scholarly import scholarly, ProxyGenerator

proxies = {
    "http": "socks5://localhost:1208",
    "https": "socks5://localhost:1208"
}

url = 'https://api.ipify.org'
response = requests.get(url, proxies=proxies)
print(response.text) # code 200, returns a US IP address

pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success)              # Print True here
scholarly.use_proxy(pg, pg)
search_query = scholarly.search_pubs('Paper title here')
pub = next(search_query)
print(pub.bib['cites'])

arunkannawadi commented 1 year ago

A working proxy with success = True only means that it is able to receive responses. However, Google Scholar might still identify the request as automated and block it. In that case you'll need a more robust proxy.
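
For example, if your scholarly version provides the ScraperAPI backend, you can point the ProxyGenerator at that service instead of a single SOCKS endpoint (sketch only; the key below is a placeholder):

from scholarly import scholarly, ProxyGenerator

# Sketch: use a rotating-proxy backend instead of a single SOCKS endpoint.
# The ScraperAPI key is a placeholder; pg.FreeProxies() is another option.
pg = ProxyGenerator()
success = pg.ScraperAPI('YOUR_SCRAPERAPI_KEY')
if success:
    scholarly.use_proxy(pg)
    print(next(scholarly.search_pubs('A paper title')))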

TingxunShi commented 1 year ago

A working proxy with success = True only means that it is able to receive responses. However, Google Scholar might still identify the request as automated and block it. In that case you'll need a more robust proxy.

I had already considered that case, so I visited Google Scholar via a web browser through the same proxy and it worked. However, I will also follow your suggestion and look for a more robust proxy.

TingxunShi commented 1 year ago

I have figured out the reason: I am behind a SOCKS proxy, but in _proxy_generator.py the "http://" prefix is added whenever the proxy URL does not start with "http", so the configuration became "http": "http://socks5://localhost:1208". I removed the corresponding logic and now the response code is 200 (see the sketch after the traceback below). However, another bug involving CAPTCHA handling was triggered:

Traceback (most recent call last):
  File "lib\site-packages\scholarly\_navigator.py", line 132, in _get_page
    session = pm._handle_captcha2(pagerequest)
  File "lib\site-packages\scholarly\_proxy_generator.py", line 404, in _handle_captcha2
    cur_host = urlparse(self._get_webdriver().current_url).hostname
AttributeError: 'NoneType' object has no attribute 'current_url'
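
For reference, the kind of scheme-aware guard that would avoid the bad prefix looks roughly like this (illustrative only, not the actual _proxy_generator.py code):

# Illustrative guard: only prepend "http://" when the proxy URL has no
# scheme at all, so "socks5://..." URLs are left untouched.
def normalize_proxy(proxy: str) -> str:
    if "://" not in proxy:
        proxy = "http://" + proxy
    return proxy

assert normalize_proxy("socks5://localhost:1208") == "socks5://localhost:1208"
assert normalize_proxy("localhost:1208") == "http://localhost:1208"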

arunkannawadi commented 1 year ago

The error above regarding the CAPTCHA failure is definitely a legitimate bug that I'm fixing right now. Thank you for reporting this.
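
For context, the AttributeError comes from dereferencing a webdriver that was never created; a guard along these lines (illustrative names, not the actual patch) would surface a clearer error instead:

from urllib.parse import urlparse

def captcha_host(get_webdriver):
    """Sketch: resolve the CAPTCHA host only if a webdriver actually exists."""
    driver = get_webdriver()
    if driver is None:
        # Fail with a clear message instead of an AttributeError on NoneType.
        raise RuntimeError("Cannot handle CAPTCHA: no webdriver is available.")
    return urlparse(driver.current_url).hostname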