SingleProxy returns True but failed to query

TingxunShi commented 1 year ago

Describe the bug: scholarly fails to run queries even though I set up a proxy and SingleProxy returns True. The code snippet is below:

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # True here
scholarly.use_proxy(pg)
search_query = scholarly.search_pubs('A paper title')
pub = next(search_query)
print(pub.bib['cites'])

The error is reported as:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
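TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.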

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "myenv\lib\site-packages\urllib3\connection.py", line 309, in connect
    conn = self._new_conn()
  File "myenv\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    raise NewConnectionError(
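urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.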

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "myenv\lib\site-packages\urllib3\util\retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\fp\fp.py", line 32, in get_proxy_list
    page = requests.get(self.__website(repeat))
  File "myenv\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "myenv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "myenv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scratch.py", line 10, in <module>
    scholarly.use_proxy(pg)
  File "myenv\lib\site-packages\scholarly\_scholarly.py", line 78, in use_proxy
    self.__nav.use_proxy(proxy_generator, secondary_proxy_generator)
  File "myenv\lib\site-packages\scholarly\_navigator.py", line 68, in use_proxy
    proxy_works = self.pm2.FreeProxies()
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 550, in FreeProxies
    proxy = self._proxy_gen(None)  # prime the generator
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 509, in _fp_coroutine
    all_proxies = freeproxy.get_proxy_list(repeat=False)  # free-proxy >= 1.1.0
  File "myenv\lib\site-packages\fp\fp.py", line 35, in get_proxy_list
    raise FreeProxyException(
fp.errors.FreeProxyException: Request to https://www.sslproxies.org failed

Desktop (please complete the following information):

Proxy service: Single Proxy, a socks5 proxy started locally
python version: 3.8
OS: Windows 10
Version: 1.7.11

Do you plan on contributing? Your response below will clarify whether the maintainers can expect you to fix the bug you reported.

[ ] Yes, I will create a Pull Request with the bugfix.

arunkannawadi commented 1 year ago

Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

TingxunShi commented 1 year ago

> Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

It reports scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar. However, the proxy itself seems to work. The modified code snippet is below:

import requests
from scholarly import scholarly, ProxyGenerator

proxies = {
    "http": "socks5://localhost:1208",
    "https": "socks5://localhost:1208"
}

url = 'https://api.ipify.org'
response = requests.get(url, proxies=proxies)
print(response.text) # code 200, returns a US IP address

pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success)              # Print True here
scholarly.use_proxy(pg, pg)
search_query = scholarly.search_pubs('Paper title here')
pub = next(search_query)
print(pub.bib['cites'])

arunkannawadi commented 1 year ago

A working proxy, with success = True, means the proxy is able to receive responses. However, Google Scholar might still identify the request as automated and block it. That means you'll need a more robust proxy.

TingxunShi commented 1 year ago

> A working proxy, with success = True, means the proxy is able to receive responses. However, Google Scholar might still identify the request as automated and block it. That means you'll need a more robust proxy.

I had considered that case, so I visited Google Scholar in a web browser through the same proxy, and it worked. However, I will also follow your suggestion and test with a more robust proxy.

TingxunShi commented 1 year ago

I have figured out the reason: I am behind a SOCKS proxy, but in _proxy_generator.py, if the proxy string doesn't start with "http", the prefix is added, so the configuration became "http": "http://socks5://localhost:1208". I removed that logic and now the response code is 200. However, this exposed another bug, in the captcha handling:

Traceback (most recent call last):
  File "lib\site-packages\scholarly\_navigator.py", line 132, in _get_page
    session = pm._handle_captcha2(pagerequest)
  File "lib\site-packages\scholarly\_proxy_generator.py", line 404, in _handle_captcha2
    cur_host = urlparse(self._get_webdriver().current_url).hostname
AttributeError: 'NoneType' object has no attribute 'current_url'
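
For anyone hitting the same proxy problem, here is a minimal sketch of the prefixing behavior I described and one possible scheme-aware alternative. Both function names are hypothetical and written for illustration only; this is not the actual code in _proxy_generator.py, just my understanding of what it effectively does to a SOCKS URL.

```python
def naive_normalize(proxy: str) -> str:
    """Hypothetical stand-in for the behavior described above:
    prepend 'http://' whenever the string does not start with 'http'."""
    if not proxy.startswith("http"):
        proxy = "http://" + proxy
    return proxy


def scheme_aware_normalize(proxy: str, default_scheme: str = "http") -> str:
    """Hypothetical alternative: add a default scheme only when the
    proxy string carries no scheme at all (no '://' present)."""
    if "://" in proxy:
        return proxy
    return f"{default_scheme}://{proxy}"


# A SOCKS URL gets mangled by the naive check...
print(naive_normalize("socks5://localhost:1208"))         # http://socks5://localhost:1208
# ...but is left untouched by the scheme-aware one.
print(scheme_aware_normalize("socks5://localhost:1208"))  # socks5://localhost:1208
# A bare host:port still receives the default scheme.
print(scheme_aware_normalize("localhost:8080"))           # http://localhost:8080
```

Checking for "://" rather than for an "http" prefix keeps the convenience for bare host:port strings while leaving socks5:// and similar URLs alone.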

arunkannawadi commented 1 year ago

The captcha failure above is definitely a legitimate bug, and I'm fixing it right now. Thank you for reporting this.
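
The AttributeError in the traceback comes from reading current_url on a webdriver that is None. As a rough illustration of the kind of guard that avoids it, here is a sketch; current_host and FakeDriver are hypothetical names made up for this example, and this is not necessarily what the actual fix looks like.

```python
from typing import Optional
from urllib.parse import urlparse


def current_host(driver) -> Optional[str]:
    """Hypothetical helper: return the hostname of the driver's current
    URL, or None when no webdriver is available."""
    if driver is None:
        return None
    return urlparse(driver.current_url).hostname


class FakeDriver:
    """Stand-in for a webdriver, exposing only current_url (assumption)."""
    current_url = "https://scholar.google.com/scholar?q=test"


print(current_host(None))          # None
print(current_host(FakeDriver()))  # scholar.google.com
```

A caller can then treat a None result as "no browser session to inspect" instead of crashing inside the captcha handler.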

scholarly-python-package / scholarly

SingleProxy returns True but failed to query #498