tonyelhabr closed 2 years ago
I can't reproduce this and I've got no clue what could be the problem here.
Could you check the following:
Does it work with curl?
$ curl -x socks5h://localhost:9050 https://check.torproject.org/api/ip
{"IsTor":true,"IP":"..."}
Does it work with google chrome?
Launch Chrome with
$ google-chrome --user-data-dir="/tmp" --proxy-server="socks5://127.0.0.1:9050" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE myproxy"
and check whether Tor works by browsing to https://check.torproject.org/
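The curl check above can also be scripted. The sketch below only parses the JSON body that https://check.torproject.org/api/ip returns (the shape shown in the curl output) and does not perform the request itself. The function name is_tor_exit is hypothetical and not part of soccerdata.

```python
import json


def is_tor_exit(body):
    """Return True if a check.torproject.org/api/ip response body reports Tor.

    `body` is the raw JSON text, e.g. the output of the curl command above.
    Hypothetical helper for illustration only.
    """
    data = json.loads(body)
    # A missing "IsTor" key is treated as "not using Tor".
    return bool(data.get("IsTor", False))


if __name__ == "__main__":
    # Same shape as the curl output in this thread; the IP stays elided.
    print(is_tor_exit('{"IsTor":true,"IP":"..."}'))
```

Fetching the body through the SOCKS proxy would need a client with SOCKS support, which is left out here.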
tony@desktop:/c/Users/antho$ curl -x socks5h://localhost:9050 https://check.torproject.org/api/ip
{"IsTor":true,"IP":"..."}
google-chrome command:
C:\Program Files\Google\Chrome\Application>chrome.exe --user-data-dir="c:/users/antho/downloads" --proxy-server="socks5://127.0.0.1:9050" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE 127.0.0.1"
I've tried setting path_to_browser to my chromedriver and to the normal Chrome executable. I've also tried not setting it. All result in the same error 🤷
ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021, use_tor=True, path_to_browser="c:\\users\\antho\\downloads\\chromedriver.exe")
ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021, use_tor=True, path_to_browser="C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe")
ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021, use_tor=True)
I'm not super familiar with Python debugging. Is there a good way for me to stop the execution somewhere in the selenium call for self.execute(Command.GET, {'url': url})? This seems to be where the error handling dispatches.
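One generic way to stop where an exception is raised is Python's built-in pdb debugger in post-mortem mode. The sketch below is not specific to selenium or soccerdata. The wrapper name run_with_postmortem and its debug flag are hypothetical, and the commented usage line assumes the read_events call mentioned in this thread.

```python
import pdb


def run_with_postmortem(func, *args, debug=True, **kwargs):
    """Call func(*args, **kwargs); on an exception, open pdb where it was raised.

    Hypothetical helper for illustration. With debug=False the exception is
    simply re-raised without starting the debugger.
    """
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        if debug:
            # Starts an interactive prompt in the frame that raised.
            # Useful commands: `w` (stack), `u`/`d` (move between frames),
            # `p name` (print a variable), `q` (quit).
            pdb.post_mortem(exc.__traceback__)
        raise


# Assumed usage, based on the calls shown in this thread:
# run_with_postmortem(ws.read_events)
```

Alternatively, running a script with python -m pdb script.py drops into the same post-mortem prompt on an uncaught exception, and a breakpoint() call placed in your own code pauses execution before the failing call.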
Ok. Tor clearly functions properly. That means it has to be an issue with selenium / undetected_chromedriver.
I use undetected_chromedriver, which is a patched version of the original chromedriver to avoid detection by bot mitigation systems. I would first make sure it is not this patched version that causes your problem by running the code below. You'll first have to download the appropriate chromedriver version for your system from https://chromedriver.chromium.org/downloads.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()  # must be created before adding arguments
proxy = "socks5://127.0.0.1:9050"
resolver_rules = "MAP * 0.0.0.0 , EXCLUDE myproxy"
chrome_options.add_argument("--headless") # maybe try without this line too
chrome_options.add_argument("--proxy-server=" + proxy)
chrome_options.add_argument("--host-resolver-rules=" + resolver_rules)
driver = webdriver.Chrome('<path to...>/chromedriver', options=chrome_options)
driver.get("https://check.torproject.org/api/ip")
driver.page_source
If this does not work, you could try with some additional arguments (Google for "windows selenium tor proxy") or create an issue in the selenium repo.
If it works and the code below does not (it shouldn't, since this snippet is copied from soccerdata's source code), it is an issue with undetected-chromedriver and you should create an issue in its repository.
import undetected_chromedriver as uc
from selenium.webdriver.chrome.options import Options
chrome_options = Options()  # must be created before adding arguments
proxy = "socks5://127.0.0.1:9050"
resolver_rules = "MAP * 0.0.0.0 , EXCLUDE myproxy"
chrome_options.add_argument("--headless") # maybe try without this line too
chrome_options.add_argument("--proxy-server=" + proxy)
chrome_options.add_argument("--host-resolver-rules=" + resolver_rules)
driver = uc.Chrome(options=chrome_options)
driver.get("https://check.torproject.org/api/ip")
driver.page_source
The major thing I had to change in your snippets was replacing myproxy with the actual value of the proxy, 127.0.0.1. Is that supposed to be an environment variable?
The first worked for me on my second try. My first try was blocked, so I see why you might prefer undetected_chromedriver.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
proxy = "socks5://127.0.0.1:9050"
resolver_rules = "MAP * 0.0.0.0 , EXCLUDE 127.0.0.1"
chrome_options.add_argument("--proxy-server=" + proxy)
chrome_options.add_argument("--host-resolver-rules=" + resolver_rules)
driver = webdriver.Chrome('c:\\users\\antho\\downloads\\chromedriver.exe', options=chrome_options)
driver.get("https://check.torproject.org/api/ip")
driver.page_source
'<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"IsTor":true,"IP":"..."}</pre></body></html>'
import json
driver.get('https://www.whoscored.com/Matches/1485477/Live/England-Premier-League-2020-2021-Crystal-Palace-Manchester-City')
element = driver.find_element(by='xpath',value='//*[@id="layout-wrapper"]/script[1]')
script_content = element.get_attribute('innerHTML')
script_ls = script_content.split(sep=" ")
script_ls = list(filter(None, script_ls))
script_ls = [name for name in script_ls if name.strip()]
dictstring = script_ls[2][17:-2]
matchdict = json.loads(dictstring)
matchdict['score']
'0 : 2'
The second snippet with undetected_chromedriver worked, after the replacement of myproxy.
import undetected_chromedriver as uc
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
proxy = "socks5://127.0.0.1:9050"
resolver_rules = "MAP * 0.0.0.0 , EXCLUDE 127.0.0.1"
chrome_options.add_argument("--proxy-server=" + proxy)
chrome_options.add_argument("--host-resolver-rules=" + resolver_rules)
driver = uc.Chrome(options=chrome_options)
driver.get("https://check.torproject.org/api/ip") ## worked
driver.get('https://www.whoscored.com/Matches/1485477/Live/England-Premier-League-2020-2021-Crystal-Palace-Manchester-City')
element = driver.find_element(by='xpath',value='//*[@id="layout-wrapper"]/script[1]')
This gist seems to indicate that the proxy's own address needs to be the value specified after EXCLUDE in resolver_rules.
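To keep the EXCLUDE value from drifting away from the proxy address, both Chrome flags can be derived from one host and port. This is a minimal sketch based on the snippets in this thread, not the fix that was pushed to soccerdata. The helper name tor_chrome_args is hypothetical, and the defaults are the values used above.

```python
def tor_chrome_args(host="127.0.0.1", port=9050):
    """Build the two Chrome flags used in this thread for a SOCKS5 Tor proxy.

    Hypothetical helper: the same `host` is used for the proxy address and
    for the EXCLUDE entry of the resolver rules, so the two cannot diverge.
    """
    return [
        "--proxy-server=socks5://{}:{}".format(host, port),
        # Map every hostname to 0.0.0.0 for local resolution, except the proxy host.
        "--host-resolver-rules=MAP * 0.0.0.0 , EXCLUDE {}".format(host),
    ]


if __name__ == "__main__":
    for arg in tor_chrome_args():
        print(arg)
```

Each returned string can be passed to chrome_options.add_argument(...) in place of the hand-written arguments above.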
Oh yes, that makes sense! I copy-pasted the resolver rules and forgot to change myproxy to 127.0.0.1. Actually, it is odd that it works on my system.
Thanks for debugging this! I'll push a fix in a couple of minutes.
happy to help!
I tried to set use_tor=True for downloading events for a match with Tor running in the background, but read_events ended with an error indicating that the proxy connection failed.
Here's what my terminal looks like with Tor running (prior to calling read_events()):
I've opened my browser to the port to verify that something is running, although this is using an HTTP proxy, so the warning here is expected.