probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

[WhoScored] ConnectionError: Could not download https://www.whoscored.com. #366

Closed by ds-oliver 11 months ago

ds-oliver commented 12 months ago

I think the logs should have all necessary info to cite this issue.

Imports:

    import tqdm
    from pathlib import Path
    import soccerdata as sd
    from socceraction.data.opta import OptaLoader
    import socceraction.spadl as spadl
    import pandas as pd
    import datetime
    import os
    import warnings
    import pickle
    import socceraction.atomic.spadl as atomicspadl
    import zipfile
    from io import BytesIO
    from urllib.request import urlretrieve

Code:

    # Initialize the WhoScored object
    ws = sd.WhoScored(
        leagues=["ENG-Premier League"],
        seasons=2223,
        headless=True
    )

    api = ws.read_events(output_fmt='loader')

Traceback:

    ConnectionError                           Traceback (most recent call last)
    /Users/hogan/soccerdata/scrape.ipynb Cell 2 line 8
          1 # Initialize the WhoScored object
          2 ws = sd.WhoScored(
          3     leagues=["ENG-Premier League"],
          4     seasons=2223,
          5     headless=True
          6 )
    ----> 8 api = ws.read_events(output_fmt='loader')

    File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:667, in WhoScored.read_events(self, match_id, force_cache, live, output_fmt)
        664 urlmask = WHOSCORED_URL + "/Matches/{}/Live"
        665 filemask = "events/{}_{}/{}.json"
    --> 667 df_schedule = self.read_schedule(force_cache).reset_index()
        668 if match_id is not None:
        669     iterator = df_schedule[
        670         df_schedule.game_id.isin([match_id] if isinstance(match_id, int) else match_id)
        671     ]

    File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:370, in WhoScored.read_schedule(self, force_cache)
        357 def read_schedule(self, force_cache: bool = False) -> pd.DataFrame:
        358     """Retrieve the game schedule for the selected leagues and seasons.
        359
        360     Parameters
       (...)
        368     pd.DataFrame
        369     """
    --> 370 df_seasons = self.read_seasons()
        371 filemask = "matches/{}_{}.csv"
        373 all_schedules = []

    File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:246, in WhoScored.read_seasons(self)
        239 def read_seasons(self) -> pd.DataFrame:
        240     """Retrieve the selected seasons for the selected leagues.
        241
        242     Returns
        243     -------
        244     pd.DataFrame
        245     """
    --> 246 df_leagues = self.read_leagues()
        248 seasons = []
        249 for lkey, league in df_leagues.iterrows():

    File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/whoscored.py:212, in WhoScored.read_leagues(self)
        210 url = WHOSCORED_URL
        211 filepath = self.data_dir / "tiers.json"
    --> 212 reader = self.get(url, filepath, var="allRegions")
        214 data = json.load(reader)
        216 leagues = []

    File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/_common.py:132, in BaseReader.get(self, url, filepath, max_age, no_cache, var)
        130 if no_cache or self.no_cache or not is_cached:
        131     logger.debug("Scraping %s", url)
    --> 132     return self._download_and_save(url, filepath, var)
        133 logger.debug("Retrieving %s from cache", url)
        134 assert filepath is not None

    File ~/soccerdata/scrape_env/lib/python3.9/site-packages/soccerdata/_common.py:452, in BaseSeleniumReader._download_and_save(self, url, filepath, var)
        449         self._driver = self._init_webdriver()
        450         continue
    --> 452 raise ConnectionError("Could not download %s." % url)

ConnectionError: Could not download https://www.whoscored.com/.

Edit to add context/files:

I have since tried running the scraper over Tor using proxy='Tor', and also by defining the proxies as a dict.

Screenshot by Dropbox Capture

https://github.com/probberechts/soccerdata/assets/77216918/13aafeb1-2e64-4dac-b115-0799c93e1afb

error.log

OnlineAnalytics commented 11 months ago

Unfortunately it doesn’t look like they’ll do anything to try and fix it. Will need to find another means of scraping.

aegonwolf commented 11 months ago

Hmm, I do get this now too.

aegonwolf commented 11 months ago

> Unfortunately it doesn’t look like they’ll do anything to try and fix it. Will need to find another means of scraping.

I think "they" is a single person, and this is not necessarily a helpful comment. People have work and lives, and we enjoy an awesome free package that the author has spent a lot of time and effort building.

OnlineAnalytics commented 11 months ago

> > Unfortunately it doesn’t look like they’ll do anything to try and fix it. Will need to find another means of scraping.
>
> I think "they" is a single person, and this is not necessarily a helpful comment. People have work and lives, and we enjoy an awesome free package that the author has spent a lot of time and effort building.

I know it's a single person. Hence me using the singular pronoun. You don't really need to try and start drama where there isn't any.

probberechts commented 11 months ago

I do not have this issue, so I am unable to fix it as I would have no way to verify it.

It looks like WhoScored does a security check. I do not know why it does it, but here are two options:

  1. If you only see the "checking if the site connection is secure" window when you use soccerdata and not when manually browsing to the website, it might have detected that you are a bot. In that case, you might find some help/tips in the undetected-chromedriver repo on how to avoid detection. Also make sure to update the "undetected-chromedriver" dependency to its latest version.
  2. It might be your IP address / location / network provider / ... that triggers the security check. Using a proxy or VPN might resolve it.

What happens after the "verifying..." step? Does it show a captcha, or does it simply redirect to the WhoScored webpage? In the latter case, a straightforward solution could be to check whether the current page contains the text "checking if the site connection is secure" and wait until it redirects before proceeding. You can add that after this line.
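The wait-until-redirect idea above could be sketched roughly as follows. This is only a minimal illustration, not code from soccerdata: the helper name `wait_for_security_check` and its parameters are hypothetical, and it assumes only that the driver exposes Selenium's `page_source` attribute. Where exactly to call it inside soccerdata is not shown here.

```python
import time

# Text shown on the interstitial page, as quoted in this thread.
SECURITY_CHECK_TEXT = "checking if the site connection is secure"


def wait_for_security_check(driver, timeout=30.0, poll_interval=0.5,
                            marker=SECURITY_CHECK_TEXT):
    """Poll until `marker` is no longer in the page, or `timeout` elapses.

    Hypothetical helper: `driver` is any object with a Selenium-style
    `page_source` attribute. Returns True once the marker text is gone,
    False if it is still present after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Case-insensitive match, since the page capitalisation may vary.
        if marker.lower() not in driver.page_source.lower():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

If this returns False, the page never moved past the check (for example because a captcha is shown), and waiting longer is unlikely to help.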

ds-oliver commented 11 months ago

> I do not have this issue, so I am unable to fix it as I would have no way to verify it.
>
> It looks like WhoScored does a security check. I do not know why it does it, but here are two options:
>
>   1. If you only see the "checking if the site connection is secure" window when you use soccerdata and not when manually browsing to the website, it might have detected that you are a bot. In that case, you might find some help/tips in the undetected-chromedriver repo on how to avoid detection. Also make sure to update the "undetected-chromedriver" dependency to its latest version.
>   2. It might be your IP address / location / network provider / ... that triggers the security check. Using a proxy or VPN might resolve it.
>
> What happens after the "verifying..." step? Does it show a captcha, or does it simply redirect to the WhoScored webpage? In the latter case, a straightforward solution could be to check whether the current page contains the text "checking if the site connection is secure" and wait until it redirects before proceeding.

Funny you should mention adding a wait period. I actually had already done so...

[Screenshot by Dropbox Capture]

Any other suggestions?

TimelessUsername commented 11 months ago

Running with headless=False (while on Selenium 4.12 or lower) does the trick.
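As a rough sketch of this workaround: the snippet below checks whether the installed Selenium is 4.12 or lower and builds the scraper from the original report with `headless=False`. The helper names (`parse_major_minor`, `selenium_within_limit`, `installed_selenium_version`, `make_whoscored_scraper`) are made up for illustration, and the 4.12 limit is taken from this comment rather than from any documentation.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '4.12.0'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def selenium_within_limit(version: str, limit: Tuple[int, int] = (4, 12)) -> bool:
    """True if `version` is at or below `limit` (4.12, per this thread)."""
    return parse_major_minor(version) <= limit


def installed_selenium_version() -> Optional[str]:
    """Version of the installed selenium package, or None if not installed."""
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


def make_whoscored_scraper():
    """Build the scraper from the original report, but with a visible browser."""
    import soccerdata as sd  # third-party; imported lazily on purpose

    return sd.WhoScored(
        leagues=["ENG-Premier League"],
        seasons=2223,
        headless=False,  # the only change from the failing example
    )
```

Calling `selenium_within_limit(installed_selenium_version())` before `make_whoscored_scraper()` gives a quick sanity check of the environment when Selenium is installed. Note that `installed_selenium_version()` returns None when it is not.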

ds-oliver commented 11 months ago

@TimelessUsername @probberechts

> Running with headless=False (while on Selenium 4.12 or lower) does the trick.

This has solved the issue. You have been a huge help @TimelessUsername.

@OnlineAnalytics I'm tagging you so that you can see the resolution, and hoping that you can see that this is how most issues get resolved in open-source projects like this one. The project relies on the whole community to resolve complicated issues, not just the author. This is how the technology improves, and now that we have found a workaround, @probberechts can spend his valuable time patching instead of testing.

Thanks all. Closing this now. :)