probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

[Whoscored] Broken read_schedule method #596

Closed joaomcalves closed 1 month ago

joaomcalves commented 1 month ago

First of all, congrats on this awesome repo!

I have been using the WhoScored scraper without problems for the last few months, but in the last few days I have been having issues when scraping this year's data.

For example, if I run `ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2223); epl_schedule = ws.read_schedule()` it works well. But if I run `ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2324); epl_schedule = ws.read_schedule()`

I get this error: `TimeoutException Traceback (most recent call last)
Cell In[16], line 1
----> 1 epl_schedule = ws.read_schedule()
      2 epl_schedule

File ~/Desktop/football/football_analytics/venv/lib/python3.9/site-packages/soccerdata/whoscored.py:390, in WhoScored.read_schedule(self, force_cache)
    387 self._driver.get(url)
    389 # Check if season consists of multiple stages
--> 390 stages = self._parse_season_stages()
    392 # Handle a multi-stage season
    393 if len(stages) > 0:

File ~/Desktop/football/football_analytics/venv/lib/python3.9/site-packages/soccerdata/whoscored.py:282, in WhoScored._parse_season_stages(self)
    278 def _parse_season_stages(self) -> List[Dict]:
    279     match_selector = (
    280         "//div[contains(@id,'tournament-fixture')]//div[contains(@class,'divtable-row')]"
    281     )
--> 282     WebDriverWait(self._driver, 30, poll_frequency=1).until(
    283         ec.presence_of_element_located((By.XPATH, match_selector))
    284     )
    285     node_stages_selector = "//select[contains(@id,'stages')]/option"
    286     node_stages = self._driver.find_elements(By.XPATH, node_stages_selector)

File ~/Desktop/football/football_analytics/venv/lib/python3.9/site-packages/selenium/webdriver/support/wait.py:105, in WebDriverWait.until(self, method, message)
    103 if time.monotonic() > end_time:
    104     break
--> 105 raise TimeoutException(message, screen, stacktrace)

TimeoutException: Message: Stacktrace: 0 undetected_chromedriver 0x00000001008d66c8 undetected_chromedriver + 6149832 1 undetected_chromedriver 0x00000001008cdcea undetected_chromedriver + 6114538 2 undetected_chromedriver 0x000000010035ad5c undetected_chromedriver + 400732 3 undetected_chromedriver 0x00000001003a7aa5 undetected_chromedriver + 715429 4 undetected_chromedriver 0x00000001003a7bf1 undetected_chromedriver + 715761 5 undetected_chromedriver 0x00000001003ecdd4 undetected_chromedriver + 998868 6 undetected_chromedriver 0x00000001003cacdd undetected_chromedriver + 859357 7 undetected_chromedriver 0x00000001003ea0db undetected_chromedriver + 987355 8 undetected_chromedriver 0x00000001003caa53 undetected_chromedriver + 858707 9 undetected_chromedriver 0x000000010039a6d5 undetected_chromedriver + 661205 10 undetected_chromedriver 0x000000010039af6e undetected_chromedriver + 663406 11 undetected_chromedriver 0x0000000100897d00 undetected_chromedriver + 5893376 12 undetected_chromedriver 0x000000010089d4cc undetected_chromedriver + 5915852 13 undetected_chromedriver 0x00000001008798c4 undetected_chromedriver + 5769412 14 undetected_chromedriver 0x000000010089df99 undetected_chromedriver + 5918617 15 undetected_chromedriver 0x000000010086aed4 undetected_chromedriver + 5709524 16 undetected_chromedriver 0x00000001008be018 undetected_chromedriver + 6049816 17 undetected_chromedriver 0x00000001008be1d7 undetected_chromedriver + 6050263 18 undetected_chromedriver 0x00000001008cd89e undetected_chromedriver + 6113438 19 libsystem_pthread.dylib 0x00007ff80bb171d3 _pthread_start + 125 20 libsystem_pthread.dylib 0x00007ff80bb12bd3 thread_start + 15`

Any idea of how I can solve this issue? Thanks!
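
For readers following the traceback: the failure comes from a polling wait that gives up after 30 seconds because no element matching the fixture-row XPath ever shows up on the page. The sketch below mimics that poll-until-timeout pattern in plain Python so it runs without a browser. `wait_until`, `WaitTimeout`, and `rows_never_appear` are hypothetical names for illustration only; this is not Selenium's actual code.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(condition, timeout=30.0, poll_frequency=1.0):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    end_time = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() > end_time:
            break
        time.sleep(poll_frequency)
    raise WaitTimeout(f"condition not met within {timeout} seconds")


# A page that never renders the expected rows behaves like a condition
# that always returns a falsy value, so the wait ends in a timeout.
def rows_never_appear():
    return []


try:
    wait_until(rows_never_appear, timeout=0.2, poll_frequency=0.05)
    outcome = "found"
except WaitTimeout:
    outcome = "timed out"
```

If the page layout changes so that the selector no longer matches anything, raising the timeout would not help under this model; the selector itself has to match the new markup.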

probberechts commented 1 month ago

Most likely this is related to #581. For previous seasons, the schedule is probably retrieved from the cache.
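
A rough illustration of why a past season can still work while the current one fails, assuming a cache-first lookup where past seasons are served from disk and only the current season hits the live page. Everything here (`read_schedule`, `broken_fetch`, the file naming) is hypothetical and simplified; it is not soccerdata's actual implementation.

```python
import tempfile
from pathlib import Path


def read_schedule(season, current_season, cache_dir, fetch):
    """Return schedule text for `season`, preferring the cache for past seasons."""
    cache_file = Path(cache_dir) / f"schedule_{season}.txt"
    is_current = season == current_season
    # Past seasons no longer change, so a cached copy is reused if present.
    if not is_current and cache_file.exists():
        return cache_file.read_text()
    # Otherwise the live page is needed; this is the step that can time out.
    data = fetch(season)
    cache_file.write_text(data)
    return data


def broken_fetch(season):
    # Simulates a site change that makes the scraper's wait time out.
    raise TimeoutError(f"could not load schedule page for {season}")


with tempfile.TemporaryDirectory() as cache_dir:
    # Season 2223 was scraped earlier, so a cache file already exists.
    (Path(cache_dir) / "schedule_2223.txt").write_text("cached 2223 schedule")

    past = read_schedule("2223", "2324", cache_dir, broken_fetch)
    try:
        read_schedule("2324", "2324", cache_dir, broken_fetch)
        current = "ok"
    except TimeoutError:
        current = "timeout"
```

Under this assumption, a working 2223 call says nothing about whether the live scraper still works, since it never touches the website.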

joaomcalves commented 1 month ago

Oh, thanks @probberechts! That was a fast response, haha. I will test the new version.