probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

[WhoScored] Failure to parse date and time for schedule #66

Closed giochi99 closed 2 years ago

giochi99 commented 2 years ago

Which Python version are you using?

Python 3.10.5

Which version of soccerdata are you using?

1.0.3

What did you do?

import soccerdata as sd

ws = sd.WhoScored(leagues="ITA-Serie A", seasons='21-22', proxy='tor', headless=False)

seriea_2122_schedule = ws.read_schedule()
seriea_2122_schedule.head()

What did you expect to see?

Download schedule data

What did you see instead?

[07/20/22 21:39:51] INFO     Saving cached data to /home/giochi99/soccerdata/data/WhoScored    _common.py:89

[07/20/22 21:39:52] INFO     patching driver executable    patcher.py:231
                             /home/giochi99/.local/share/undetected_chromedriver/aa95ea2fc3bf32fc_chromedriver

[07/20/22 21:41:36] INFO     Scraping game schedule from    whoscored.py:325
                             https://www.whoscored.com/Regions/108/Tournaments/5/Seasons/8735/Stages/19982/Fixtures/Italy-Serie-A-2021-2022

[07/20/22 21:41:42] INFO     Scraping game schedule for Sunday, May 1 2022    whoscored.py:239

[07/20/22 21:41:43] INFO     Scraping game schedule for Monday, May 2 2022    whoscored.py:239

                    INFO     Scraping game schedule for Thursday, May 5 2022    whoscored.py:239

                    INFO     Scraping game schedule for Friday, May 6 2022    whoscored.py:239

                    INFO     Scraping game schedule for Saturday, May 7 2022    whoscored.py:239

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 3>()
      1 ws = sd.WhoScored(leagues="ITA-Serie A", seasons='21-22', proxy='tor', headless=False)
----> 3 seriea_2122_schedule = ws.read_schedule()
      4 seriea_2122_schedule.head()

File ~/.local/lib/python3.10/site-packages/soccerdata/whoscored.py:326, in WhoScored.read_schedule(self, force_cache)
    324         self._driver.get(url)
    325     logger.info("Scraping game schedule from %s", url)
--> 326     schedule.extend(self._parse_schedule())
    327 df_schedule = pd.DataFrame(schedule).assign(league=lkey, season=skey)
    328 if not self.no_store:

File ~/.local/lib/python3.10/site-packages/soccerdata/whoscored.py:253, in WhoScored._parse_schedule(self, stage)
    251 schedule = []
    252 # Parse first page
--> 253 page_schedule, next_page = self._parse_schedule_page()
    254 schedule.extend(page_schedule)
    255 # Go to next page

File ~/.local/lib/python3.10/site-packages/soccerdata/whoscored.py:213, in WhoScored._parse_schedule_page(self)
    209 if node.get_attribute("data-id"):
    210     time_str = node.find_element(By.XPATH, "./div[contains(@class,'time')]").text
    211     schedule_page.append(
    212         {
--> 213             "date": datetime.strptime(f"{date_str} {time_str}", "%A, %b %d %Y %H:%M"),
    214             "home_team": node.find_element(
    215                 By.XPATH, "./div[contains(@class,'team home')]//a"
    216             ).text,
    217             "away_team": node.find_element(
    218                 By.XPATH, "./div[contains(@class,'team away')]//a"
    219             ).text,
    220             # fmt: off
    221             "game_id": int(
    222                 re.search(
    223                     r"Matches/(\d+)/",
    224                     node.find_element(
    225                         By.XPATH,
    226                         "./div[contains(@class,'result')]//a"
    227                     ).get_attribute("href")).group(1)  # type: ignore
    228             ),
    229             # fmt: on
    230             "url": node.find_element(
    231                 By.XPATH, "./div[contains(@class,'result')]//a"
    232             ).get_attribute("href"),
    233         }
    234     )
    235 else:
    236     date_str = node.find_element(
    237         By.XPATH, "./div[contains(@class,'divtable-header')]"
    238     ).text

File /usr/lib/python3.10/_strptime.py:568, in _strptime_datetime(cls, data_string, format)
    565 def _strptime_datetime(cls, data_string, format="%a %b %d %H:%M:%S %Y"):
    566     """Return a class cls instance based on the input string and the
    567     format string."""
--> 568     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    569     tzname, gmtoff = tt[-2:]
    570     args = tt[:6] + (fraction,)

File /usr/lib/python3.10/_strptime.py:349, in _strptime(data_string, format)
    347 found = format_regex.match(data_string)
    348 if not found:
--> 349     raise ValueError("time data %r does not match format %r" %
    350                      (data_string, format))
    351 if len(data_string) != found.end():
    352     raise ValueError("unconverted data remains: %s" %
    353                       data_string[found.end():])

ValueError: time data 'Saturday, May 7 2022 ' does not match format '%A, %b %d %Y %H:%M'

This error also happened once with the Premier League, but it disappeared on the next attempt. With Serie A it happens every time.
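The failure can be reproduced without a browser. The sketch below reuses the format string from the traceback and shows that an empty time string leaves a trailing space that `strptime` rejects. The helper name `try_parse` is hypothetical and is not part of soccerdata; the example also assumes an English locale, since `%A` and `%b` are locale dependent.

```python
from datetime import datetime

# Format string copied from the traceback (whoscored.py, line 213).
SCHEDULE_FORMAT = "%A, %b %d %Y %H:%M"


def try_parse(date_str, time_str):
    """Hypothetical helper: return a datetime, or None if parsing fails."""
    try:
        return datetime.strptime(f"{date_str} {time_str}", SCHEDULE_FORMAT)
    except ValueError:
        return None


# A row with a kickoff time parses normally.
with_time = try_parse("Saturday, May 7 2022", "20:45")

# A row with an empty time yields 'Saturday, May 7 2022 ', which does not
# match the format because %H:%M has nothing left to consume.
without_time = try_parse("Saturday, May 7 2022", "")
```

Here `with_time` is a `datetime` for 7 May 2022 at 20:45, while `without_time` is `None`, which corresponds to the `ValueError` reported above.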

probberechts commented 2 years ago

The scraper hit a game without a time specified. However, it works fine for me, and time_str seems to be present for each game. Could you dump the HTML of the page when this happens?
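One possible way to capture that HTML is sketched below. It relies only on the `page_source` attribute that Selenium WebDriver objects expose. The function name `dump_page_html` and the output file name are assumptions made for illustration, not something soccerdata provides.

```python
from pathlib import Path


def dump_page_html(driver, path="whoscored_schedule.html"):
    """Write the browser's current page source to a file and return its path.

    `driver` is assumed to be a Selenium WebDriver (anything exposing a
    `page_source` string attribute works).
    """
    out = Path(path)
    out.write_text(driver.page_source, encoding="utf-8")
    return out
```

Assuming the browser window is still open after the exception, this could be called as `dump_page_html(ws._driver)`. The `_driver` attribute appears in the traceback (`self._driver.get(url)`), but it is private and may change between versions.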

If you need a quick fix and do not mind that the time is incorrect, you could set a default time_str on line 211 of whoscored.py, right after time_str is read:

if not time_str:
    time_str = "20:00"
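The same idea can be tried outside the library first. The following is a minimal sketch of the default-time workaround as a standalone function. `parse_kickoff` and its `default_time` parameter are hypothetical names, and the `"20:00"` default is simply the placeholder value suggested above, so the resulting time is not the real kickoff time.

```python
from datetime import datetime

# Format string copied from the traceback (whoscored.py, line 213).
SCHEDULE_FORMAT = "%A, %b %d %Y %H:%M"


def parse_kickoff(date_str, time_str, default_time="20:00"):
    """Hypothetical helper: parse a schedule row, tolerating a missing time.

    When `time_str` is empty or whitespace, `default_time` is substituted,
    so the returned time is a placeholder rather than the actual kickoff.
    """
    time_str = time_str.strip()
    if not time_str:
        time_str = default_time
    return datetime.strptime(f"{date_str.strip()} {time_str}", SCHEDULE_FORMAT)


# The row from the error message now parses, with the placeholder time.
kickoff = parse_kickoff("Saturday, May 7 2022", "")
```

Because the substituted time is made up, rows parsed this way are best treated as date-only until the cause of the missing time is known.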