probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

Whoscored: downloading games problem #631

Closed by Gibranium 4 days ago

Gibranium commented 5 days ago

I've tried a couple of custom leagues, specifically BEL-Jupiler Pro League and USA-Major League Soccer.

The Jupiler Pro League works fine, but Major League Soccer fails:

import soccerdata as sd

ws = sd.WhoScored("USA-Major League Soccer", "2021", headless=False, no_cache=False)
leagues = ws.read_leagues()
df = ws.read_events()
[07/06/24 12:39:29] INFO     Retrieving calendar for USA-Major League Soccer 2020 (Major League Soccer Playoff)          whoscored.py:363
                    INFO     [1/2] Retrieving fixtures for USA-Major League Soccer 2020 (Major League Soccer Playoff)    whoscored.py:391
                    INFO     [2/2] Retrieving fixtures for USA-Major League Soccer 2020 (Major League Soccer Playoff)    whoscored.py:391
                    INFO     Retrieving calendar for USA-Major League Soccer 2020 (Major League Soccer)                  whoscored.py:363
                    INFO     [1/2] Retrieving fixtures for USA-Major League Soccer 2020 (Major League Soccer)            whoscored.py:391
                    INFO     [2/2] Retrieving fixtures for USA-Major League Soccer 2020 (Major League Soccer)            whoscored.py:391
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/Soccerdata/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File index.pyx:153, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:182, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'game_id'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 df = ws.read_events()

File ~/anaconda3/envs/Soccerdata/lib/python3.11/site-packages/soccerdata/whoscored.py:690, in WhoScored.read_events(self, match_id, force_cache, live, output_fmt, retry_missing, on_error)
    688 team_names = {}
    689 for i, (_, game) in enumerate(iterator.iterrows()):
--> 690     url = urlmask.format(game["game_id"])
    691     # get league and season
    692     logger.info(
    693         "[%s/%s] Retrieving game with id=%s",
    694         i + 1,
    695         len(iterator),
    696         game["game_id"],
    697     )

File ~/anaconda3/envs/Soccerdata/lib/python3.11/site-packages/pandas/core/series.py:1111, in Series.__getitem__(self, key)
   1108     return self._values[key]
   1110 elif key_is_scalar:
-> 1111     return self._get_value(key)
   1113 # Convert generator to list before going through hashable part
   1114 # (We will iterate through the generator there to check for slices)
   1115 if is_iterator(key):

File ~/anaconda3/envs/Soccerdata/lib/python3.11/site-packages/pandas/core/series.py:1227, in Series._get_value(self, label, takeable)
   1224     return self._values[label]
   1226 # Similar to Index.get_value, but we do not fall back to positional
-> 1227 loc = self.index.get_loc(label)
   1229 if is_integer(loc):
   1230     return self._values[loc]

File ~/anaconda3/envs/Soccerdata/lib/python3.11/site-packages/pandas/core/indexes/base.py:3809, in Index.get_loc(self, key)
   3804     if isinstance(casted_key, slice) or (
   3805         isinstance(casted_key, abc.Iterable)
   3806         and any(isinstance(x, slice) for x in casted_key)
   3807     ):
   3808         raise InvalidIndexError(key)
-> 3809     raise KeyError(key) from err
   3810 except TypeError:
   3811     # If we have a listlike key, _check_indexing_error will raise
   3812     #  InvalidIndexError. Otherwise we fall through and re-raise
   3813     #  the TypeError.
   3814     self._check_indexing_error(key)

KeyError: 'game_id'

I haven't tried all the leagues, so I can't say which ones have this problem.
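
The traceback above shows a row lookup of "game_id" failing, which is what pandas does when the table being iterated has no column with that name. The following is a minimal sketch of that failure mode and of a guard against it, not soccerdata's own code. The helper `missing_columns` and the `stale` and `fresh` frames are hypothetical and exist only for illustration; the column name "game_id" is the only detail taken from the traceback.

```python
import pandas as pd

# Column name taken from the traceback; everything else here is illustrative.
REQUIRED_COLUMNS = {"game_id"}


def missing_columns(schedule: pd.DataFrame, required=REQUIRED_COLUMNS) -> set:
    """Return the required column names that are absent from the frame."""
    return set(required) - set(schedule.columns)


# Hypothetical frames: one without the expected column, one with it.
stale = pd.DataFrame({"id": [101, 102]})
fresh = pd.DataFrame({"game_id": [101, 102]})

# Looking up a missing label on a row raises KeyError, as in the traceback.
try:
    for _, game in stale.iterrows():
        game["game_id"]
except KeyError as err:
    print("row lookup failed for column:", err)

# Checking the columns first turns the failure into an explicit message.
for name, frame in [("stale", stale), ("fresh", fresh)]:
    print(name, "is missing:", sorted(missing_columns(frame)))
```

A check like this only reports that the column is absent; it does not say why the table has a different layout.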

probberechts commented 5 days ago

Can you first check whether it works with the cache disabled?

ws = sd.WhoScored("USA-Major League Soccer", "2021", no_cache=True)
df = ws.read_events()

Gibranium commented 4 days ago

Well, it worked, but it's downloading both the 2020 and 2021 Major League Soccer seasons and labelling them both as the 2020 season.

Can I also ask why removing headless and disabling the cache made it run?

probberechts commented 4 days ago

Setting no_cache=True works because you probably had some old files with a different structure in your cache. Removing headless had no effect: headless=False is the default.
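
If old cache files are the suspect, one way to look for them is to list cached files by age before deciding whether to bypass the cache. This is a rough sketch using only the standard library. The helper `files_older_than` is hypothetical, and the cache location `~/soccerdata/data/WhoScored` is an assumption about the default data directory, so adjust the path to wherever your installation stores its files.

```python
import time
from pathlib import Path


def files_older_than(cache_dir: Path, max_age_days: float) -> list[Path]:
    """Return files under cache_dir last modified more than max_age_days ago."""
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        path
        for path in cache_dir.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Assumed default location, not confirmed in this thread.
    cache_dir = Path.home() / "soccerdata" / "data" / "WhoScored"
    if cache_dir.is_dir():
        for path in files_older_than(cache_dir, max_age_days=30):
            print(path)
    else:
        print("No cache directory found at", cache_dir)
```

The script only lists files and deletes nothing. File age is a proxy: an old file is not necessarily in an outdated format.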

I don't know why it downloads both seasons, but the MLS isn't supported anyway.