probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

WhoScored issue #660

Closed Messe57 closed 2 weeks ago

Messe57 commented 1 month ago

Since I found out about this project a few months ago, scraping data has always been very easy, so I am really thankful to whoever is currently working on it. However, I'm now stuck on a problem that I am not able to solve, and I hope that someone can help me with it. I admit that I am a beginner in coding, so it might be very easy to solve, just not with my knowledge. This is my code:

import soccerdata as sd
seasons = ['2122', '2223', '2324']
leagues = ['ENG-Premier League', 'ITA-Serie A']
for season in seasons:
    for league in leagues:
        ws = sd.WhoScored(leagues=league, seasons=season, headless=False, no_cache=True)
        ws._driver.get("https://www.whoscored.com/")
        ws._driver.execute_script("location = 'https://whoscored.com/'")
        leagues = ws.available_leagues()
        print(leagues)
        schedule = ws.read_schedule(force_cache=True)
        epl_matches = ws.read_events(output_fmt='events')

This is the key error that I am receiving:

KeyError                                  Traceback (most recent call last)
Cell In[10], line 11
      9 leagues = ws.available_leagues()
     10 print(leagues)
---> 11 schedule = ws.read_schedule()
     12 epl_matches = ws.read_events(output_fmt='events') #

File c:\Users\filip\AppData\Local\Programs\Python\Python311\Lib\site-packages\soccerdata\whoscored.py:402, in WhoScored.read_schedule(self, force_cache)
    389 def read_schedule(self, force_cache: bool = False) -> pd.DataFrame:  # noqa: C901
    390     """Retrieve the game schedule for the selected leagues and seasons.
    391 
    392     Parameters
   (...)
    400     pd.DataFrame
    401     """
--> 402     df_season_stages = self.read_season_stages(force_cache=force_cache)
    403     filemask_schedule = "matches/{}_{}_{}_{}.json"
    405     all_schedules = []

File c:\Users\filip\AppData\Local\Programs\Python\Python311\Lib\site-packages\soccerdata\whoscored.py:331, in WhoScored.read_season_stages(self, force_cache)
    318 def read_season_stages(self, force_cache: bool = False) -> pd.DataFrame:
    319     """Retrieve the season stages for the selected leagues.
    320 
    321     Parameters
...
-> 6249         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6252     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['ENG-Premier League'], dtype='object', name='league')] are in the [index]"

Thank you in advance.

probberechts commented 1 month ago

Could you try with caching disabled everywhere?

schedule = ws.read_schedule(force_cache=False)

I am also intrigued why you added

ws._driver.get("https://www.whoscored.com/")
ws._driver.execute_script("location = 'https://whoscored.com/'")

Do you experience any problems without doing this?
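
A side note on the loop in the snippet above: the line leagues = ws.available_leagues() rebinds the same name that the inner "for league in leagues" header reads, so from the second season onward the inner loop would walk every available league rather than the two requested. This probably does not explain the KeyError, which already names the first league in the list. A minimal sketch of the effect, using a stand-in available_leagues() function with illustrative league names (an assumption, not the library call):

```python
def available_leagues():
    """Stand-in for ws.available_leagues(); the names are illustrative."""
    return ["ENG-Premier League", "ESP-La Liga", "ITA-Serie A"]


def visited_leagues(shadow):
    """Return the (season, league) pairs the nested loop would visit."""
    seasons = ["2122", "2223"]
    leagues = ["ENG-Premier League", "ITA-Serie A"]
    visited = []
    for season in seasons:
        # The inner loop header re-reads the name `leagues` once per season.
        for league in leagues:
            visited.append((season, league))
            if shadow:
                # Rebinds the loop's source list, as in the snippet above.
                leagues = available_leagues()
            else:
                # A distinct name leaves the requested leagues untouched.
                all_leagues = available_leagues()
    return visited


print(len(visited_leagues(shadow=True)))   # 5
print(len(visited_leagues(shadow=False)))  # 4
```

With shadow=True the second season visits three leagues instead of two; storing the result under a different name avoids this.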

Messe57 commented 1 month ago

I tried what you suggested, but unfortunately it still does not work. I added ws._driver.get("https://www.whoscored.com/") and ws._driver.execute_script("location = 'https://whoscored.com/'") because the driver was opening the site in my native language, which caused problems. I found this workaround in earlier issues, and it had worked perfectly until now.

LoGreHub commented 1 month ago

Hi all,

same kind of issue here.

Code:

import soccerdata as sd
ws = sd.WhoScored(leagues = ['ITA-Serie A'], seasons = ['2122'])
ws.read_schedule()

and traceback:

Traceback (most recent call last)
Cell In[8], line 1
----> 1 ws.read_schedule()

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\soccerdata\whoscored.py:344, in WhoScored.read_schedule(self, force_cache)
    331 def read_schedule(self, force_cache: bool = False) -> pd.DataFrame:
    332     """Retrieve the game schedule for the selected leagues and seasons.
    333 
    334     Parameters
   (...)
    342     pd.DataFrame
    343     """
--> 344     df_season_stages = self.read_season_stages(force_cache=force_cache)
    345     filemask_schedule = "matches/{}_{}_{}_{}.json"
    347     all_schedules = []

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\soccerdata\whoscored.py:274, in WhoScored.read_season_stages(self, force_cache)
    261 def read_season_stages(self, force_cache: bool = False) -> pd.DataFrame:
    262     """Retrieve the season stages for the selected leagues.
    263 
    264     Parameters
   (...)
    272     pd.DataFrame
    273     """
--> 274     df_seasons = self.read_seasons()
    275     filemask = "seasons/{}_{}.html"
    277     season_stages = []

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\soccerdata\whoscored.py:225, in WhoScored.read_seasons(self)
    218 def read_seasons(self) -> pd.DataFrame:
    219     """Retrieve the selected seasons for the selected leagues.
    220 
    221     Returns
    222     -------
    223     pd.DataFrame
    224     """
--> 225     df_leagues = self.read_leagues()
    227     seasons = []
    228     for lkey, league in df_leagues.iterrows():

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\soccerdata\whoscored.py:210, in WhoScored.read_leagues(self)
    199     for league in region["tournaments"]:
    200         leagues.append(
    201             {
    202                 "region_id": region["id"],
   (...)
    206             }
    207         )
    209 return (
--> 210     pd.DataFrame(leagues)
    211     .assign(league=lambda x: x.region + " - " + x.league)
    212     .pipe(self._translate_league)
    213     .set_index("league")
    214     .loc[self._selected_leagues.keys()]
    215     .sort_index()
    216 )

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
   1417     if hasattr(key, "ndim") and key.ndim > 1:
   1418         raise ValueError("Cannot index with multidimensional key")
-> 1420     return self._getitem_iterable(key, axis=axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
   1357 self._validate_key(key, axis)
   1359 # A collection of keys
-> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1361 return self.obj._reindex_with_indexers(
   1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1363 )

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexes\base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexes\base.py:6249, in Index._raise_if_missing(self, key, indexer, axis_name)
   6247 if nmissing:
   6248     if nmissing == len(indexer):
-> 6249         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6252     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['ITA-Serie A'], dtype='object', name='league')] are in the [index]"
LoGreHub commented 1 month ago

I finally read the other issues (apologies for not doing that earlier). My issue above is likely related to being forced to load the Italian version of the website.
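
A minimal sketch of how a localized page could produce this KeyError, assuming the scraped display names are mapped to soccerdata league IDs before the strict lookup that the traceback shows in read_leagues. The mapping entries, the Italian display names, and the helper names translate and select are illustrative assumptions, not the library's actual data or API:

```python
# Illustrative mapping from display names to league IDs (assumed entries).
DISPLAY_TO_ID = {
    "England - Premier League": "ENG-Premier League",
    "Italy - Serie A": "ITA-Serie A",
}


def translate(display_names):
    """Replace known display names by their ID; unknown names pass through."""
    return [DISPLAY_TO_ID.get(name, name) for name in display_names]


def select(index, wanted):
    """Strict lookup: raise KeyError if any wanted label is absent."""
    missing = [label for label in wanted if label not in index]
    if missing:
        raise KeyError(f"{missing} not in index")
    return [label for label in index if label in wanted]


english_page = ["England - Premier League", "Italy - Serie A"]
# Assumed localized names, as an Italian-language page might show them.
italian_page = ["Inghilterra - Premier League", "Italia - Serie A"]

print(select(translate(english_page), ["ITA-Serie A"]))  # ['ITA-Serie A']

try:
    select(translate(italian_page), ["ITA-Serie A"])
except KeyError as err:
    # The localized names never match the mapping, so the ID is not found.
    print("lookup failed:", err)
```

If this is what happens, printing the untranslated league names that the scraper actually receives should show whether they are in English.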

Messe57 commented 1 month ago

Updating my issue... I think it might be a problem with the read_schedule function, because when I ask for the available leagues the code runs perfectly. Furthermore, the chromedriver is able to open the website without any issues. Do you have any other changes to suggest?

AbinThomas10 commented 3 weeks ago

Same issue for me, but only when scraping the Italian Serie A league.

probberechts commented 2 weeks ago

The following works fine for me:

 import soccerdata as sd
 ws = sd.WhoScored(leagues = ['ITA-Serie A'], seasons = ['2122'], no_cache = True)
 ws.read_schedule()

I am closing this since I don't have sufficient information to debug your issue. Feel free to reopen if you can pinpoint the cause.