oseymour / ScraperFC

Python package for scraping soccer data from a variety of sources
GNU General Public License v3.0
222 stars, 49 forks

Problem with Understat's 2022 EPL Data Scraping #1

Closed hedonistrh closed 2 years ago

hedonistrh commented 2 years ago

Hey, thanks for this great repo. 🙇 It is really helpful. I was experimenting with the Understat scraper and hit an exception when I ran the following:

import ScraperFC as sfc
scraper_understat = sfc.Understat()
scraper_understat.scrape_situations(year=2022, league="EPL")

However, if we replace the last line with the following:

scraper_understat.scrape_situations(year=2021, league="EPL")

it works without any issue.

I found what triggers the problem, but I am not sure about the exact issue. 👨‍💻 First, here is the full error message:

====== WebDriver manager ======
Current google-chrome version is 98.0.4758
Get LATEST chromedriver version for 98.0.4758 google-chrome
Driver [/Users/herdogan/.wdm/drivers/chromedriver/mac64/98.0.4758.102/chromedriver] found in cache
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-0f69f8f96556> in <module>
      1 import ScraperFC as sfc
      2 scraper_understat = sfc.Understat()
----> 3 scraper_understat.scrape_situations(year=2022, league="EPL")

/usr/local/lib/python3.9/site-packages/ScraperFC/Understat.py in scrape_situations(self, year, league)
    328             # append row
    329             situations = situations.append(
--> 330                 pd.DataFrame(row.reshape(1,-1), columns=situations.columns),
    331                 ignore_index=True
    332             )

/usr/local/lib/python3.9/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    495                 mgr = init_dict({data.name: data}, index, columns, dtype=dtype)
    496             else:
--> 497                 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    498 
    499         # For data is list-like, or Iterable (will consume into list)

/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    232         block_values = [values]
    233 
--> 234     return create_block_manager_from_blocks(block_values, [columns, index])
    235 
    236 

/usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1672                 ]
   1673 
-> 1674         mgr = BlockManager(blocks, axes)
   1675         mgr._consolidate_inplace()
   1676         return mgr

/usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    147 
    148         if do_integrity_check:
--> 149             self._verify_integrity()
    150 
    151         # Populate known_consolidate, blknos, and blklocs lazily

/usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    329                 raise construction_error(tot_items, block.shape[1:], self.axes)
    330         if len(self.items) != tot_items:
--> 331             raise AssertionError(
    332                 "Number of manager items must equal union of "
    333                 f"block items\n# manager items: {len(self.items)}, # "

AssertionError: Number of manager items must equal union of block items
# manager items: 46, # tot_items: 37

I checked the team links used for this scrape, and they look correct:

['https://understat.com/team/Aston_Villa/2021', 'https://understat.com/team/Southampton/2021', 'https://understat.com/team/Burnley/2021', 'https://understat.com/team/Liverpool/2021', 'https://understat.com/team/Brighton/2021', 'https://understat.com/team/Leeds/2021', 'https://understat.com/team/Leicester/2021', 'https://understat.com/team/Wolverhampton_Wanderers/2021', 'https://understat.com/team/Crystal_Palace/2021', 'https://understat.com/team/Norwich/2021', 'https://understat.com/team/Watford/2021', 'https://understat.com/team/Manchester_City/2021', 'https://understat.com/team/Arsenal/2021', 'https://understat.com/team/Tottenham/2021', 'https://understat.com/team/Everton/2021', 'https://understat.com/team/Newcastle_United/2021', 'https://understat.com/team/Brentford/2021', 'https://understat.com/team/Manchester_United/2021', 'https://understat.com/team/West_Ham/2021', 'https://understat.com/team/Chelsea/2021']

Then I realized that one of them always fails, and the failing one is Burnley: the error occurs when the link taken from team_links is the following:

https://understat.com/team/Burnley/2021

Checking that link on Understat shows that there is no Penalty row in the Situation section at all (Burnley's Details). So the error comes from the website not presenting every team's page in the same way. Handling for this case could perhaps be added to the scraper. I do not have enough time to do that right now, but I am sharing the details in case they help.
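As a rough illustration of the handling mentioned above, here is a minimal sketch. It assumes the Situation table has five situations with nine stats each, plus one team column; that layout is an assumption, chosen because it matches the 46 and 37 in the error message (1 + 5×9 and 1 + 4×9). The names SITUATIONS, STATS, COLUMNS and build_row are hypothetical and are not taken from ScraperFC. The idea is to build each row against a fixed column layout and fill any situation the page does not list with NaN, so the row width never depends on the page.

```python
import math

# Assumed layout, for illustration only; these names are not ScraperFC's.
SITUATIONS = ["OpenPlay", "FromCorner", "SetPiece", "DirectFreekick", "Penalty"]
STATS = ["Sh", "G", "ShA", "GA", "xG", "xGA", "xGD", "xG/Sh", "xGA/Sh"]
COLUMNS = ["Team"] + [f"{sit} {stat}" for sit in SITUATIONS for stat in STATS]


def build_row(team, scraped):
    """Return one value per entry in COLUMNS.

    `scraped` maps a situation name to a {stat: value} dict and may omit
    situations that the team page does not list; those become NaN.
    """
    row = [team]
    for sit in SITUATIONS:
        stats = scraped.get(sit, {})
        row.extend(stats.get(stat, math.nan) for stat in STATS)
    return row


# Placeholder values (all zeros), not real Understat numbers.
full = {sit: {stat: 0.0 for stat in STATS} for sit in SITUATIONS}
no_penalty = {sit: vals for sit, vals in full.items() if sit != "Penalty"}

# Width of a row built only from what the page lists (the failing case).
naive_width = 1 + sum(len(vals) for vals in no_penalty.values())

row = build_row("Burnley", no_penalty)
print(naive_width, len(row), len(COLUMNS))
```

Rows built this way could be collected in a list and passed to pd.DataFrame(rows, columns=COLUMNS) in one go, which also avoids appending to the DataFrame inside the loop.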

Thanks. 💯

oseymour commented 2 years ago

Hello! Thanks for opening ScraperFC's first ever issue on GitHub! So since Burnley conceded a penalty this weekend, I'm no longer able to replicate this issue. I'll try and put a fix together to hopefully avoid this issue in the future. Thank you!

oseymour commented 2 years ago

I just pushed a fix for this issue. If nobody runs into problems in the next couple of weeks, I'll close it.

hedonistrh commented 2 years ago

Thanks a lot. That was also a nice coincidence. ⚽ I am leaving the issue open, since you mentioned waiting a bit before closing it. 👍

oseymour commented 2 years ago

Closed, as no further incidents have been reported.