oseymour / ScraperFC

Python package for scraping soccer data from a variety of sources
GNU General Public License v3.0

FBRef Scraping Match for MLS #8

Closed hedonistrh closed 1 year ago

hedonistrh commented 1 year ago

Hey, I realized that match scraping for MLS has an issue. It comes from the fact that the MLS league name (Major League Soccer) has 3 words in it, but in this code block we only check for one-word or two-word league names, so MLS matches fail. (BTW, thanks a lot for the great comments. 👏)

To solve that, one more try-except could be added. It is probably not the cleanest way, but I am not sure whether there is a cleaner way to solve it. 🤔

To reproduce the issue:

import ScraperFC as sfc
scraper_fbref = sfc.FBRef()
year = 2021
league_name = "MLS"
example = scraper_fbref.scrape_match(link='https://fbref.com/en/matches/920ed404/Los-Angeles-FC-Minnesota-United-July-28-2021-Major-League-Soccer', year=year, league=league_name)

I did the following to solve that issue:

        # Get date of the match
        try:
            # Try this first. Assumes league name is one word
            date_elements = link.split("/")[-1].split("-")[-4:-1]
            date = '-'.join(date_elements)
            date = datetime.datetime.strptime(date,'%B-%d-%Y').date()
        except ValueError:
            try:
                # Assumes league name is two words
                date_elements = link.split("/")[-1].split("-")[-5:-2]
                date = '-'.join(date_elements)
                date = datetime.datetime.strptime(date,'%B-%d-%Y').date()
                except ValueError:
                # Assumes league name is three words
                date_elements = link.split("/")[-1].split("-")[-6:-3]
                date = '-'.join(date_elements)
                date = datetime.datetime.strptime(date,'%B-%d-%Y').date()
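
As an alternative to stacking try-except blocks, the date could be located directly in the link with a regular expression, so the number of words in the league name no longer matters. This is only a minimal sketch, not ScraperFC's actual code: the helper name `extract_match_date` is hypothetical, and it assumes every match link contains the date as an English month name, a day, and a four-digit year separated by hyphens, as in the MLS link above.

```python
import datetime
import re

# Matches "<MonthName>-<day>-<year>" anywhere in the slug, e.g. "July-28-2021".
# Assumption: FBRef match links always spell the month out in English.
DATE_PATTERN = re.compile(
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)-(\d{1,2})-(\d{4})"
)


def extract_match_date(link):
    """Hypothetical helper: return the match date embedded in an FBRef link."""
    # The last path segment holds the teams, the date, and the league name.
    slug = link.rstrip("/").split("/")[-1]
    match = DATE_PATTERN.search(slug)
    if match is None:
        raise ValueError(f"No date found in link: {link}")
    return datetime.datetime.strptime(match.group(0), "%B-%d-%Y").date()


link = (
    "https://fbref.com/en/matches/920ed404/"
    "Los-Angeles-FC-Minnesota-United-July-28-2021-Major-League-Soccer"
)
print(extract_match_date(link))
```

With this approach a league name of any length needs no extra branch, at the cost of relying on the month name appearing only once in the link.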

However, this time there is another issue, with the matchweek part. It comes from how FBRef enters matchweek data. For instance, if you check the following links:

We can see that one has Matchweek 1 as its data, while the other has Regular Season. That comes from how MLS is structured. As it has playoffs etc., these kinds of corner cases are not so surprising 😞

To solve that issue, we can use the following. It is not clean at all, but as FBRef does not provide round information for MLS, we can either skip it entirely or keep the information about whether it is regular season or playoff.

        if league != "MLS":
            matchweek = int(
                dom.xpath('//*[@id="content"]/div[2]/div[3]/div[2]/text()')[0]\
                    .split('Matchweek')[-1]\
                    .replace(')','')\
                    .strip()
            )
        else:
            matchweek = dom.xpath('//*[@id="content"]/div[2]/div[3]/div[2]/text()')[0].replace(')','').replace('(', '').strip()
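
The branching above could also be driven by the text itself rather than by the league name. The sketch below is only an illustration under assumptions: the helper name `parse_round` is hypothetical, and the input strings are assumed to look like " (Matchweek 1)" or " (Regular Season)", which is what the cleanup calls in the snippet above suggest.

```python
import re


def parse_round(text):
    """Hypothetical helper: return the matchweek as an int when the text
    contains one, otherwise the cleaned round label as a string."""
    match = re.search(r"Matchweek\s+(\d+)", text)
    if match:
        return int(match.group(1))
    # No matchweek number (e.g. MLS): keep the label without parentheses.
    return text.replace("(", "").replace(")", "").strip()


# Assumed input shapes, based on the description in this issue.
print(parse_round(" (Matchweek 1)"))
print(parse_round(" (Regular Season)"))
```

Callers would then need to handle both an `int` and a `str` return value, which is the same trade-off the `league != "MLS"` version makes.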