vishaalagartha / basketball_reference_scraper

A python module for scraping static and dynamic content from Basketball Reference.
MIT License

get_team_stats and other methods not returning anything #95

Open harlanv24 opened 1 year ago

harlanv24 commented 1 year ago

I was gathering data using these methods and all of a sudden they stopped working. They now return None instead of a dataframe. For instance, the following line:

print(get_team_stats('MIA', 2013))

Prints out 'None' in the console. What's going on?

amywinecoff commented 1 year ago

This happens because the functions only parse the response inside an if r.status_code == 200: block. The results dataframe is initialized as None, so any status code other than 200 means the function returns None without raising an error. When I was getting no results, the status code was 429 (Too Many Requests). So even though this library may previously have worked through hours of continuous requests, www.basketball-reference.com has probably added some form of rate limiting since then. You can update the functions to at least report the status code by doing something like the following:

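from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from requests import get

# remove_accents is the library's own helper; its location in the
# utils module is assumed here
from basketball_reference_scraper.utils import remove_accents


def get_roster(team, season_end_year):
    r = get(
        f'https://www.basketball-reference.com/teams/{team}/{season_end_year}.html')
    df = None

    # try/except replaces the original `if r.status_code == 200:` check so a
    # failed request is reported instead of silently returning None
    try:
        soup = BeautifulSoup(r.content, 'html.parser')
        table = soup.find('table')
        # wrap in StringIO: newer pandas deprecates passing literal HTML strings
        df = pd.read_html(StringIO(str(table)))[0]
        df.columns = ['NUMBER', 'PLAYER', 'POS', 'HEIGHT', 'WEIGHT', 'BIRTH_DATE',
                      'NATIONALITY', 'EXPERIENCE', 'COLLEGE']
        # remove rows with no player name (this was the issue above)
        df = df[df['PLAYER'].notna()]
        df['PLAYER'] = df['PLAYER'].apply(
            lambda name: remove_accents(name, team, season_end_year))
        # handle rows with empty fields but with a player name.
        df['BIRTH_DATE'] = df['BIRTH_DATE'].apply(
            lambda x: pd.to_datetime(x) if pd.notna(x) else pd.NaT)
        df['NATIONALITY'] = df['NATIONALITY'].apply(
            lambda x: x.upper() if pd.notna(x) else '')
    except Exception as e:
        # on a 429 the page has no roster table, so parsing fails and we land here
        print(e)
        print(r.status_code)

    return df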
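If you would rather fail loudly than print and carry on, one option is a small check that raises on any non-200 response. This is only a sketch: `require_ok` is a hypothetical helper name, not part of the library, and it assumes nothing about the response beyond a `status_code` attribute like the one `requests` responses expose.

```python
def require_ok(response, url):
    """Return the response unchanged if it is a 200, otherwise raise."""
    if response.status_code == 200:
        return response
    if response.status_code == 429:
        # 429 is "Too Many Requests": the server is rate limiting us
        raise RuntimeError(
            f'Rate limited (429) fetching {url}; slow down and retry later')
    raise RuntimeError(
        f'Unexpected status {response.status_code} fetching {url}')
```

Calling `require_ok(r, url)` right after the `get(...)` call would turn a silent None into an exception that names the status code. `requests` also has a built-in `r.raise_for_status()`, which raises for any 4xx or 5xx response.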

I'm still trying to work out how to rate limit the requests automatically. Will follow up if I do!
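For anyone who wants a starting point, here is a rough sketch of one way to do it: space requests out by a minimum interval, and pause and retry when the server answers 429. `Throttle` and `get_with_backoff` are hypothetical names, not part of the library, and the interval and wait times are assumptions rather than limits confirmed in this thread, so check the site's published policy before picking values. The fetch function is passed in, so the same code works with `requests.get` or with a stub.

```python
import time


class Throttle:
    """Block so consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()


def get_with_backoff(fetch, url, throttle, max_retries=3, default_wait=60.0,
                     sleep=time.sleep):
    """Call fetch(url), pausing and retrying while the server answers 429.

    Returns the last response either way, so the caller can still inspect
    the status code if every attempt was rate limited.
    """
    for attempt in range(max_retries + 1):
        throttle.wait()
        response = fetch(url)
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Retry-After may be missing or an HTTP date; fall back to default_wait
        retry_after = str(response.headers.get('Retry-After', ''))
        wait = float(retry_after) if retry_after.isdigit() else default_wait
        sleep(wait)
```

To use it, create one shared `throttle = Throttle(3.0)` and replace each `get(url)` call with `get_with_backoff(get, url, throttle)`. The 3-second spacing is a guess on my part.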