[FBref] 403 error when downloading data

probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.

https://soccerdata.readthedocs.io/en/latest/

[FBref] 403 error when downloading data #59

Status: Closed (koenklomps closed this issue 2 years ago)

koenklomps commented 2 years ago

Which Python version are you using?

Python 3.8.13

Which version of soccerdata are you using?

1.0.1

What did you do?

import soccerdata as sd

fbref = sd.FBref(leagues="NED-Eredivisie", seasons="2021-2022", proxy='tor')
team_season_stats = fbref.read_schedule()

What did you expect to see?

Downloaded team stats

What did you see instead?

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://fbref.com/en/comps/

probberechts commented 2 years ago

Removing the "user-agent" header seems to fix it. You can remove the following line:

https://github.com/probberechts/soccerdata/blob/50f6fef099761a9fca692dbebb96459fba8b393b/soccerdata/_common.py#L327

However, I do not understand why this causes trouble.
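The linked line is not reproduced in this thread, so the following is only a rough illustration of what "removing the user-agent header" amounts to on the client side. The helper name `without_user_agent` and the example header values are hypothetical and are not taken from soccerdata.

```python
from typing import Dict


def without_user_agent(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `headers` with any User-Agent entry removed.

    Header names are compared case-insensitively, because HTTP header
    names are not case-sensitive.
    """
    return {k: v for k, v in headers.items() if k.lower() != "user-agent"}


# Hypothetical header set; the values soccerdata actually sends are not
# shown in this thread.
headers = {"User-Agent": "example-agent/1.0", "Accept": "text/html"}
print(without_user_agent(headers))
```

Note that many HTTP clients fall back to their own default User-Agent when none is supplied, so dropping the custom value usually changes the header rather than sending none at all.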

koenklomps commented 2 years ago

I tried deleting that line, but it still didn't work. However, after messing around a little bit more it started working, even with the user-agent line included. It seems to work randomly: sometimes it succeeds, but other times it throws a 403 or 429 error.

frogman141 commented 2 years ago

One potential cause of the issue is the new bot scraping rules for FBref. They've started to ban anyone scraping the website at a rate faster than 1 request per 3 seconds.

If you look into the _common.py code, you can see that the rate limit and max delay parameters are set to 0 and are currently inaccessible.

probberechts commented 2 years ago

Indeed, you get a "429 Client Error: Too Many Requests for URL" error if you scrape too fast. Originally the rate limit was set to 1 request per 2 seconds, but it seems they've changed that now to 1 request per 3 seconds. This is actually implemented in fbref.py which overrides the default of "no rate limiting" in _common.py.
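As a minimal sketch of the "1 request per 3 seconds" idea, a client can enforce a minimum interval between consecutive requests. This is not the code in fbref.py or _common.py; the class name `MinIntervalLimiter` and its parameters are hypothetical, and the 3-second default is simply the figure quoted in this thread.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensure consecutive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 3.0,  # assumption: the limit quoted in this thread
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record the time after any sleep, so the next interval starts here.
        self._last_call = self._clock()
        return slept
```

Calling `limiter.wait()` immediately before each request would then space the requests out. The clock and sleep functions are injectable only so the behaviour can be checked without real delays.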

The 403 error is a different issue and I am still convinced that it is caused by the user agent headers. I'll create a pull request in a few minutes and it would be great if you could check whether that solves your issues.
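Because the two status codes are attributed to different causes here, a client could reasonably handle them differently: wait and retry on 429, but fail immediately on 403, since waiting would not help if the headers are the problem. The sketch below assumes a hypothetical `fetch` callable that returns a `(status_code, body)` tuple; the function name, the linear backoff, and the exception types are illustrative choices, not soccerdata's behaviour.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],  # hypothetical: returns (status, body)
    url: str,
    max_attempts: int = 3,
    delay: float = 3.0,  # assumption: base wait taken from the limit quoted above
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry on 429 with a growing delay; fail immediately on 403."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 429 and attempt < max_attempts:
            # Too many requests: back off for longer on each attempt.
            sleep(delay * attempt)
            continue
        if status == 403:
            # Forbidden: retrying is unlikely to help; inspect the request headers.
            raise PermissionError(f"HTTP 403 for {url}")
        raise RuntimeError(f"HTTP {status} for {url}")
    raise RuntimeError("max_attempts must be at least 1")
```

Passing the fetch function in as an argument keeps the retry logic independent of any particular HTTP library.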

frogman141 commented 2 years ago

Hey, quick update. I tried changing the rate_limit to 3 seconds or more, and unfortunately the same error occurred.

probberechts commented 2 years ago

> Hey, quick update. I tried changing the rate_limit to 3 seconds or more, and unfortunately the same error occurred.

Which error are you talking about now? The 403 or the 429 error?

Did you try removing the user agent headers?

frogman141 commented 2 years ago

So the code works now. The quick update above was from me fiddling with the code. I just noticed your hotfix, tried it, and it works fine now. Sorry for the confusion.

probberechts commented 2 years ago

No problem. Thanks for checking!

probberechts commented 2 years ago

Should be fixed in v1.0.2 🚀