Scrape Player Projection Data from Fangraphs

jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)

MIT License


Scrape Player Projection Data from Fangraphs #335

Open TK2575 opened 1 year ago

TK2575 commented 1 year ago

Introduces a function, with related tests and documentation, that captures player projection data from Fangraphs. It provides arguments to specify the projection source, position, league, and team. It also extends the teamid lookup method with a Fangraphs team ID lookup, backed by a stored dictionary fixture, which is needed for team-level filtering.

blacktj commented 1 year ago

Sorry, I commented inline rather than on the PR:

I think you're doing too much work here. There's an API: pitch_df = pd.DataFrame(json.loads(requests.get('https://www.fangraphs.com/api/projections?stats=pit&type=steamer').content))
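For illustration, the one-liner above could be split into testable pieces. This is a minimal sketch using only the standard library, not the approach taken in this PR. The helper names (`build_projections_url`, `parse_projections`, `fetch_projections`) and the `User-Agent` header are hypothetical. It assumes the endpoint returns a JSON array of row objects, which the `pd.DataFrame(json.loads(...))` call suggests but the thread does not confirm. Only `stats=pit` and `type=steamer` appear in the thread; any other parameter values would be guesses.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the comment above; it is non-public and may change.
BASE_URL = "https://www.fangraphs.com/api/projections"


def build_projections_url(stats="pit", projection_type="steamer"):
    """Build the query URL (hypothetical helper; defaults come from the thread)."""
    query = urlencode({"stats": stats, "type": projection_type})
    return f"{BASE_URL}?{query}"


def parse_projections(payload):
    """Decode the response body, assuming it is a JSON array of row objects."""
    rows = json.loads(payload)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of projection rows")
    return rows


def fetch_projections(stats="pit", projection_type="steamer", timeout=30):
    """Fetch and decode projections. The User-Agent value is an assumption."""
    request = Request(
        build_projections_url(stats, projection_type),
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_projections(response.read())
```

If the assumption about the response shape holds, the returned list of dicts can be passed straight to `pd.DataFrame(rows)`.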

TK2575 commented 1 year ago

Thanks @blacktj, I didn't know that API endpoint existed in front of a paywall. That's great! I'm assuming there's some rate limit we'll need to respect, like we do with Baseball Reference? I'll need to dig into this a bit.
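If a rate limit does turn out to apply, one simple way to respect it is to enforce a minimum interval between requests. The sketch below is a hypothetical helper, not the mechanism this repo uses for Baseball Reference. The class name and the interval value are assumptions, since the thread does not establish what limit, if any, Fangraphs enforces.

```python
import time


class MinIntervalThrottle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep only for the remainder of the interval, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` immediately before each request would space the requests out. For example, `MinIntervalThrottle(2.0)` would allow at most one request every two seconds, a value chosen here only for illustration.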

blacktj commented 1 year ago

It's non-public and buried in the client-side rendering of the table. I'm working on a PR for the prospects endpoint as well. I'm not sure whether it's rate limited, though; it's wide open. The risk I see is that they could lock it down.

TK2575 commented 1 year ago

I think I'll need to defer to this repo's maintainers as to which approach to take. There's precedent for scraping Fangraphs page source in other methods, though I don't know whether that's because a) we weren't aware of the API at the time or b) the API didn't/doesn't support those data. Querying the API would certainly be cleaner, but I'd be hesitant to move forward with a non-public API without some form of developer contract and/or buy-in from this repo's maintainers.

blacktj commented 1 year ago

This is a web scraping repo, so I'm guessing we don't have a contract to pull the data from their actual website? Is there a difference between grabbing it there and grabbing it from the API they use to render the table?