jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

Fangraphs missing pitching data #214

Status: Closed (closed by michaelmdresser 3 years ago)

michaelmdresser commented 3 years ago

Fangraphs pitching data appears to be missing many players. Here's a code snippet:

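import pybaseball

# Season-level pitching stats from Fangraphs, using the default arguments
pstats = pybaseball.pitching_stats(2021)
print(f"Rows for fangraphs pitching stats for 2021: {len(pstats)}")

# The same season from Baseball Reference, for comparison
pstats_bref = pybaseball.pitching_stats_bref(2021)
print(f"Rows for Baseball Reference pitching stats for 2021: {len(pstats_bref)}")

# Look up Domingo Germán's IDs on each site from his MLBAM ID
player_ids = pybaseball.playerid_reverse_lookup([593334], key_type="mlbam").iloc[0]
print(player_ids)

# Filter the Fangraphs table by his Fangraphs ID
fg_stats = pstats.loc[pstats["IDfg"] == player_ids["key_fangraphs"]]
print(f"Fangraphs rows for pitcher Domingo Germán: {len(fg_stats)}")

# Filter the Baseball Reference table by first name (this can also match other players named Domingo)
bref_stats = pstats_bref.loc[pstats_bref["Name"].str.contains("Domingo")]
print(f"Baseball Reference rows for pitcher Domingo Germán: {len(bref_stats)}")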

And the output I get:

Rows for fangraphs pitching stats for 2021: 66
Rows for Baseball Reference pitching stats for 2021: 646
Gathering player lookup table. This may take a moment.
name_last              german
name_first            domingo
key_mlbam              593334
key_retro            germd001
key_bbref           germado01
key_fangraphs           17149
mlb_played_first       2017.0
mlb_played_last        2021.0
Name: 0, dtype: object
Fangraphs rows for pitcher Domingo Germán: 0
Baseball Reference rows for pitcher Domingo Germán: 2

66 pitchers for the entire 2021 season is far too low, and Fangraphs does have Domingo Germán's data: https://www.fangraphs.com/players/domingo-german/17149/stats?position=P

It looks like the code only queries the "Leaders" section of Fangraphs (https://github.com/jldbc/pybaseball/blob/bbd03a8b00bcd92e568f6907cacbe3c1ae51c0ee/pybaseball/datasources/fangraphs.py#L45). When I visit https://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=y&type=8&season=2021&month=0&season1=2021&ind=0&page=3_30, it lists 66 players, which matches what I get from the code.
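
One thing worth noting in that leaderboard link is the `qual=y` query parameter, which on Fangraphs appears to restrict the table to qualified players. The sketch below rebuilds the same query string with the standard library so the parameter is easy to vary. The helper name `leaders_query` is hypothetical, the parameter names are copied from the link above (minus the `page` parameter), and Fangraphs may change its URL scheme at any time.

```python
from urllib.parse import urlencode

# Base URL and parameter names are copied from the leaderboard link in the issue;
# they are not a documented API and may change.
LEADERS_URL = "https://www.fangraphs.com/leaders.aspx"


def leaders_query(season: int, qual: str = "y") -> str:
    """Hypothetical helper: build a pitching leaderboard URL for one season."""
    params = {
        "pos": "all",
        "stats": "pit",
        "lg": "all",
        "qual": qual,  # "y" = qualified players only; a number sets a minimum threshold
        "type": 8,
        "season": season,
        "month": 0,
        "season1": season,
        "ind": 0,
    }
    return f"{LEADERS_URL}?{urlencode(params)}"


# Default query (qualified pitchers only) versus a lowered threshold
print(leaders_query(2021))
print(leaders_query(2021, qual="1"))
```

Opening the two printed URLs side by side is a quick way to check whether the row count difference comes from the qualification filter rather than from the library.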

Based on the documentation, I'd expect pitching_stats to have data for all players from the season. Is that wrong?

schorrm commented 3 years ago

Whoops! Out of date documentation, but see #213 which fixes this.

johnclary commented 3 years ago

and so @michaelmdresser here's your fix:

pstats = pybaseball.pitching_stats(2021, qual=1)
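
To confirm the effect of `qual=1`, a rough sketch like the following compares the default result with the lowered-threshold result and checks whether a given Fangraphs ID (here 17149, from the lookup output earlier in the thread) shows up only in the latter. The helper names `missing_from_qualified` and `report_missing` are hypothetical; the `IDfg` column and the `qual` argument are taken from the snippets in this thread. Running `report_missing` requires pybaseball and network access, so the call is left commented out.

```python
from typing import Iterable, Set


def missing_from_qualified(qualified_ids: Iterable[int], all_ids: Iterable[int]) -> Set[int]:
    """Return IDs present in the full result but absent from the qualified-only result."""
    return set(all_ids) - set(qualified_ids)


def report_missing(season: int, fangraphs_id: int) -> None:
    """Fetch both tables and report how many pitchers the default call leaves out."""
    # Imported here so the pure helper above works without pybaseball installed.
    import pybaseball

    qualified = pybaseball.pitching_stats(season)      # default: qualified pitchers only
    everyone = pybaseball.pitching_stats(season, qual=1)  # lowered threshold, as suggested above

    missing = missing_from_qualified(qualified["IDfg"], everyone["IDfg"])
    print(f"Pitchers only returned with qual=1: {len(missing)}")
    print(f"ID {fangraphs_id} among them: {fangraphs_id in missing}")


# report_missing(2021, 17149)  # uncomment to run against Fangraphs
```

Note that `qual=1` still applies a minimum threshold, so pitchers below it may remain absent.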

michaelmdresser commented 3 years ago

Thank you!