jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

Baseball Reference Pitcher WAR #9

Closed jfreynolds closed 6 years ago

jfreynolds commented 6 years ago

Are there any plans to add WAR or any stats from the Player Value tables on Baseball Reference to pitching_stats_bref(season)?

I was looking to find the largest difference between bWAR and fWAR for pitchers, but I am unable to without a WAR column in the dataframe that returns from pitching_stats_bref(season). Were there issues in obtaining that data or just never implemented?

jldbc commented 6 years ago

Good question. I didn't realize that wasn't included there. It's also missing in batting_stats_bref(season). The tables these scrape from don't include bWAR, but I'm open to moving these functions over to a better table.

They're currently using Baseball Reference's Daily Gamelog Finder, where batting_stats_bref(season) supplies a season-length date range to batting_stats_range(start_dt,end_dt). Know of a better table that includes WAR + all the other standard stats for each player?

trojanguard25 commented 6 years ago

I'm not aware of any way to query bWAR over a date range. Baseball-reference hosts files that include player bWAR (among other stats that go into their WAR calculations) for batters and pitchers. Every player has an entry broken up by year-team-stint.

http://www.baseball-reference.com/data/war_daily_bat.txt

http://www.baseball-reference.com/data/war_daily_pitch.txt

These files are updated daily during the season, as well as during the offseason whenever they make stat adjustments.

I don't think there are analogous files for traditional counting stats, so you will probably need separate interfaces.
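As a rough sketch of how one of these files could be read: the snippet below assumes the files are comma-separated with a header row, and the helper name `load_war_table` is hypothetical. To stay runnable offline it parses a tiny made-up sample instead of fetching the URL, which may or may not still resolve.

```python
import io

import pandas as pd

# URL quoted in this thread; availability is not guaranteed.
WAR_PITCH_URL = "http://www.baseball-reference.com/data/war_daily_pitch.txt"


def load_war_table(source):
    """Read a WAR file from a URL, a path, or a file-like object.

    Assumes comma-separated text with a header row (hypothetical helper).
    """
    return pd.read_csv(source)


# Made-up rows in the year-team-stint layout described above.
sample = io.StringIO(
    "name_common,player_ID,year_ID,team_ID,stint_ID,WAR\n"
    "Example Pitcher,examp01,2017,NYY,1,1.5\n"
    "Example Pitcher,examp01,2017,BOS,2,0.7\n"
)
war = load_war_table(sample)
print(war)
```

Passing `WAR_PITCH_URL` instead of `sample` would attempt the real download, since `pd.read_csv` accepts URLs as well as file-like objects.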

jfreynolds commented 6 years ago

My initial thought was to get a player's Baseball Reference ID and scrape the table from their player page. You could sum something like WAR over a given range, but that feels like a band-aid fix, and it wouldn't work for aggregating other values over a range.

On top of that, it would be pretty slow all in all.

jldbc commented 6 years ago

I can't see fetching WAR one player at a time scaling well beyond a small number of players.

The data @trojanguard25 mentioned look promising. If there's no single source with WAR and traditional stats side by side, a separate scrape for pulling this data might be the best route forward. From there a user can join the tables together on player id if necessary.
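A minimal sketch of that join, using made-up rows. It assumes both tables carry `player_ID` and `year_ID` columns; the traditional-stats table here is a stand-in and its column names are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for a WAR table and a traditional-stats table.
war = pd.DataFrame({
    "player_ID": ["examp01", "sampl02"],
    "year_ID": [2017, 2017],
    "WAR": [2.2, 4.1],
})
stats = pd.DataFrame({
    "player_ID": ["examp01", "sampl02"],
    "year_ID": [2017, 2017],
    "IP": [150.0, 201.1],
    "SO": [140, 230],
})

# Join on player and season so multi-year tables do not cross-match.
combined = war.merge(stats, on=["player_ID", "year_ID"], how="inner")
print(combined)
```

An inner join drops players missing from either table; `how="left"` would keep every WAR row instead.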

Thoughts/objections?

jfreynolds commented 6 years ago

The daily batting/pitching files seem to be the best option available.

Should all the data from those files be provided to the user by default? There is a lot in there that isn't regularly sought after. Maybe by default we return the more common statistics from the file (WAR, salary, WAA, ERA+, etc.), and return everything available only if a boolean argument is set to true?

jldbc commented 6 years ago

Most of these could be left out by default since the main point of this is to get WAR. Returning all 49 columns by default might be overkill.

Bare minimum would be WAR, its essential components (WAA and WAR_rep for batters; WAA, WAR_rep, and WAA_adj for pitchers), and everything needed to identify the player and connect them with another table. I think this would mean WAR, WAR_off, WAR_def, WAR_rep, WAA, mlb_ID, player_ID, team_ID, year_ID, stint_ID for both, plus WAA_adj for pitching, unless I'm missing anything.

On top of these it might get a bit arbitrary to decide what to leave in by default. Is anything else important to keep in or should the rest be optional with something along the lines of a boolean return_all parameter? Maybe G for both and GS for pitchers since these are common things people might filter on?

jfreynolds commented 6 years ago

Definitely agree that WAR values should be the default. The only other things that jump out to me as likely to be frequently requested are ERA+, salary, or even BIP. Other than that, nothing really strikes me.

So, it seems like the best idea would be to provide WAR and its components by default, maybe allow an argument for some more commonly used columns such as ERA+, salary, RA, xRA, RAA, BIP, etc., and finally a return_all parameter like you said to return all of the columns.

I just think people may occasionally want a select few values outside of WAR, and forcing them to read all of the columns seems like unnecessary overhead. Should we just keep it simple though? WAR and its components by default, or all columns if some boolean argument is true?

jldbc commented 6 years ago

Yeah we can keep some of the more commonly used ones in. For non-WAR, non-identification columns of interest I'm seeing:

Batting: salary, G, PA, runs_above_avg, runs_above_avg_off, runs_above_avg_def

Pitching: G, GS, RA, xRA, BIP, BIP_perc, salary, ERA_plus

Which all in all would have these as the defaults:

Batting: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'pitcher', 'G', 'PA', 'salary', 'runs_above_avg', 'runs_above_avg_off', 'runs_above_avg_def', 'WAR_rep', 'WAA', 'WAR']

Pitching: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'G', 'GS', 'RA', 'xRA', 'BIP', 'BIP_perc', 'salary', 'ERA_plus', 'WAR_rep', 'WAA', 'WAA_adj', 'WAR']

With everything else being retrievable with a return_all type of parameter. Anything important I missed? This leaves ~ 20 columns each which seems reasonable.

The function itself would basically be the top response to this Stack Overflow post with the above column filtering.
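The column filtering could look roughly like the sketch below. The helper `select_columns` is hypothetical and is not the code that was eventually merged; the default list is the pitching list from this comment, and `extra_col` is a made-up stand-in for the non-default fields.

```python
import pandas as pd

# Default pitching columns as proposed in this comment.
PITCH_DEFAULT_COLS = [
    'name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID',
    'stint_ID', 'lg_ID', 'G', 'GS', 'RA', 'xRA', 'BIP', 'BIP_perc',
    'salary', 'ERA_plus', 'WAR_rep', 'WAA', 'WAA_adj', 'WAR',
]


def select_columns(df, default_cols, return_all=False):
    """Return every column if return_all is set, else only the defaults."""
    if return_all:
        return df
    return df[default_cols]


# One made-up row with all default columns plus one non-default column.
raw = pd.DataFrame([{col: 0 for col in PITCH_DEFAULT_COLS + ["extra_col"]}])

trimmed = select_columns(raw, PITCH_DEFAULT_COLS)
full = select_columns(raw, PITCH_DEFAULT_COLS, return_all=True)
print(len(trimmed.columns), len(full.columns))
```

Keeping the default list as a module-level constant makes it easy to adjust later without touching the filtering logic.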

trojanguard25 commented 6 years ago

I think there should be some default 'groupby' that is done to combine player rows for the same year. I committed a potential option in my fork: https://github.com/trojanguard25/pybaseball/commits/cache

This function returns all the columns for a single season. By default, it groups the rows so each player has a single entry for the year submitted. I also added an option to split each player by team. I think those are the two most common use-cases.

This does cause a problem, since some of the columns (like ERA+) cannot be summed or averaged; rather, they need to be weighted by playing time. I'm not exactly sure of the best way to handle that correctly.

jldbc commented 6 years ago

Let's leave the groupby in the hands of the user for now since doing it for them without using proper weights might cause people to unknowingly use bad data (i.e. using a summed/averaged ERA+ without realizing it's not weighted).
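For anyone doing the groupby themselves, here is one possible sketch of a playing-time-weighted aggregation on made-up rows. The weight column `IPouts` and the helper `aggregate_season` are assumptions for illustration, and weighting ERA+ by outs is only an approximation of recomputing it from the underlying components.

```python
import pandas as pd

# Made-up stint rows: one player with two stints, one with a single stint.
rows = pd.DataFrame({
    "player_ID": ["examp01", "examp01", "sampl02"],
    "year_ID": [2017, 2017, 2017],
    "stint_ID": [1, 2, 1],
    "IPouts": [300, 100, 600],
    "ERA_plus": [120.0, 80.0, 100.0],
    "WAR": [1.5, 0.7, 4.1],
})


def aggregate_season(df, weight_col="IPouts"):
    """Sum WAR per player-season and weight ERA_plus by weight_col."""
    keys = ["player_ID", "year_ID"]
    # Pre-multiply so a plain sum yields the weighted numerator.
    weighted = df.assign(_w_era=df["ERA_plus"] * df[weight_col])
    grouped = weighted.groupby(keys, as_index=False).agg(
        WAR=("WAR", "sum"),
        weight=(weight_col, "sum"),
        _w_era=("_w_era", "sum"),
    )
    grouped["ERA_plus"] = grouped["_w_era"] / grouped["weight"]
    return grouped.drop(columns="_w_era").rename(columns={"weight": weight_col})


season = aggregate_season(rows)
print(season)
```

For the two-stint player, the weighted ERA+ works out to 110, whereas a plain mean of the two stints would give 100, which is the kind of silent error the comment above warns about.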

I pushed the version I've been using to a new branch in 7b10b8220cc67bf737f2e54f7cd7d4d088b86fe8. I'll merge later today if there aren't any objections.

It's probably worth opening a new issue for working on properly-weighted aggregations since it definitely would be useful to have.

jldbc commented 6 years ago

Merged branch bwar to master. Commit 7b10b8220cc67bf737f2e54f7cd7d4d088b86fe8 adds the functions bwar_bat() and bwar_pitch(), each with the optional argument return_all to retrieve all fields.