nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

Fixing yards_gained and adding fumble_recovery_yards #26

Closed TheMathNinja closed 4 years ago

TheMathNinja commented 4 years ago

I'm noticing an error with yards_gained as it's described. Perhaps adding a couple variables could help.

Looking at 2019_01_HOU_NO , play 3113 is instructive. On this play, Kamara runs for 28 yards, fumbles the ball two yards backwards, and Jared Cook recovers it and takes it two yards forward from there for a total gain of 28. Currently, the yards_gained variable reads 26, when in actuality, 28 yards were gained on the play.

In this case, I believe we should see this breakdown:

rush_yards == 26
fumble_recovery_yards == 2
yards_gained == 28

This is how the NFL scores this play: 26 rush yards for Kamara, 2 fumble recovery yards for Cook. It's worth noting that if Kamara had picked up his own fumble in this situation, it would simply have been 28 rush yards for Kamara. The NFL Guide for Statisticians requires that if a player recovers their own fumble, the yardage is counted as net yards of the kind they started with, but if a teammate recovers, it is counted as fumble recovery yardage.

Do you think it's possible to add 3 new variables to the data set: pass_yards, rush_yards, and fumble_recovery_yards? Then pass_yards + fumble_recovery_yards would always add up to yards_gained on pass plays, and rush_yards + fumble_recovery_yards would always add up to yards_gained on rush plays.
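The scoring rule described in this comment can be sketched in a few lines of Python. This only illustrates the rule as stated here; it is not code from nflfastR, and the function name `split_yards` and its parameters are hypothetical.

```python
def split_yards(yards_to_fumble, fumble_bounce, return_yards, recovered_by_fumbler):
    """Split a play's yardage into (primary_yards, fumble_recovery_yards).

    Hypothetical helper following the rule as described in this thread:
    the ball carrier is credited up to the spot of the recovery; a
    teammate's return counts as fumble recovery yardage, while the
    fumbler's own return stays in the original category (rush or pass).
    """
    to_recovery_spot = yards_to_fumble + fumble_bounce
    if recovered_by_fumbler:
        return to_recovery_spot + return_yards, 0
    return to_recovery_spot, return_yards


# 2019_01_HOU_NO, play 3113: 28-yard run, fumble goes 2 yards backwards,
# a teammate recovers and returns it 2 yards forward.
rush_yards, fumble_recovery_yards = split_yards(28, -2, 2, recovered_by_fumbler=False)
yards_gained = rush_yards + fumble_recovery_yards
print(rush_yards, fumble_recovery_yards, yards_gained)  # 26 2 28
```

With `recovered_by_fumbler=True` the same inputs give 28 rush yards and 0 fumble recovery yards, matching the own-recovery case described above.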

TheMathNinja commented 4 years ago

My apologies for not noticing the fumble_recovery_1_yards variable already there. I guess what I'm saying here then is that perhaps it would be useful to separate out just rush_yards from yards_gained here (where rush_yards is the only new variable, and yards_gained represents total yards gained on the play).

TheMathNinja commented 4 years ago

Also, I'd like to add one more play up for discussion in this question of the relationship between fumbles and yards_gained. game_id 2019_10_SEA_SF, play_id 3345 is very instructive. You can view it here: https://www.youtube.com/watch?v=Opu-YTtncog (I know you will love this, Ben).

Seattle starts the play on their own 35. Wilson is sacked for -10 at his own 25 and fumbles 2 yards backwards; the ball is recovered at the 23 by Ifedi, who runs 5 yards backwards to the 18 and then fumbles 6 yards backwards to the 12, where Buckner recovers it and takes it 12 yards in for the TD.

It seems quite odd to me that this play shows yards_gained == -12. fumble_recovery_1_yards == -11 is correct. fumble_recovery_2_yards == 12 is correct. We should also have some form of sack_yards == -12 as a variable. I think we should then have yards_gained == sack_yards + fumble_recovery_1_yards + -1*(fumble_recovery_2_yards) == -35. Thoughts?

TheMathNinja commented 4 years ago

Also, it's not like you're asking my opinion on this, but for my purposes (and perhaps others), I'd like to suggest the following setup:

rush_yards == Yards gained by rusher on rushing plays
pass_yards == Yards gained by passer on passing plays
sack_yards == Yards gained by sacked player on sack plays
fumble_recovery_yards == Yards gained by all fumble recoveries on possession team
yards_gained == Yards gained by possession team
net_yards == Yards gained by possession team - Yards gained by defending team

In this case, yards_gained == rush_yards + pass_yards + sack_yards + fumble_recovery_yards in all instances, which is a nice result.

In the case of the SEA vs. SF play above:

sack_yards == -12
pass_yards == 0
rush_yards == 0
fumble_recovery_yards == fumble_recovery_1_yards == -11
yards_gained == -23
net_yards == yards_gained + -1*fumble_recovery_2_yards == -23 + -12 == -35

This is my proposal :)
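As a quick check of the arithmetic in this proposal, here is a minimal Python sketch. The function `proposed_totals` and the parameter `defense_recovery_yards` are hypothetical names used only for illustration; nothing here is part of nflfastR.

```python
def proposed_totals(rush_yards, pass_yards, sack_yards,
                    fumble_recovery_yards, defense_recovery_yards):
    """Return (yards_gained, net_yards) under the proposed definitions.

    yards_gained: everything the possession team gained or lost.
    net_yards:    yards_gained minus what the defending team gained.
    """
    yards_gained = rush_yards + pass_yards + sack_yards + fumble_recovery_yards
    net_yards = yards_gained - defense_recovery_yards
    return yards_gained, net_yards


# 2019_10_SEA_SF, play 3345, using the values quoted in this thread.
yards_gained, net_yards = proposed_totals(
    rush_yards=0,
    pass_yards=0,
    sack_yards=-12,
    fumble_recovery_yards=-11,   # fumble_recovery_1_yards (possession team)
    defense_recovery_yards=12,   # fumble_recovery_2_yards (defending team)
)
print(yards_gained, net_yards)  # -23 -35
```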

TheMathNinja commented 4 years ago

Sorry for the monologue, but now that I'm looking more closely at the data, I see that currently: yards_gained == rush_yards + pass_yards + sack_yards given how I'm defining those latter 3 variables.

I guess it would make sense not to re-define yards_gained formulaically. But I would suggest the following:

Re-defining the yards_gained variable in the nflfastR from "numeric yards gained (or lost) for the given play" to "numeric yards gained (or lost) by the possessing team, excluding fumble recovery yardage"

I can create all the other variables myself with this new understanding of yards_gained. Up to you whether it would make sense to include any of them in the package itself.

guga31bb commented 4 years ago

Sorry, just catching up on this now! So is the end result just this one suggestion to change the documentation for yards_gained? That would certainly be easy on our end!

Re-defining the yards_gained variable in the nflfastR from "numeric yards gained (or lost) for the given play" to "numeric yards gained (or lost) by the possessing team, excluding fumble recovery yardage"

TheMathNinja commented 4 years ago

Yes, but I want to amend what I wrote:

Re-defining the yards_gained variable in the nflfastR from "numeric yards gained (or lost) for the given play" to "numeric yards gained (or lost) by the possessing team, excluding yards gained via fumble recoveries and laterals."

In other words, this yards_gained variable only measures the yards gained by the possession team on the initial play. Once the ball is fumbled or lateraled, anything the possession team gains or loses is no longer counted in this variable. And yards gained or lost by the defensive team are in no way factored into this variable.

And optionally, you could include the rush_yards, pass_yards, sack_yards, and net_yards as variables in the package if you found them helpful.
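For anyone who wants these variables without a package change, here is a rough Python sketch of deriving them from yards_gained under the amended definition. It relies on the observation earlier in this thread that yards_gained currently equals rush_yards + pass_yards + sack_yards. The helper `derive_yards` is hypothetical, and the dictionary keys `play_type` and `sack` are assumed to mirror the play-by-play columns; check them against the actual data before relying on this.

```python
def derive_yards(play):
    """Split yards_gained into sack, pass, and rush yards by play type.

    Assumes `play` is a dict with `yards_gained`, `play_type`
    ("run" or "pass"), and an optional 0/1 `sack` flag.
    Exactly one of the three outputs is non-zero at most.
    """
    is_sack = play.get("sack", 0) == 1
    yards = play["yards_gained"]
    return {
        "sack_yards": yards if is_sack else 0,
        "pass_yards": yards if play["play_type"] == "pass" and not is_sack else 0,
        "rush_yards": yards if play["play_type"] == "run" and not is_sack else 0,
    }


# The two plays discussed in this thread, with yards_gained as currently recorded.
kamara_run = {"play_type": "run", "sack": 0, "yards_gained": 26}
wilson_sack = {"play_type": "pass", "sack": 1, "yards_gained": -12}

print(derive_yards(kamara_run))
print(derive_yards(wilson_sack))
```

Fumble recovery yardage is deliberately left out here, since the amended definition excludes it from yards_gained.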

guga31bb commented 4 years ago

Closing this as the documentation has been updated in our dev branch, which will get pushed here eventually. Thank you!