meysubb / cfbscrapR-archived

CFB R Package
GNU General Public License v3.0

EPA lower than expected value in Michigan-Rutgers game #20

Closed colintj closed 4 years ago

colintj commented 4 years ago

Description:

The 2019 Michigan-Rutgers game says Michigan's first play on offense, 1st & 10 from the 20, was a run that went for 6 yards, but was worth -0.58 EPA. That's much lower than I expected.

Reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom'
#> when loading 'cfbscrapR'
library(reprex)

pbp_2019 <- data.frame()

for(i in 1:15) {
  data <-
    cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)  # label each chunk with the week being fetched
  data <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, data)

}

pbp_2019 %>%
  filter(offense_play == "Michigan",
         defense_play == "Rutgers") %>%
  select(offense_play,
         defense_play,
         drive_id,
         half,
         clock.minutes,
         clock.seconds,
         offense_score,
         defense_score,
         play_type,
         down,
         distance,
         adj_yd_line,
         yards_gained,
         ep_before,
         ep_after,
         EPA) %>% head()
#>   offense_play defense_play   drive_id half clock.minutes clock.seconds
#> 1     Michigan      Rutgers 4011122251    1            30             0
#> 2     Michigan      Rutgers 4011122251    1            30             0
#> 3     Michigan      Rutgers 4011122251    1            30             0
#> 4     Michigan      Rutgers 4011122251    1            30             0
#> 5     Michigan      Rutgers 4011122251    1            27            52
#> 6     Michigan      Rutgers 4011122252    1            27            52
#>   offense_score defense_score         play_type down distance adj_yd_line
#> 1             0             0              Rush    1       10          80
#> 2             0             0    Pass Reception    2        4          74
#> 3             0             0              Rush    1       10          60
#> 4             0             0              Rush    2        8          58
#> 5             7             0 Passing Touchdown    1       10          48
#> 6             7             0           Kickoff    1        0          78
#>   yards_gained  ep_before   ep_after        EPA
#> 1            6 0.66694425 0.08348048 -0.5834638
#> 2           14 0.08348048 2.05876294  1.9752825
#> 3            2 2.05876294 1.29938456 -0.7593784
#> 4           10 1.29938456 2.93682006  1.6374355
#> 5           48 2.93682006 7.00000000  4.0631799
#> 6           19 1.06557103 0.85929598 -0.2062750

Created on 2020-01-11 by the reprex package (v0.3.0)

R version:

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.3  magrittr_1.5    tools_3.5.3     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.1      stringi_1.4.3   rmarkdown_1.12 
#>  [9] highr_0.8       knitr_1.22      stringr_1.4.0   xfun_0.6       
#> [13] digest_0.6.18   evaluate_0.13


norris13 commented 4 years ago

Glad you pointed this out; I'm getting the same values. I ran sixYards <- week5 %>% filter(adj_yd_line == 80, down == 1, distance == 10, yards_gained == 6) to see what other plays fell into the same scenario, and they all had different EPA values but the same PPA. Besides the difference in time left in the half, I can't think of any reason they should have different EPAs. I'm just going to use PPA from here on out unless you come up with a different solution.
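
For anyone who wants to repeat that comparison, here is a minimal sketch of the same filter plus a summary of how far EPA spreads within one situation. The data frame `toy_pbp` and every value in it are hypothetical stand-ins, except the first EPA, which is copied from the reprex output; with real data you would substitute your own play-by-play frame.

```r
library(dplyr)

# Hypothetical rows standing in for play-by-play data. Only the first EPA
# value comes from the reprex output; the others are made up for illustration.
toy_pbp <- data.frame(
  adj_yd_line  = c(80, 80, 80, 65),
  down         = c(1, 1, 1, 2),
  distance     = c(10, 10, 10, 7),
  yards_gained = c(6, 6, 6, 3),
  EPA          = c(-0.5834638, 0.21, 0.05, -0.30)
)

# Keep only plays in the scenario from the issue:
# 1st & 10 at adj_yd_line 80 with a 6-yard gain.
same_situation <- toy_pbp %>%
  filter(adj_yd_line == 80, down == 1, distance == 10, yards_gained == 6)

# Summarise how much EPA varies across otherwise identical situations.
spread <- same_situation %>%
  summarise(n_plays = n(),
            min_EPA = min(EPA),
            max_EPA = max(EPA))

spread
```

A wide gap between `min_EPA` and `max_EPA` for identical down, distance, field position and yards gained is the kind of inconsistency described above.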

meysubb commented 4 years ago

It looks like a bug in how the EPA calcs are done. I'm using id_play with a combination of lead/lag, which might be the problem (if id_play is not in the right order, that is).
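
To illustrate why row order matters with lead/lag, here is a rough sketch. It assumes a play's ep_after is taken as the next row's ep_before, which is consistent with the reprex output but is not confirmed as the package's actual code. The `id_play` values and the `add_epa` helper are hypothetical; the ep_before values are copied from the reprex output.

```r
library(dplyr)

# Toy drive: ep_before values are taken from the reprex output;
# the id_play values are hypothetical and only encode a play order.
plays <- data.frame(
  id_play   = c(101, 102, 103, 104),
  ep_before = c(0.66694425, 0.08348048, 2.05876294, 1.29938456)
)

# Assumed scheme (hypothetical helper): a play's ep_after is the
# ep_before of whichever row comes next.
add_epa <- function(df) {
  df %>%
    mutate(ep_after = lead(ep_before),
           EPA = ep_after - ep_before)
}

# Rows sorted by id_play before lead() is applied.
in_order <- plays %>% arrange(id_play) %>% add_epa()

# Same plays with the second and third rows swapped before lead() is applied.
shuffled <- plays[c(1, 3, 2, 4), ] %>% add_epa()

# The first play's EPA depends entirely on which row happens to follow it.
in_order$EPA[1]
shuffled$EPA[1]
```

The sketch only shows that lead() is sensitive to ordering; it does not establish which ordering the package actually used. Sorting explicitly (and grouping by game or drive) before calling lead/lag removes the dependence on whatever order the rows arrive in.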

meysubb commented 4 years ago

This should be fixed with the new EP models and the new version in general. Closing this.