Sebastian suggested that I open a GitHub issue for play clock stuff, so I'm just going to spill everything I have here so far.
My initial concern came up while creating this tweet: the pass probability over expected when play_clock == 0 deviated from the trend. I looked into it a little more and found that, in most seasons, fewer plays than I expected had zero seconds on the play clock. There are also a handful of games that have nothing but zeros for the play clock, which to me signals some kind of error or missing data on the NFL's side.
From what I can tell, there could be several eras of play clock data recorded:
Pre 2012: Nothing available
2012-14: Mostly junk. ~25% of plays have 10sec or 25sec left on play clock. Very few plays under 10sec.
2015-16: Good.
2017: Mostly good. A bunch of plays right at 40sec and a lot at 0sec.
2018-20: Clean, but zero records at 25sec.
I personally only feel comfortable using the play clock from 2015-2020 where there are no zeros and the clock is less than 25 seconds, but I'm not sure if there is anything you two would like to do about this.
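One possible reading of that restriction, as a rough sketch only: keep 2015-2020 plays whose play clock is strictly between 0 and 25 seconds. The toy data frame and the helper name `usable_play_clock` are hypothetical and just for illustration; `season` and `play_clock` are the columns used in the code in this issue, and `play_clock` is converted with `as.numeric()` the same way the plotting code does.

```r
# Toy stand-in for nflfastR play-by-play (values are made up for illustration).
toy_pbp <- data.frame(
  season     = c(2014, 2015, 2016, 2017, 2019, 2020),
  play_clock = c("10", "0", "12", "40", "7", "25"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep 2015-2020 plays with 0 < play_clock < 25.
# This is an assumed interpretation of the restriction, not a settled rule.
usable_play_clock <- function(df) {
  pc <- as.numeric(df$play_clock)
  keep <- !is.na(pc) &
    df$season >= 2015 & df$season <= 2020 &
    pc > 0 & pc < 25
  df[keep, , drop = FALSE]
}

usable <- usable_play_clock(toy_pbp)
print(usable)
```

On real data the same function could be applied to `pbp_df`; whether 25 itself should be kept is a judgment call given the 2018-20 gap at 25sec.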
ID'ing games where there are mostly zeros:
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.3
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: package 'tibble' was built under R version 3.6.3
#> Warning: package 'dplyr' was built under R version 3.6.3
library(nflfastR)
pbp_df <- load_pbp(2014:2020)
#> i It is recommended to use parallel processing when trying to load multiple seasons.
#> Please consider running `future::plan("multisession")`!
#> Will go on sequentially...
pbp_df %>%
  filter(!is.na(down) & !is.na(posteam) & pass + rush == 1) %>%
  group_by(game_id) %>%
  summarise(
    tot_plays = n(),
    play_clock_zero_pct = mean(ifelse(play_clock == 0, 1, 0))
  ) %>%
  arrange(-play_clock_zero_pct)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 1,871 x 3
#> game_id tot_plays play_clock_zero_pct
#> <chr> <int> <dbl>
#> 1 2014_02_SEA_SD 122 1
#> 2 2014_04_JAX_SD 129 1
#> 3 2014_05_NYJ_SD 131 1
#> 4 2014_07_KC_SD 126 1
#> 5 2014_11_OAK_SD 129 1
#> 6 2014_12_STL_SD 127 1
#> 7 2015_04_STL_ARI 124 1
#> 8 2015_09_NYG_TB 135 1
#> 9 2016_03_ATL_NO 142 1
#> 10 2015_10_NO_WAS 119 0.924
#> # ... with 1,861 more rows
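If it turns out to be useful to exclude these games, a minimal sketch of one approach is to compute each game's share of zero play clocks and drop games above a cutoff. The toy data and the 0.9 cutoff (`zero_threshold`) are assumptions for illustration, not a recommendation.

```r
# Toy plays: game "g1" is all zeros, game "g2" has a single zero.
toy_plays <- data.frame(
  game_id    = c("g1", "g1", "g1", "g2", "g2", "g2", "g2"),
  play_clock = c("0", "0", "0", "12", "0", "7", "20"),
  stringsAsFactors = FALSE
)

# Share of plays per game with a play clock of zero.
zero_share <- tapply(
  as.numeric(toy_plays$play_clock) == 0,
  toy_plays$game_id,
  mean
)

# Assumed cutoff: treat a game as suspect if at least 90% of plays are zero.
zero_threshold <- 0.9
bad_games <- names(zero_share)[zero_share >= zero_threshold]

# Keep only plays from games that are not flagged.
clean_plays <- toy_plays[!toy_plays$game_id %in% bad_games, , drop = FALSE]
print(bad_games)
print(clean_plays)
```

With a cutoff like this, the nine games at 1 and the one at 0.924 in the output above would all be flagged.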
Distribution of play clocks by season:
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.3
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: package 'tibble' was built under R version 3.6.3
#> Warning: package 'dplyr' was built under R version 3.6.3
library(nflfastR)
pbp_df <- load_pbp(2012:2020)
#> i It is recommended to use parallel processing when trying to load multiple seasons.
#> Please consider running `future::plan("multisession")`!
#> Will go on sequentially...
pbp_df %>%
  filter(!is.na(down) & !is.na(posteam) & pass + rush == 1) %>%
  mutate(
    play_clock = as.numeric(play_clock),
    play_clock = ifelse(play_clock > 40, '>40', play_clock),
    play_clock = factor(play_clock, c(0:40, '>40'))
  ) %>%
  group_by(season, play_clock) %>%
  summarise(n = n(), .groups = 'drop') %>%
  group_by(season) %>%
  mutate(freq = n / sum(n)) %>%
  ggplot(aes(x = play_clock, y = freq)) +
  facet_wrap(~season, ncol = 1, strip.position = 'left') +
  scale_x_discrete(breaks = c(0, seq(5, 40, 5), '>40')) +
  geom_bar(stat = 'identity', alpha = 0.7) +
  theme_light() +
  theme(
    strip.placement = 'inside',
    panel.grid.minor.y = element_blank()
  )