nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
[BUG] Computed WP != Archived WP - Appears OT related. #446

Open andrewtek opened 9 months ago

Describe the bug

For 48 plays so far in 2023, I have identified that WP computed by nflfastR::build_nflfastR_pbp(game_id) does not match the WP archived in nflfastR::load_pbp(2023).

When this occurs, it appears to apply to games with OT. For instance:

            game_id qtr drive play_id play_type wp_archived wp_computed
 1: 2023_01_BUF_NYJ   4    21    3882             0.5037537   0.3967606
 2: 2023_01_BUF_NYJ   5    22    3902   kickoff   0.5037537   0.3967606
 3: 2023_01_BUF_NYJ   5    22    3918   no_play   0.5037537   0.3967606
 4: 2023_01_BUF_NYJ   5    22    3942      pass   0.4389834   0.3538505
 5: 2023_01_BUF_NYJ   5    22    3965       run   0.3915674   0.3144271
 6: 2023_01_BUF_NYJ   5    22    3987      pass   0.3318032   0.2638814
 7: 2023_01_BUF_NYJ   5    22    4010      punt   0.2552552   0.2038933
 8: 2023_02_LAC_TEN   4    NA    3966             0.5037537   0.3967606
 9: 2023_02_LAC_TEN   5    21    3985   kickoff   0.5037537   0.3967606
10: 2023_02_LAC_TEN   5    21    4001      pass   0.5037537   0.3967606
11: 2023_02_LAC_TEN   5    21    4024      pass   0.4562042   0.3605613
12: 2023_02_LAC_TEN   5    21    4047      pass   0.3898785   0.3070003
13: 2023_02_LAC_TEN   5    21    4070      punt   0.2804652   0.2217094
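
One way to test the OT hypothesis is to check whether every game with a mismatch also contains overtime plays. The following is only a sketch: `mismatches_only_in_ot` is a hypothetical helper, the toy data frames stand in for the real play-by-play and mismatch tables, and it assumes overtime plays are coded as `qtr >= 5`, as the rows in the table above suggest.

```r
# Hypothetical helper: TRUE if every game that has a WP mismatch also has
# at least one overtime play. Assumes overtime is coded as qtr >= 5.
mismatches_only_in_ot <- function(pbp, mismatches) {
  ot_games <- unique(pbp$game_id[!is.na(pbp$qtr) & pbp$qtr >= 5])
  all(unique(mismatches$game_id) %in% ot_games)
}

# Toy stand-in for the play-by-play data: two games with an OT period and
# one made-up game id that ends in regulation.
toy_pbp <- data.frame(
  game_id = c("2023_01_BUF_NYJ", "2023_01_BUF_NYJ",
              "2023_02_LAC_TEN", "2023_02_LAC_TEN",
              "toy_regulation_game"),
  qtr = c(4, 5, 4, 5, 4)
)

# Toy stand-in for the mismatch table produced by the comparison.
toy_mismatches <- data.frame(
  game_id = c("2023_01_BUF_NYJ", "2023_02_LAC_TEN")
)

mismatches_only_in_ot(toy_pbp, toy_mismatches)
```

With the real data, the same helper could be called with the full 2023 play-by-play and the table of mismatching plays in place of the toy inputs.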

The reprex below is a short script that takes a few minutes to run. It compares the archived WP for every 2023 play against the computed WP and reports any play whose two values differ after rounding to 6 decimal places.



#clear cache
nflreadr::clear_cache()

#load 2023 season directly from archive
pbp <- nflfastR::load_pbp(2023)

#get unique game_ids
game_ids <- unique(pbp$game_id)

# process game ids looking for mismatches
# (the argument is named gid so it does not shadow the game_id column in filter)
mismatch_dfs <- lapply(game_ids, function(gid) {
    nflfastR:::user_message(paste0("Processing game ", gid, "."), "todo")

    #get subset of plays for the game specified
    archived <- dplyr::filter(pbp, game_id == gid)

    #compute value without filling output
    computed <- suppressMessages(nflfastR::build_nflfastR_pbp(gid))

    #merge the two dataframes on common columns
    merged_df <- merge(archived, computed, by = c("game_id", "qtr", "drive", "play_id", "play_type"), suffixes = c("_archived", "_computed"))

    #subset where 'wp' values are different
    result <- subset(merged_df, round(wp_archived, 6) != round(wp_computed, 6))

    #output status
    if (nrow(result) > 0) {
      nflfastR:::user_message(paste0("Processing game ", gid, " has mismatches."), "info")
    } else {
      nflfastR:::user_message(paste0("Processing game ", gid, "."), "done")
    }

    #return dataframe
    dplyr::select(result, "game_id", "qtr", "drive", "play_id", "play_type", "wp_archived", "wp_computed")
})

#combining the mismatched dfs
combined_results <- do.call(rbind, mismatch_dfs)

print(combined_results, n = Inf)
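
The script compares the two WP columns after rounding each to 6 digits. As a side note, rounding both sides first can flag values that are almost identical but fall on either side of a rounding boundary. A minimal sketch of a tolerance-based alternative follows; `wp_differs` is a hypothetical helper, the values of `a` and `b` are made up, and the tolerance of `1e-6` is an assumption.

```r
# Two made-up values that differ by only 2e-11 but straddle a rounding boundary.
a <- 0.12345649999
b <- 0.12345650001

# Rounding each side first treats them as different.
round(a, 6) != round(b, 6)

# Hypothetical alternative: flag a mismatch only when the absolute difference
# exceeds a tolerance (1e-6 is an assumed value, not taken from the issue).
wp_differs <- function(x, y, tol = 1e-6) {
  abs(x - y) > tol
}

# The near-identical pair is not flagged, while a pair taken from the
# mismatch table above still is.
wp_differs(a, b)
wp_differs(0.5037537, 0.3967606)
```

The gaps reported in this issue are on the order of 0.1, so either comparison flags them; the choice only matters for differences near the sixth decimal place.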

Expected Behavior

I would expect the archived value to match the computed value.


> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.2 (2023-10-31 ucrt) • Running under: Windows 11 x64 (build 22621)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1 nflfastR     4.6.0 4.6.0    dev
2 nflreadr     1.4.0 1.4.0    dev
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────
• cachem      (1.0.8)   • listenv    (0.9.0)   • utf8      (1.2.3)    
• cli         (3.6.1)   • lubridate  (1.9.3)   • vctrs     (0.6.3)    
• cpp11       (0.4.6)   • magrittr   (2.0.3)   • withr     (2.5.2)    
• curl        (5.1.0)   • memoise    (2.0.1)   • xgboost   (  
• data.table  (1.14.8)  • parallelly (1.36.0)  • codetools (0.2-19)   
• digest      (0.6.33)  • pillar     (1.9.0)   • compiler  (4.3.2)    
• dplyr       (1.1.3)   • pkgconfig  (2.0.3)   • graphics  (4.3.2)    
• fansi       (1.0.4)   • progressr  (0.14.0)  • grDevices (4.3.2)    
• fastmap     (1.1.1)   • purrr      (1.0.2)   • grid      (4.3.2)    
• fastrmodels (1.0.2)   • R6         (2.5.1)   • lattice   (0.21-9)   
• furrr       (0.3.1)   • rappdirs   (0.3.3)   • Matrix    (1.6-1.1)  
• future      (1.33.0)  • rlang      (1.1.1)   • methods   (4.3.2)    
• generics    (0.1.3)   • snakecase  (0.11.1)  • mgcv      (1.9-0)    
• globals     (0.16.2)  • stringi    (1.8.1)   • nlme      (3.1-163)  
• glue        (1.6.2)   • stringr    (1.5.1)   • parallel  (4.3.2)    
• hms         (1.1.3)   • tibble     (3.2.1)   • splines   (4.3.2)    
• janitor     (2.2.0)   • tidyr      (1.3.0)   • stats     (4.3.2)    
• jsonlite    (1.8.7)   • tidyselect (1.2.0)   • tools     (4.3.2)    
• lifecycle   (1.0.4)   • timechange (0.2.0)   • utils     (4.3.2)    
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR  • nflplotR    
• nfl4th    • nflverse


Additional context

