nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

[BUG] Computed WP != Archived WP - Appears OT related. #446

Open andrewtek opened 9 months ago

andrewtek commented 9 months ago

Is there an existing issue for this?

Have you installed the latest development version of the package(s) in question?

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

nflfastR 4.6.0 (installed and CRAN; the dev version is 4.6.0.9000)

Describe the bug

For 48 plays so far in the 2023 season, the win probability (WP) computed by nflfastR::build_nflfastR_pbp(game_id) does not match the WP archived in nflfastR::load_pbp(2023).

The mismatches appear to be limited to games that went to OT. In the examples below, they start at the last listed play of qtr 4 and continue into qtr 5:

            game_id qtr drive play_id play_type wp_archived wp_computed
 1: 2023_01_BUF_NYJ   4    21    3882             0.5037537   0.3967606
 2: 2023_01_BUF_NYJ   5    22    3902   kickoff   0.5037537   0.3967606
 3: 2023_01_BUF_NYJ   5    22    3918   no_play   0.5037537   0.3967606
 4: 2023_01_BUF_NYJ   5    22    3942      pass   0.4389834   0.3538505
 5: 2023_01_BUF_NYJ   5    22    3965       run   0.3915674   0.3144271
 6: 2023_01_BUF_NYJ   5    22    3987      pass   0.3318032   0.2638814
 7: 2023_01_BUF_NYJ   5    22    4010      punt   0.2552552   0.2038933
 8: 2023_02_LAC_TEN   4    NA    3966             0.5037537   0.3967606
 9: 2023_02_LAC_TEN   5    21    3985   kickoff   0.5037537   0.3967606
10: 2023_02_LAC_TEN   5    21    4001      pass   0.5037537   0.3967606
11: 2023_02_LAC_TEN   5    21    4024      pass   0.4562042   0.3605613
12: 2023_02_LAC_TEN   5    21    4047      pass   0.3898785   0.3070003
13: 2023_02_LAC_TEN   5    21    4070      punt   0.2804652   0.2217094

The reprex below is a short script that takes a few minutes to run. It compares the archived WP for every 2023 play against the computed WP and reports any play whose two values differ after rounding to 6 decimal places.

Reprex

library(dplyr)

#clear cache
nflreadr::.clear_cache()

#load 2023 season directly from archive
pbp <- nflfastR::load_pbp(2023)

#get unique game_ids
game_ids <- unique(pbp$game_id)

# process game ids looking for mismatches
mismatch_dfs <- lapply(game_ids, function(game_id) {
    #output
    nflfastR:::user_message(paste0("Processing game ", game_id, "."), "todo")

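    #get subset of plays for the game specified
    #(.env$ is needed: a bare game_id on both sides refers to the column, so the filter would keep every row)
    archived <- filter(pbp, game_id == .env$game_id)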

    #compute value without filling output
    suppressMessages({
      computed <- nflfastR::build_nflfastR_pbp(game_id) %>%
        as.data.frame()
    })

    #merge the two dataframes on common columns
    merged_df <- merge(archived, computed, by = c("game_id", "qtr", "drive", "play_id", "play_type"), suffixes = c("_archived", "_computed"))

    #subset where 'wp' values are different
    result <- subset(merged_df, round(wp_archived, 6) != round(wp_computed, 6))

    #output status
    if (nrow(result) > 0) {
      nflfastR:::user_message(paste0("Processing game ", game_id, " has mismatches."), "info")
    }else{
      nflfastR:::user_message(paste0("Processing game ", game_id, "."), "done")
    }

    #return dataframe
    dplyr::select(result, "game_id", "qtr", "drive", "play_id", "play_type", "wp_archived", "wp_computed")
  })

#combining the mismatched dfs
combined_results <- do.call(rbind, mismatch_dfs)

#output
print(combined_results, n = Inf)
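
The comparison step can also be checked offline on a few rows, without downloading anything. The following is a minimal base-R sketch, not part of the reprex above: the first three rows reuse WP values from the table in this report, while the fourth row (play_id 1) is a hypothetical matching play added only to show that agreeing rows are dropped. It flags mismatches with an absolute tolerance (the name tol and its value are assumptions) rather than by comparing rounded values, since two nearly equal numbers can still round to different 6th digits.

```r
# Toy stand-ins for the archived and recomputed play-by-play tables.
# Rows 1-3 reuse values reported above; play_id 1 is hypothetical.
archived <- data.frame(
  game_id = "2023_01_BUF_NYJ",
  play_id = c(3902, 3942, 3965, 1),
  wp      = c(0.5037537, 0.4389834, 0.3915674, 0.5000000)
)
computed <- data.frame(
  game_id = "2023_01_BUF_NYJ",
  play_id = c(3902, 3942, 3965, 1),
  wp      = c(0.3967606, 0.3538505, 0.3144271, 0.5000000)
)

# Join on the key columns; the shared wp column gets the two suffixes.
merged <- merge(archived, computed,
                by = c("game_id", "play_id"),
                suffixes = c("_archived", "_computed"))

# Assumed tolerance for "same value"; adjust as needed.
tol <- 1e-6
merged$wp_diff <- merged$wp_archived - merged$wp_computed

# which() drops rows where the difference is NA instead of returning NA rows.
mismatches <- merged[which(abs(merged$wp_diff) > tol), ]
print(mismatches)
```

Keeping the signed difference in wp_diff also makes it easy to see whether the archived value is consistently higher than the computed one, as it is in every row listed in this report.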

Expected Behavior

I would expect the archived value to match the computed value.

nflverse_sitrep

> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.2 (2023-10-31 ucrt) • Running under: Windows 11 x64 (build 22621)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1 nflfastR     4.6.0 4.6.0 4.6.0.9000    dev
2 nflreadr     1.4.0 1.4.0   1.4.0.09    dev
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────
• cachem      (1.0.8)   • listenv    (0.9.0)   • utf8      (1.2.3)    
• cli         (3.6.1)   • lubridate  (1.9.3)   • vctrs     (0.6.3)    
• cpp11       (0.4.6)   • magrittr   (2.0.3)   • withr     (2.5.2)    
• curl        (5.1.0)   • memoise    (2.0.1)   • xgboost   (1.7.5.1)  
• data.table  (1.14.8)  • parallelly (1.36.0)  • codetools (0.2-19)   
• digest      (0.6.33)  • pillar     (1.9.0)   • compiler  (4.3.2)    
• dplyr       (1.1.3)   • pkgconfig  (2.0.3)   • graphics  (4.3.2)    
• fansi       (1.0.4)   • progressr  (0.14.0)  • grDevices (4.3.2)    
• fastmap     (1.1.1)   • purrr      (1.0.2)   • grid      (4.3.2)    
• fastrmodels (1.0.2)   • R6         (2.5.1)   • lattice   (0.21-9)   
• furrr       (0.3.1)   • rappdirs   (0.3.3)   • Matrix    (1.6-1.1)  
• future      (1.33.0)  • rlang      (1.1.1)   • methods   (4.3.2)    
• generics    (0.1.3)   • snakecase  (0.11.1)  • mgcv      (1.9-0)    
• globals     (0.16.2)  • stringi    (1.8.1)   • nlme      (3.1-163)  
• glue        (1.6.2)   • stringr    (1.5.1)   • parallel  (4.3.2)    
• hms         (1.1.3)   • tibble     (3.2.1)   • splines   (4.3.2)    
• janitor     (2.2.0)   • tidyr      (1.3.0)   • stats     (4.3.2)    
• jsonlite    (1.8.7)   • tidyselect (1.2.0)   • tools     (4.3.2)    
• lifecycle   (1.0.4)   • timechange (0.2.0)   • utils     (4.3.2)    
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR  • nflplotR    
• nfl4th    • nflverse

Screenshots

No response

Additional context

No response