nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

[BUG] Duplicate combinations of `game_id`, `drive`, and `play_id` in the play-by-play data #477

Open isaactpetersen opened 1 month ago

isaactpetersen commented 1 month ago

Is there an existing issue for this?

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

nflreadr 1.4.1

Describe the bug

Sorry if I'm posting this in the wrong repository; I posted a similar issue in the nflreadr repository, and it was moved here. The nflreadr data dictionary for the play-by-play data (returned by nflreadr::load_pbp()) indicates that each row (i.e., play) should be uniquely identified by the combination of game_id, drive, and play_id. However, some combinations of game_id, drive, and play_id appear in more than one row. See the reprex below.

Reprex

library("nflreadr")
library("dplyr")
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

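# Load play-by-play data for all available seasons
pbp <- nflreadr::load_pbp(seasons = TRUE)

# Move the key columns to the front and sort by them
pbp <- pbp %>%
  select(game_id, drive, play_id, everything()) %>%
  arrange(game_id, drive, play_id)

# Keep only rows whose key combination occurs more than once
pbp %>%
  group_by(game_id, drive, play_id) %>%
  filter(n() > 1)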
#> # A tibble: 10 × 372
#> # Groups:   game_id, drive, play_id [4]
#>    game_id       drive play_id old_game_id home_team away_team season_type  week
#>    <chr>         <dbl>   <dbl> <chr>       <chr>     <chr>     <chr>       <int>
#>  1 2000_03_PIT_…    18    2767 2000091708  CLE       PIT       REG             3
#>  2 2000_03_PIT_…    18    2767 2000091708  CLE       PIT       REG             3
#>  3 2000_03_PIT_…    18    2768 2000091708  CLE       PIT       REG             3
#>  4 2000_03_PIT_…    18    2768 2000091708  CLE       PIT       REG             3
#>  5 2000_06_WAS_…    12    1825 2000100811  PHI       WAS       REG             6
#>  6 2000_06_WAS_…    12    1825 2000100811  PHI       WAS       REG             6
#>  7 2000_06_WAS_…    12    1825 2000100811  PHI       WAS       REG             6
#>  8 2000_11_OAK_…    15    2323 2000111300  DEN       LV        REG            11
#>  9 2000_11_OAK_…    15    2323 2000111300  DEN       LV        REG            11
#> 10 2000_11_OAK_…    15    2323 2000111300  DEN       LV        REG            11
#> # ℹ 364 more variables: posteam <chr>, posteam_type <chr>, defteam <chr>,
#> #   side_of_field <chr>, yardline_100 <dbl>, game_date <chr>,
#> #   quarter_seconds_remaining <dbl>, half_seconds_remaining <dbl>,
#> #   game_seconds_remaining <dbl>, game_half <chr>, quarter_end <dbl>, sp <dbl>,
#> #   qtr <dbl>, down <dbl>, goal_to_go <dbl>, time <chr>, yrdln <chr>,
#> #   ydstogo <dbl>, ydsnet <dbl>, desc <chr>, play_type <chr>,
#> #   yards_gained <dbl>, shotgun <dbl>, no_huddle <dbl>, qb_dropback <dbl>, …

Created on 2024-07-31 with reprex v2.1.1

Expected Behavior

I expect each game_id-drive-play_id combination to appear in exactly one row, i.e., one row per play.
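To triage rows like these, it may help to separate exact row copies from rows that share a key but differ in other columns. The following is a minimal sketch in base R on a made-up toy data frame; `toy_pbp` and its values are hypothetical stand-ins rather than real nflverse data, and the `desc` column is only borrowed as an example of a non-key column.

```r
# Toy stand-in for play-by-play data; all values are invented for illustration.
toy_pbp <- data.frame(
  game_id = c("g1", "g1", "g1", "g2", "g2"),
  drive   = c(1, 1, 1, 3, 3),
  play_id = c(10, 10, 11, 20, 20),
  desc    = c("pass", "pass", "run", "punt", "penalty"),
  stringsAsFactors = FALSE
)

key_cols <- c("game_id", "drive", "play_id")

# Flag every row whose key occurs more than once (first and later copies alike).
key <- toy_pbp[key_cols]
dup_key <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Flag rows that are exact copies of another row across all columns.
dup_row <- duplicated(toy_pbp) | duplicated(toy_pbp, fromLast = TRUE)

# Repeated key but differing content: these cannot be dropped blindly.
conflicting <- toy_pbp[dup_key & !dup_row, ]

# Removing exact copies is lossless; conflicting rows remain for review.
deduped <- unique(toy_pbp)
```

In this toy case the two identical "pass" rows collapse into one, while the "punt" and "penalty" rows share a key but differ in content and so stay in `conflicting`. Whether the duplicates reported here are exact copies or conflicting rows is not visible from the truncated output above.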

nflverse_sitrep

> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.1 (2023-06-16 ucrt) • Running under: Windows 11 x64 (build 22631)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1   nfl4th     1.0.4 1.0.4 1.0.4.9002    dev
2 nflfastR     4.6.1 4.6.1 4.6.1.9010    dev
3 nflplotR     1.3.1 1.3.1      1.3.1       
4 nflreadr     1.4.1 1.4.1   1.4.1.00       
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• askpass     (1.2.0)    • httr         (1.4.7)   • stringi     (1.8.4)       
• backports   (1.5.0)    • isoband      (0.2.7)   • stringr     (1.5.1)       
• base64enc   (0.1-3)    • janitor      (2.2.0)   • sys         (3.4.2)       
• bigD        (0.2.0)    • jquerylib    (0.1.4)   • tibble      (3.2.1)       
• bitops      (1.0-8)    • jsonlite     (1.8.8)   • tidyr       (1.3.1)       
• bslib       (0.8.0)    • juicyjuice   (0.1.0)   • tidyselect  (1.2.1)       
• cachem      (1.1.0)    • knitr        (1.48)    • timechange  (0.3.0)       
• cli         (3.6.3)    • labeling     (0.4.3)   • tinytex     (0.52)        
• colorspace  (2.1-1)    • lifecycle    (1.0.4)   • utf8        (1.2.4)       
• commonmark  (1.9.1)    • listenv      (0.9.1)   • V8          (4.4.2)       
• cpp11       (0.4.7)    • lubridate    (1.9.3)   • vctrs       (0.6.5)       
• curl        (5.2.1)    • magick       (2.8.4)   • viridisLite (0.4.2)       
• data.table  (1.15.4)   • magrittr     (2.0.3)   • withr       (3.0.0)       
• digest      (0.6.36)   • markdown     (1.13)    • xfun        (0.46)        
• dplyr       (1.1.4)    • Matrix       (1.6-5)   • xgboost     (1.7.8.1)     
• evaluate    (0.24.0)   • memoise      (2.0.1)   • xml2        (1.3.6)       
• fansi       (1.0.6)    • mime         (0.12)    • yaml        (2.3.10)      
• farver      (2.1.2)    • munsell      (0.5.1)   • codetools   (0.2-20)      
• fastmap     (1.2.0)    • openssl      (2.2.0)   • compiler    (4.3.1)       
• fastrmodels (1.0.2)    • parallelly   (1.38.0)  • graphics    (4.3.1)       
• fontawesome (0.5.2)    • pillar       (1.9.0)   • grDevices   (4.3.1)       
• fs          (1.6.4)    • pkgconfig    (2.0.3)   • grid        (4.3.1)       
• furrr       (0.3.1)    • progressr    (0.14.0)  • lattice     (0.22-6)      
• future      (1.34.0)   • purrr        (1.0.2)   • MASS        (7.3-60.0.1)  
• generics    (0.1.3)    • R6           (2.5.1)   • Matrix      (1.6-5)       
• ggpath      (1.0.1)    • rappdirs     (0.3.3)   • methods     (4.3.1)       
• ggplot2     (3.5.1)    • RColorBrewer (1.1-3)   • mgcv        (1.9-1)       
• globals     (0.16.3)   • Rcpp         (1.0.13)  • nlme        (3.1-165)     
• glue        (1.7.0)    • reactable    (0.4.4)   • parallel    (4.3.1)       
• gt          (0.11.0)   • reactR       (0.6.0)   • splines     (4.3.1)       
• gtable      (0.3.5)    • rlang        (1.1.4)   • stats       (4.3.1)       
• highr       (0.11)     • rmarkdown    (2.27)    • tools       (4.3.1)       
• hms         (1.1.3)    • sass         (0.4.9)   • utils       (4.3.1)       
• htmltools   (0.5.8.1)  • scales       (1.3.0)     
• htmlwidgets (1.6.4)    • snakecase    (0.11.1)    
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR ()
• nflverse ()
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
