nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

[BUG] nflfastR::calculate_player_stats returns duplicate rows for defense and kicking #476

Open isaactpetersen opened 3 months ago

isaactpetersen commented 3 months ago

Is there an existing issue for this?

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

nflreadr 1.4.1

Describe the bug

The player stats data (from the load_player_stats() function) contain duplicated player_id-season-week combinations. I cannot think of a reason why the same player would have multiple rows for a given season and week. If, as I suspect, this should not be possible, then it is a data issue to fix. If I'm incorrect and the same player can plausibly have multiple rows for a given season-week combination, then it would be helpful to know the circumstances under which this arises. This matters when merging with other datasets, because I need to be sure I am matching information to the correct player_id-season-week combination.

Reprex

library("nflreadr")
library("dplyr")
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Load Data
offenseStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "offense")

defenseStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "defense")

kickingStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "kicking")

# Rearrange variables
offenseStats_weekly <- offenseStats_weekly %>% 
  select(player_id, season, week, everything())

defenseStats_weekly <- defenseStats_weekly %>% 
  select(player_id, season, week, everything())

kickingStats_weekly <- kickingStats_weekly %>% 
  select(player_id, season, week, everything())

# Offense: No duplicate id-season-week combinations
offenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 0 × 53
#> # Groups:   player_id, season, week [0]
#> # ℹ 53 variables: player_id <chr>, season <int>, week <int>, player_name <chr>,
#> #   player_display_name <chr>, position <chr>, position_group <chr>,
#> #   headshot_url <chr>, recent_team <chr>, season_type <chr>,
#> #   opponent_team <chr>, completions <int>, attempts <int>,
#> #   passing_yards <dbl>, passing_tds <int>, interceptions <dbl>, sacks <dbl>,
#> #   sack_yards <dbl>, sack_fumbles <int>, sack_fumbles_lost <int>,
#> #   passing_air_yards <dbl>, passing_yards_after_catch <dbl>, …

# Defense
defenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 496 × 32
#> # Groups:   player_id, season, week [183]
#>    player_id season  week season_type player_name player_display_name position
#>    <chr>      <int> <int> <chr>       <chr>       <chr>               <chr>   
#>  1 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  2 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  3 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  4 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  5 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  6 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  7 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  8 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  9 0           1999     2 REG         <NA>        <NA>                <NA>    
#> 10 0           1999     2 REG         <NA>        <NA>                <NA>    
#> # ℹ 486 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> #   def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> #   def_tackle_assists <int>, def_tackles_for_loss <int>,
#> #   def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> #   def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> #   def_interceptions <dbl>, def_interception_yards <dbl>, …

defenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1, player_id != "0") # not sure why there are player_ids of "0"; exclude them
#> # A tibble: 296 × 32
#> # Groups:   player_id, season, week [148]
#>    player_id  season  week season_type player_name player_display_name position
#>    <chr>       <int> <int> <chr>       <chr>       <chr>               <chr>   
#>  1 00-0002919   1999     4 REG         <NA>        Corey Chavous       SS      
#>  2 00-0002919   1999     4 REG         <NA>        Corey Chavous       SS      
#>  3 00-0004543   1999    12 REG         <NA>        Shane Dronett       DT      
#>  4 00-0004543   1999    12 REG         <NA>        Shane Dronett       DT      
#>  5 00-0004915   1999    16 REG         <NA>        Bobby Engram        WR      
#>  6 00-0004915   1999    16 REG         <NA>        Bobby Engram        WR      
#>  7 00-0010668   1999    20 POST        <NA>        Keenan McCardell    WR      
#>  8 00-0010668   1999    20 POST        <NA>        Keenan McCardell    WR      
#>  9 00-0011392   1999    14 REG         <NA>        Basil Mitchell      RB      
#> 10 00-0011392   1999    14 REG         <NA>        Basil Mitchell      RB      
#> # ℹ 286 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> #   def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> #   def_tackle_assists <int>, def_tackles_for_loss <int>,
#> #   def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> #   def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> #   def_interceptions <dbl>, def_interception_yards <dbl>, …

# Kicking

kickingStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 4 × 44
#> # Groups:   player_id, season, week [2]
#>   player_id  season  week season_type team  player_name player_display_name
#>   <chr>       <int> <int> <chr>       <chr> <chr>       <chr>              
#> 1 00-0004811   2000    11 REG         DEN   <NA>        Jason Elam         
#> 2 00-0004811   2000    11 REG         LV    <NA>        Jason Elam         
#> 3 00-0012875   2002     4 REG         PIT   <NA>        Todd Peterson      
#> 4 00-0012875   2002     4 REG         PIT   <NA>        Todd Peterson      
#> # ℹ 37 more variables: position <chr>, position_group <chr>,
#> #   headshot_url <chr>, fg_made <int>, fg_att <dbl>, fg_missed <int>,
#> #   fg_blocked <int>, fg_long <dbl>, fg_pct <dbl>, fg_made_0_19 <int>,
#> #   fg_made_20_29 <int>, fg_made_30_39 <int>, fg_made_40_49 <int>,
#> #   fg_made_50_59 <int>, fg_made_60_ <int>, fg_missed_0_19 <int>,
#> #   fg_missed_20_29 <int>, fg_missed_30_39 <int>, fg_missed_40_49 <int>,
#> #   fg_missed_50_59 <int>, fg_missed_60_ <int>, fg_made_list <chr>, …

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4    nflreadr_1.4.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4      
#>  [5] xfun_0.46         generics_0.1.3    data.table_1.15.4 glue_1.7.0       
#>  [9] htmltools_0.5.8.1 fansi_1.0.6       rmarkdown_2.27    evaluate_0.24.0  
#> [13] tibble_3.2.1      fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
#> [17] memoise_2.0.1     compiler_4.3.1    fs_1.6.4          pkgconfig_2.0.3  
#> [21] rstudioapi_0.16.0 digest_0.6.36     R6_2.5.1          tidyselect_1.2.1 
#> [25] reprex_2.1.1      utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3   
#> [29] tools_4.3.1       withr_3.0.0       cachem_1.1.0

Created on 2024-07-31 with reprex v2.1.1

Expected Behavior

I expect each player (i.e., player_id) to have only one row for a given season-week combination.
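A minimal sketch of how this expectation could be checked before a merge. The helper find_duplicate_keys() is hypothetical (it is not part of nflreadr or nflfastR), and the toy data are invented for illustration only:

```r
library(dplyr)

# Hypothetical helper (not part of nflreadr or nflfastR): returns the key
# combinations that occur more than once, together with their row counts.
find_duplicate_keys <- function(data, ...) {
  data %>%
    count(..., name = "n_rows") %>%
    filter(n_rows > 1)
}

# Toy data, invented for illustration only
toy_stats <- data.frame(
  player_id = c("A", "A", "B"),
  season    = c(2023L, 2023L, 2023L),
  week      = c(1L, 1L, 1L),
  team      = c("X", "Y", "X")
)

# Player "A" appears twice for the same season and week
dupes <- find_duplicate_keys(toy_stats, player_id, season, week)
dupes
```

Running the same helper on the output of load_player_stats() and checking that it returns zero rows would confirm that player_id-season-week is safe to use as a merge key.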

nflverse_sitrep

> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.1 (2023-06-16 ucrt) • Running under: Windows 11 x64 (build 22631)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1   nfl4th     1.0.4 1.0.4 1.0.4.9002    dev
2 nflfastR     4.6.1 4.6.1 4.6.1.9010    dev
3 nflplotR     1.3.1 1.3.1      1.3.1       
4 nflreadr     1.4.1 1.4.1   1.4.1.00       
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• askpass     (1.2.0)    • httr         (1.4.7)   • stringi     (1.8.4)       
• backports   (1.5.0)    • isoband      (0.2.7)   • stringr     (1.5.1)       
• base64enc   (0.1-3)    • janitor      (2.2.0)   • sys         (3.4.2)       
• bigD        (0.2.0)    • jquerylib    (0.1.4)   • tibble      (3.2.1)       
• bitops      (1.0-8)    • jsonlite     (1.8.8)   • tidyr       (1.3.1)       
• bslib       (0.8.0)    • juicyjuice   (0.1.0)   • tidyselect  (1.2.1)       
• cachem      (1.1.0)    • knitr        (1.48)    • timechange  (0.3.0)       
• cli         (3.6.3)    • labeling     (0.4.3)   • tinytex     (0.52)        
• colorspace  (2.1-1)    • lifecycle    (1.0.4)   • utf8        (1.2.4)       
• commonmark  (1.9.1)    • listenv      (0.9.1)   • V8          (4.4.2)       
• cpp11       (0.4.7)    • lubridate    (1.9.3)   • vctrs       (0.6.5)       
• curl        (5.2.1)    • magick       (2.8.4)   • viridisLite (0.4.2)       
• data.table  (1.15.4)   • magrittr     (2.0.3)   • withr       (3.0.0)       
• digest      (0.6.36)   • markdown     (1.13)    • xfun        (0.46)        
• dplyr       (1.1.4)    • Matrix       (1.6-5)   • xgboost     (1.7.8.1)     
• evaluate    (0.24.0)   • memoise      (2.0.1)   • xml2        (1.3.6)       
• fansi       (1.0.6)    • mime         (0.12)    • yaml        (2.3.10)      
• farver      (2.1.2)    • munsell      (0.5.1)   • codetools   (0.2-20)      
• fastmap     (1.2.0)    • openssl      (2.2.0)   • compiler    (4.3.1)       
• fastrmodels (1.0.2)    • parallelly   (1.38.0)  • graphics    (4.3.1)       
• fontawesome (0.5.2)    • pillar       (1.9.0)   • grDevices   (4.3.1)       
• fs          (1.6.4)    • pkgconfig    (2.0.3)   • grid        (4.3.1)       
• furrr       (0.3.1)    • progressr    (0.14.0)  • lattice     (0.22-6)      
• future      (1.34.0)   • purrr        (1.0.2)   • MASS        (7.3-60.0.1)  
• generics    (0.1.3)    • R6           (2.5.1)   • Matrix      (1.6-5)       
• ggpath      (1.0.1)    • rappdirs     (0.3.3)   • methods     (4.3.1)       
• ggplot2     (3.5.1)    • RColorBrewer (1.1-3)   • mgcv        (1.9-1)       
• globals     (0.16.3)   • Rcpp         (1.0.13)  • nlme        (3.1-165)     
• glue        (1.7.0)    • reactable    (0.4.4)   • parallel    (4.3.1)       
• gt          (0.11.0)   • reactR       (0.6.0)   • splines     (4.3.1)       
• gtable      (0.3.5)    • rlang        (1.1.4)   • stats       (4.3.1)       
• highr       (0.11)     • rmarkdown    (2.27)    • tools       (4.3.1)       
• hms         (1.1.3)    • sass         (0.4.9)   • utils       (4.3.1)       
• htmltools   (0.5.8.1)  • scales       (1.3.0)     
• htmlwidgets (1.6.4)    • snakecase    (0.11.1)    
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR ()
• nflverse ()

tanho63 commented 3 months ago

Relocating to nflfastR repo

mrcaseb commented 3 months ago

Looking at the problematic defense data, it seems that players are sometimes attributed to the opposing team when they are credited with a fumble recovery or a penalty.

CORRECTION: I think we assign tackles after turnovers to the wrong team.

So the main cause might be that an offensive player records a defensive stat (such as a tackle) after the offense has turned the ball over.
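To illustrate the suspected mechanism, here is a rough sketch with invented toy data (not real NFL data and not the actual nflfastR aggregation code). It assumes that stats are summarised per player and team, so a player credited under two different teams in the same week ends up with two rows. The second step shows one possible user-side workaround, which assumes the split rows hold additive counts rather than true duplicates. That assumption may not hold everywhere: in the kicking output above, both Todd Peterson rows list PIT.

```r
library(dplyr)

# Toy play-level credits, invented for illustration only.
# Player "P1" belongs to team "AAA", but one tackle made after a turnover
# is (hypothetically) credited under the opponent "BBB".
credits <- data.frame(
  player_id   = c("P1", "P1", "P2"),
  season      = 2023L,
  week        = 1L,
  team        = c("AAA", "BBB", "BBB"),
  def_tackles = c(1L, 1L, 3L)
)

# Grouping by team as well as player yields two rows for P1
weekly <- credits %>%
  group_by(player_id, season, week, team) %>%
  summarise(def_tackles = sum(def_tackles), .groups = "drop")

# Possible workaround: collapse to one row per player-season-week.
# Keeping the first team is an arbitrary choice made for this sketch.
collapsed <- weekly %>%
  group_by(player_id, season, week) %>%
  summarise(
    team        = first(team),
    def_tackles = sum(def_tackles),
    .groups     = "drop"
  )

collapsed
```

Summing is only appropriate for count-like columns. Rates, percentages, and list columns (for example fg_pct or fg_made_list in the kicking data) would need to be recomputed or handled separately.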

mrcaseb commented 3 months ago

This might be quite hard to fix and we should probably invest the time in #470 instead

mrcaseb commented 3 weeks ago

We will deprecate the calculate_player_stats_*() functions in a future release. The new function calculate_stats() (https://github.com/nflverse/nflfastR/pull/470) will fix the issue.