nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

More Team Name Issues #29

Closed TheMathNinja closed 4 years ago

TheMathNinja commented 4 years ago

I'm noticing the following naming inconsistencies in this dataset:

The away_team and posteam variables use "LV" for the Raiders, but game_id, side_of_field, yrdln, solo_tackle_1_team, fumbled_1_team, and end_yard_line all use OAK. The drive_start and drive_end yard lines are also OAK.

Looking at 2016_04_NO_SD: game_id is SD, home_team and posteam are LAC, side_of_field is SD, yrdln is SD, fumbled_1_team is SD, fumble_recovery_1_team is SD, and drive_start and drive_end are SD.

Looking at 2015_01_SEA_STL we see the same issue at work. "away_team" and posteam are LA but everything else is STL.

2015_13_JAC_TEN has the same issue. away_team (and in other games home_team) and posteam are JAX but everything else is JAC.

I picked out isolated games but these are systemic issues.
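One way to check how widespread the mismatch is would be to count, per column, the rows that still carry an old abbreviation. The sketch below is only illustrative: the data frame is invented toy data mimicking the games above, and `legacy` lists just the abbreviations named in this issue, not a complete set.

```r
# Toy data only: invented rows that mimic the mismatches described above.
pbp <- data.frame(
  game_id       = c("2016_04_NO_SD", "2015_13_JAC_TEN"),
  posteam       = c("LAC", "JAX"),
  side_of_field = c("SD", "JAC"),
  yrdln         = c("SD 35", "JAC 20"),
  stringsAsFactors = FALSE
)

# Old abbreviations named in this issue (assumed list, not exhaustive).
legacy <- c("OAK", "SD", "STL", "JAC")

# For each column, take the text before the first space (the team part of
# values such as "SD 35") and count how many rows match an old abbreviation.
legacy_counts <- vapply(
  pbp,
  function(col) sum(sub(" .*$", "", col) %in% legacy),
  integer(1)
)

# Show only the columns that still contain old abbreviations.
legacy_counts[legacy_counts > 0]
```

Because game_id values never equal a bare abbreviation, that column is not flagged here, which matches the point that game_id has to stay as it is.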

guga31bb commented 4 years ago

This should be easy to fix: we just need to add more columns to this function here. Are these the extra columns that need changing? (We can't change game_id because that would mess things up.)

TheMathNinja commented 4 years ago

Ah yes, thanks for helping me find the function. Here are the columns that need adding (I hope I caught them all):

guga31bb commented 4 years ago

Some of these aren't just a team abbreviation by itself, right? E.g., the yard line columns.

TheMathNinja commented 4 years ago

That’s correct. Only the 4 play and drive yard line stats are team abbreviation + number. The rest are just team name.

guga31bb commented 4 years ago

Thank you! This has been fixed aside from the yard line ones.

guga31bb commented 4 years ago

If anyone wants to help on this issue, here's an illustration of what's left to do:

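library(dplyr)

# Load the 2015 play-by-play data and keep only the Chargers' games
games <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds')) %>%
  filter(home_team == 'LAC' | away_team == 'LAC')

# Compare the standalone team columns with the four yard line columns
games %>%
  select(posteam, defteam, yrdln, end_yard_line, drive_start_yard_line, drive_end_yard_line)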

# A tibble: 2,948 x 6
   posteam defteam yrdln  end_yard_line drive_start_yard_line drive_end_yard_line
   <chr>   <chr>   <chr>  <chr>         <chr>                 <chr>              
 1 NA      NA      CLE 10 NA            NA                    NA                 
 2 DET     LAC     SD 35  DET 25        DET 20                SD 24              
 3 DET     LAC     DET 20 DET 21        DET 20                SD 24              
 4 DET     LAC     DET 21 DET 32        DET 20                SD 24              
 5 DET     LAC     DET 32 DET 39        DET 20                SD 24              
 6 DET     LAC     DET 39 DET 39        DET 20                SD 24              
 7 DET     LAC     DET 39 SD 33         DET 20                SD 24              
 8 DET     LAC     SD 33  SD 33         DET 20                SD 24              
 9 DET     LAC     SD 33  SD 24         DET 20                SD 24              
10 DET     LAC     SD 24  NA            DET 20                SD 24              
# ... with 2,938 more rows

This function is what we use to standardize team abbreviations when they appear alone (e.g. "LAC"). The 4 yard line columns above are more complicated because each value combines an abbreviation with a number (e.g. "SD 35"). For those we'd need a new function that splits the abbreviation from the yard line, changes the abbreviation (probably using the function we already made), and then puts the changed abbreviation back together with the yard line.
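The split, fix, and rejoin idea could be sketched roughly as follows. This is not the package's code: `abbr_map`, `fix_abbr`, and `fix_yard_line` are hypothetical names invented for illustration, the mapping covers only a few of the abbreviations mentioned in this thread, and the sketch assumes yard line values look like "SD 35" (anything else is left untouched).

```r
# Hypothetical mapping from old to current abbreviations (illustration only).
abbr_map <- c(JAC = "JAX", STL = "LA", SD = "LAC", OAK = "LV")

# Hypothetical helper: standardize a bare abbreviation, leave others untouched.
fix_abbr <- function(x) {
  hit <- !is.na(x) & x %in% names(abbr_map)
  x[hit] <- unname(abbr_map[x[hit]])
  x
}

# Hypothetical helper: split "SD 35" into team and number, fix the team,
# then paste the two parts back together.
fix_yard_line <- function(x) {
  has_team <- !is.na(x) & grepl("^[A-Z]+ [0-9]+$", x)
  team <- sub(" .*$", "", x[has_team])      # part before the space
  yard <- sub("^[A-Z]+ ", "", x[has_team])  # part after the space
  x[has_team] <- paste(fix_abbr(team), yard)
  x
}

fixed <- fix_yard_line(c("SD 35", "DET 20", NA, "50"))
```

With this input, `fixed` holds "LAC 35", "DET 20", NA, and "50": only the value with an old abbreviation changes, and missing values or values without a team prefix pass through as they were.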

awgymer commented 4 years ago

You could try something like:

# With vectorize_all = FALSE, every old -> new pair is applied in turn
# to each element of var, so "SD 35" becomes "LAC 35"
team_name_fn2 <- function(var) {
  stringi::stri_replace_all_fixed(
    var,
    pattern = c("JAC", "STL", "SL", "ARZ", "BLT", "CLV", "HST", "SD", "OAK"),
    replacement = c("JAX", "LA", "LA", "ARI", "BAL", "CLE", "HOU", "LAC", "LV"),
    vectorize_all = FALSE
  )
}

games %>%
  dplyr::mutate_at(
    dplyr::vars("posteam", "defteam", "yrdln", "end_yard_line", "drive_start_yard_line", "drive_end_yard_line"),
    team_name_fn2
  ) %>%
  dplyr::select(posteam, defteam, yrdln, end_yard_line, drive_start_yard_line, drive_end_yard_line)

# A tibble: 2,948 x 6
   posteam defteam yrdln  end_yard_line drive_start_yard_line drive_end_yard_line
   <chr>   <chr>   <chr>  <chr>         <chr>                 <chr>              
 1 NA      NA      CLE 10 NA            NA                    NA                 
 2 DET     LAC     LAC 35 DET 25        DET 20                LAC 24             
 3 DET     LAC     DET 20 DET 21        DET 20                LAC 24             
 4 DET     LAC     DET 21 DET 32        DET 20                LAC 24             
 5 DET     LAC     DET 32 DET 39        DET 20                LAC 24             
 6 DET     LAC     DET 39 DET 39        DET 20                LAC 24             
 7 DET     LAC     DET 39 LAC 33        DET 20                LAC 24             
 8 DET     LAC     LAC 33 LAC 33        DET 20                LAC 24             
 9 DET     LAC     LAC 33 LAC 24        DET 20                LAC 24             
10 DET     LAC     LAC 24 NA            DET 20                LAC 24             
# … with 2,938 more rows

guga31bb commented 4 years ago

Thank you @awgymer! I've put this into the next release and credited you in the notes above the function.