Closed TheMathNinja closed 4 years ago
This should be easy to fix: we just need to add more columns to this function here. Are these the extra columns that need changing? (We can't change game_id because that would mess things up.)
Ah yes, thanks for helping me find the function. Here are the columns that need adding (I hope I caught them all):
Some of these aren't just a team abbreviation by itself, right? E.g. the yard line columns.
That’s correct. Only the 4 play and drive yard line stats are team abbreviation + number. The rest are just team name.
Thank you! This has been fixed aside from the yard line ones.
If anyone wants to help on this issue, here's an illustration of what's left to do:
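library(dplyr)
# Load the 2015 play-by-play data and keep only the Chargers' games
games <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds')) %>%
  filter(home_team == 'LAC' | away_team == 'LAC')
# Show the team columns next to the four yard line columns
games %>%
  select(posteam, defteam, yrdln, end_yard_line, drive_start_yard_line, drive_end_yard_line)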
# A tibble: 2,948 x 6
   posteam defteam yrdln  end_yard_line drive_start_yard_line drive_end_yard_line
   <chr>   <chr>   <chr>  <chr>         <chr>                 <chr>
 1 NA      NA      CLE 10 NA            NA                    NA
 2 DET     LAC     SD 35  DET 25        DET 20                SD 24
 3 DET     LAC     DET 20 DET 21        DET 20                SD 24
 4 DET     LAC     DET 21 DET 32        DET 20                SD 24
 5 DET     LAC     DET 32 DET 39        DET 20                SD 24
 6 DET     LAC     DET 39 DET 39        DET 20                SD 24
 7 DET     LAC     DET 39 SD 33         DET 20                SD 24
 8 DET     LAC     SD 33  SD 33         DET 20                SD 24
 9 DET     LAC     SD 33  SD 24         DET 20                SD 24
10 DET     LAC     SD 24  NA            DET 20                SD 24
# ... with 2,938 more rows
This function is what we use to standardize all the team abbreviations when they appear alone (e.g. "LAC"), but the 4 yard line columns above are more complicated, so we'd need a new function that splits the abbreviation from the yard line, changes the abbreviation (probably using the function we already made), and then puts the changed abbreviation back together with the yard line.
You could try something like:
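# Replace legacy team abbreviations wherever they occur in a string, so it
# handles both lone abbreviations ("SD") and yard line values ("SD 35").
team_name_fn2 <- function(var) {
  stringi::stri_replace_all_fixed(
    var,
    pattern = c("JAC", "STL", "SL", "ARZ", "BLT", "CLV", "HST", "SD", "OAK"),
    replacement = c("JAX", "LA", "LA", "ARI", "BAL", "CLE", "HOU", "LAC", "LV"),
    vectorize_all = FALSE  # apply every pattern to every element in turn
  )
}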
games %>%
  dplyr::mutate_at(dplyr::vars("posteam", "defteam", "yrdln", "end_yard_line", "drive_start_yard_line", "drive_end_yard_line"), team_name_fn2) %>%
  select(posteam, defteam, yrdln, end_yard_line, drive_start_yard_line, drive_end_yard_line)
# A tibble: 2,948 x 6
   posteam defteam yrdln  end_yard_line drive_start_yard_line drive_end_yard_line
   <chr>   <chr>   <chr>  <chr>         <chr>                 <chr>
 1 NA      NA      CLE 10 NA            NA                    NA
 2 DET     LAC     LAC 35 DET 25        DET 20                LAC 24
 3 DET     LAC     DET 20 DET 21        DET 20                LAC 24
 4 DET     LAC     DET 21 DET 32        DET 20                LAC 24
 5 DET     LAC     DET 32 DET 39        DET 20                LAC 24
 6 DET     LAC     DET 39 DET 39        DET 20                LAC 24
 7 DET     LAC     DET 39 LAC 33        DET 20                LAC 24
 8 DET     LAC     LAC 33 LAC 33        DET 20                LAC 24
 9 DET     LAC     LAC 33 LAC 24        DET 20                LAC 24
10 DET     LAC     LAC 24 NA            DET 20                LAC 24
# … with 2,938 more rows
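The fixed-string replacement above swaps an abbreviation wherever it occurs in a string. For comparison, the split, change, and recombine approach described earlier in the thread might look roughly like the sketch below. It uses base R only; `team_map`, `fix_abbr`, and `fix_yard_line` are hypothetical names made up for illustration (not nflfastR functions), and the sketch assumes yard line values look like "SD 35".

```r
# Hypothetical old -> new lookup, using the abbreviations listed in this thread.
team_map <- c(JAC = "JAX", STL = "LA", SL = "LA", ARZ = "ARI", BLT = "BAL",
              CLV = "CLE", HST = "HOU", SD = "LAC", OAK = "LV")

# Hypothetical helper: standardize lone abbreviations, leaving others untouched.
fix_abbr <- function(abbr) {
  hit <- abbr %in% names(team_map)
  abbr[hit] <- unname(team_map[abbr[hit]])
  abbr
}

# Split "TEAM NN" into its parts, fix the team part, and paste it back.
# Values that do not match the assumed "TEAM NN" shape (NA, "50") pass through.
fix_yard_line <- function(x) {
  has_team <- !is.na(x) & grepl("^[A-Z]+ [0-9]+$", x)
  team <- sub(" .*$", "", x[has_team])       # text before the space
  yard <- sub("^[A-Z]+ ", "", x[has_team])   # text after the space
  x[has_team] <- paste(fix_abbr(team), yard)
  x
}

fix_yard_line(c("SD 35", "DET 25", NA, "50"))
```

Because it only rewrites the token before the space, this variant cannot change an abbreviation that happens to appear inside a longer string, at the cost of being more code than the one-step replacement.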
Thank you @awgymer! I've put this into the next release and credited you in the notes above the function.
I'm noticing the following naming inconsistencies in this dataset:
The away_team and posteam variables use "LV" for Raiders, but game_id, side_of_field, yrdln, solo_tackle_1_team, fumbled_1_team, end_yard_line, and the drive start and drive end yard lines all use OAK.
Looking at 2016_04_NO_SD: home_team and posteam are LAC, but game_id, side_of_field, yrdln, fumbled_1_team, fumble_recovery_1_team, and the drive start and drive end yard lines are all SD.
Looking at 2015_01_SEA_STL, we see the same issue at work: away_team and posteam are LA but everything else is STL.
2015_13_JAC_TEN has the same issue: away_team (and in other games home_team) and posteam are JAX but everything else is JAC.
I picked out isolated games, but these are systemic issues.
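One way to see how widespread these leftovers are would be to scan every character column for the legacy abbreviations. The following is a rough base R sketch, not part of nflfastR: `find_legacy_cols` is a hypothetical helper, and `pbp` is a one-row toy data frame invented to mirror the 2016_04_NO_SD example rather than real play-by-play data. game_id is skipped because, as noted earlier in the thread, it can't be changed.

```r
# Legacy abbreviations taken from the replacement list in this thread.
legacy <- c("JAC", "STL", "SL", "ARZ", "BLT", "CLV", "HST", "SD", "OAK")
# Match a legacy abbreviation only as a whole word, e.g. "SD" or "SD 35".
legacy_pattern <- paste0("\\b(", paste(legacy, collapse = "|"), ")\\b")

# Toy stand-in for a play-by-play data frame (made-up row for illustration).
pbp <- data.frame(
  game_id = "2016_04_NO_SD",
  home_team = "LAC",
  side_of_field = "SD",
  yrdln = "SD 35",
  fumbled_1_team = NA_character_,
  stringsAsFactors = FALSE
)

# Return the names of character columns containing any legacy abbreviation.
find_legacy_cols <- function(df, skip = "game_id") {
  is_chr <- vapply(df, is.character, logical(1))
  cols <- setdiff(names(df)[is_chr], skip)
  hits <- vapply(df[cols],
                 function(col) any(grepl(legacy_pattern, col, perl = TRUE)),
                 logical(1))
  names(hits)[hits]
}

find_legacy_cols(pbp)
```

Run against a full season, the returned names would give a candidate list of columns that still need standardizing.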