nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

Missing Plays in NFLFastR #35

Open jacole3 opened 4 years ago

jacole3 commented 4 years ago

Via email communication with Ben Baldwin, I've learned that one of the plans for the next nflfastR update is to locate and fix as many of the missing plays as possible. Many, but not all, of the ones I've found are from the 1999 season. Here's a non-comprehensive list of what I've found so far; anyone who finds more is welcome to pitch in to help Ben and his colleagues out.

# Two plays are missing here shortly after onside kick recovery
View(pbp %>% filter(game_id == "1999_01_DAL_WAS", qtr == 4))

# One play missing here, after Batch is sacked
View(pbp %>% filter(game_id == "1999_16_DEN_DET", qtr == 4))

# Missing play here, after a McNown completion
View(pbp %>% filter(game_id == "1999_08_CHI_WAS", qtr == 4))

# Missing play here, after completion to Metcalf late in qtr
View(pbp %>% filter(game_id == "1999_13_STL_CAR", qtr == 4))

# Missing play here, after incompletion from Plummer to Moore
View(pbp %>% filter(game_id == "1999_16_ARI_ATL", qtr == 4))

# Missing play here, after incompletion to Bates
View(pbp %>% filter(game_id == "1999_16_CHI_STL", qtr == 4))

# Missing play here, after incompletion to Batten
View(pbp %>% filter(game_id == "1999_16_MIN_NYG", qtr == 4))

# Missing play here, after Walsh completion on 3rd down
View(pbp %>% filter(game_id == "1999_17_IND_BUF", qtr == 4))

# Two missing plays here, after late McNown incompletion
View(pbp %>% filter(game_id == "1999_17_TB_CHI", qtr == 4))

# Missing play after a Frerotte incompletion
View(pbp %>% filter(game_id == "1999_18_DET_WAS", qtr == 4))

# Missing play after a late McNown completion
View(pbp %>% filter(game_id == "2000_06_NO_CHI", qtr == 4))

# Missing play here, after a Kordell Stewart completion
View(pbp %>% filter(game_id == "1999_01_PIT_CLE", qtr == 1))

# Missing play here, after 19-yard Smith completion (8:30)
View(pbp %>% filter(game_id == "1999_04_STL_CIN", qtr == 3))

# Missing play here, shortly before 8:22 FG (right before Dillon run)
View(pbp %>% filter(game_id == "1999_05_CIN_CLE", qtr == 1))

# Missing play here, right after 11:25 DPI
View(pbp %>% filter(game_id == "1999_08_TB_DET", qtr == 4))

# Missing play here, right before Garrett delay of game
View(pbp %>% filter(game_id == "1999_11_DAL_ARI", qtr == 2))

# Missing play here, before the 4:27 DPI on Favre pass
View(pbp %>% filter(game_id == "1999_13_GB_CHI", qtr == 2))

# Missing play here after 23-yard completion to Rice (8:07)
View(pbp_Original %>% filter(game_id == "1999_13_SF_CIN", qtr == 4))

# Missing play here right before final play of 1st qtr
View(pbp %>% filter(game_id == "1999_14_ARI_WAS", qtr == 1))

# Missing play here, right before 2:00 Makovicka completion
View(pbp %>% filter(game_id == "1999_14_ARI_WAS", qtr == 4))

# Missing play after Flutie 4:17 pass to Thomas
View(pbp %>% filter(game_id == "1999_14_NYG_BUF", qtr == 1))

# Missing play right before 11:02 encroachment on Chargers
View(pbp %>% filter(game_id == "1999_14_SD_SEA", qtr == 1))

# Missing play right before 3:02 NE ineligible man downfield
View(pbp %>% filter(game_id == "1999_15_NE_PHI", qtr == 2))

# Missing play right after 11:54 Peyton to Edgerrin completion
View(pbp %>% filter(game_id == "1999_15_WAS_IND", qtr == 3))

# Missing play right after 13:40 Chandler-to-Mathis completion
View(pbp %>% filter(game_id == "1999_17_SF_ATL", qtr == 2))

# Missing play right before 6:22 Warren run
View(pbp %>% filter(game_id == "1999_18_DAL_MIN", qtr == 4))

# Missing play right after 6:20 Chandler-to-Mathis completion
View(pbp %>% filter(game_id == "2000_01_SF_ATL", qtr == 1))

# Missing play right after 9:19 Davis run
View(pbp %>% filter(game_id == "2000_03_DAL_WAS", qtr == 4))

# Missing play (a spike) right before final offensive play of half
View(pbp %>% filter(game_id == "2003_02_PIT_KC", qtr == 2))

# Missing play (a spike) right before Kaeding's miss
View(pbp %>% filter(game_id == "2006_19_NE_SD", qtr == 4))

guga31bb commented 4 years ago

Thanks!

I'm wondering how many of these are still present in the new version of nflfastR. I know the first two should be fixed but haven't taken a look at the others.

jacole3 commented 4 years ago

@guga31bb I've gone through my list, and I still see the following seven errors. Keep in mind that my original list of 30 wasn't comprehensive; it was just the ones that I had come across over the past few days.

# One play missing here, now the 2nd down right BEFORE Batch is sacked, 13:27
# This one is really ironic, because in version 2.1.0, the 4th down right after the sack was missing.
View(pbp %>% filter(game_id == "1999_16_DEN_DET", qtr == 4))

# Missing play here, shortly before 8:22 FG (right before Dillon run)
View(pbp %>% filter(game_id == "1999_05_CIN_CLE", qtr == 1))

# Missing play here, right after 11:25 DPI
View(pbp %>% filter(game_id == "1999_08_TB_DET", qtr == 4))

# Missing play right before 3:02 NE ineligible man downfield
View(pbp %>% filter(game_id == "1999_15_NE_PHI", qtr == 2))

# Missing play right after 13:40 Chandler-to-Mathis completion
View(pbp %>% filter(game_id == "1999_17_SF_ATL", qtr == 2))

# Missing play right before 6:22 Warren run
View(pbp %>% filter(game_id == "1999_18_DAL_MIN", qtr == 4))

# Missing play right after 9:19 Davis run
View(pbp %>% filter(game_id == "2000_03_DAL_WAS", qtr == 4))

guga31bb commented 4 years ago

Illustration of problem:

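# Pull the play-by-play for one of the affected games
id <- "1999_18_DAL_MIN"
g <- get_pbp_gc(id) %>%
  add_game_data()
# Look at the columns that identify each play
g %>%
  select(play_description, game_id, play_id, down, yards_to_go, quarter, time, drive) %>%
  View()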

image

This is really hard to fix: the missing play has the same description, ID, and time as the play next to it, so it is impossible to distinguish from a duplicate play (and we need to drop duplicates to fix other games). The good news is that this is only a problem in older seasons, but it is annoying.
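To make the ambiguity concrete, here is a minimal sketch in base R on a toy table. Every value in it is invented for illustration, and the de-duplication shown is an assumed stand-in, not necessarily how nflfastR drops duplicates internally.

```r
# Toy play-by-play rows; all values are invented for illustration.
plays <- data.frame(
  play_id = c(100, 120, 120, 140),
  time = c("6:30", "6:22", "6:22", "6:15"),
  play_description = c(
    "QB pass complete to WR for 8 yards.",
    "QB pass incomplete to WR.",
    "QB pass incomplete to WR.",
    "QB pass incomplete to TE."
  ),
  stringsAsFactors = FALSE
)

# Drop every row that repeats an earlier row on all three key columns.
# (Assumed stand-in for the real de-duplication step.)
key <- plays[, c("play_id", "time", "play_description")]
deduped <- plays[!duplicated(key), ]

nrow(plays)    # 4 rows before de-duplication
nrow(deduped)  # 3 rows after: one of the two 6:22 rows is gone
```

Nothing in these three columns says whether the two 6:22 rows were one play scraped twice or two real incompletions in a row, so any rule that removes the first case also removes the second.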

jacole3 commented 4 years ago

@guga31bb Interesting, that makes sense. I definitely noticed that the majority of omitted plays were incompletions to a receiver who already had an incompletion thrown to him at some other point in the same drive. So I guess that means there's no way to fix it besides going through and adding all the missing ones by hand.

If I find any more besides the seven mentioned in my last comment, I'll share them here. Hopefully we can detect most, or all, of them by the time Version 2.1.2 drops.
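One possible way to shortlist candidates before checking them by hand is to look for jumps in the down sequence within a drive. The sketch below is only a heuristic on a toy table: the helper `flag_down_gaps` is hypothetical, the column names `game_id`, `drive`, `play_id`, and `down` are assumed to match the pbp data, and real games would produce false positives (penalties, for example) and miss dropped plays that leave the downs looking consistent.

```r
# Toy data: drive 1 goes 1st, 2nd, 4th down (3rd down is absent); drive 2 is complete.
pbp_toy <- data.frame(
  game_id = rep("toy_game", 6),
  drive = c(1, 1, 1, 2, 2, 2),
  play_id = c(10, 20, 30, 40, 50, 60),
  down = c(1, 2, 4, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return rows whose down is more than one higher than
# the previous row's down within the same game and drive.
flag_down_gaps <- function(df) {
  if (nrow(df) < 2) return(df[0, ])
  df <- df[order(df$game_id, df$drive, df$play_id), ]
  n <- nrow(df)
  prev_down <- c(NA, df$down[-n])
  same_drive <- c(
    FALSE,
    df$game_id[-1] == df$game_id[-n] & df$drive[-1] == df$drive[-n]
  )
  jump <- same_drive & !is.na(df$down) & !is.na(prev_down) &
    (df$down - prev_down) > 1
  df[jump, ]
}

suspects <- flag_down_gaps(pbp_toy)
suspects$play_id  # the play that follows the gap
```

Each flagged row marks the play right after a possible gap, which still has to be confirmed against the gamebook.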

CroppedClamp commented 4 years ago

I'm wondering how these match up with the errata linked here: https://github.com/CroppedClamp/nflscrapR-data/tree/master/errata. There are some cases where plays are listed out of order, and some where the stats are wrong. I think there are few enough that they could be corrected by hand, though.

One way to further verify these is to run the stats that are aggregated from the PBPs here against the official NFL stats, which I have done a couple of times. I have noticed similar small issues, maybe 3 or 4 plays off for the whole season, and definitely some that are out of order. In some cases, these are even incorrect in the official NFL GSIS feed, the XML feed, and also the gamebooks (this seems to happen a lot on kickoff return fumbles). I don't see a great way of reconciling these computationally other than to find the error programmatically and then fix it by hand, since the sources themselves disagree with one another.
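As a rough sketch of that kind of cross-check, the base R snippet below totals pass attempts per passer from a toy play-by-play table and compares them with a toy "official" table. All names and numbers are invented, the `official` table is a hypothetical stand-in for whatever official source is used, and the column names `passer` and `pass_attempt` are assumptions about the pbp data.

```r
# Toy play-by-play: one row per pass attempt (invented values).
pbp_toy <- data.frame(
  passer = c("A.Quarterback", "A.Quarterback", "A.Quarterback",
             "B.Passer", "B.Passer"),
  pass_attempt = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical official totals for the same players (invented values).
official <- data.frame(
  passer = c("A.Quarterback", "B.Passer"),
  official_attempts = c(4, 2),
  stringsAsFactors = FALSE
)

# Aggregate the play-by-play and line it up against the official totals.
from_pbp <- aggregate(pass_attempt ~ passer, data = pbp_toy, FUN = sum)
names(from_pbp)[2] <- "pbp_attempts"
cmp <- merge(official, from_pbp, by = "passer", all.x = TRUE)
cmp$pbp_attempts[is.na(cmp$pbp_attempts)] <- 0
cmp$diff <- cmp$official_attempts - cmp$pbp_attempts

# Players whose totals disagree are where to look for missing or extra plays.
mismatches <- cmp[cmp$diff != 0, ]
mismatches
```

A nonzero `diff` only says that something is off for that player; finding the specific play and deciding which source is right is still manual work.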

Let me know if I can help in this area as well.

jacole3 commented 4 years ago

@CroppedClamp Good thought to look back at the nflscrapR errors, there's gotta be some overlap with the 2009-and-later seasons in terms of these types of mistakes. At first glance, I found some errors common to both in the 2011 Lions-Saints game, involving some plays being out of order in the late 2nd and early 3rd quarters. So it looks like that link could help us find further nflfastR errors.

Unfortunately, even with that knowledge, it seems like fixing whatever errors we find by hand is the only way to go. Though I'm sure Ben and Sebastian have better insight on that. In any case, we can help them out by posting whatever missing plays we do find in this thread.