sportsdataverse / cfbfastR

An R package to quickly obtain clean and tidy college football play by play data
https://cfbfastR.sportsdataverse.org

Incorrect values recorded for drive points #76

Closed john-b-edwards closed 2 years ago

john-b-edwards commented 2 years ago

cfbfastR::cfbd_pbp_data() reports erroneous values for the number of points scored on a drive for some plays. For example:

cfbfastR::cfbd_pbp_data(year = 2021, season_type = "regular", week = 3) |>
  dplyr::filter(play_type == "Defensive 2pt Conversion") |>
  dplyr::pull(drive_pts)
#> [1] 7

I have a data frame of all plays loaded in my local environment, and the only values that appear for drive_pts are -7, -2, 0, 3, and 7. That set cannot be complete given 1) missed extra points, which result in -6 or 6, 2) successful and failed two-point conversions, which result in -8, -6, 6, or 8, and 3) defensive two-point and one-point conversions, which result in -5, -4, 4, or 5. Some of these outcomes are too rare to have occurred in the dataset, but others, like 8, should be present in the data frame and are not.
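To make the arithmetic concrete, here is a rough sketch that enumerates the touchdown-drive totals described above and compares them with the values I observed. The names `try_outcomes`, `expected`, and `observed` are just illustrative, and the sketch only covers touchdown drives (it ignores field goals, safeties, and scoreless drives).

```r
# Points for a touchdown before the try (PAT or conversion attempt).
td <- 6

# Net change for the scoring team from the try that follows the touchdown.
# These labels are illustrative, not cfbfastR play types.
try_outcomes <- c(
  missed_or_failed    = 0,
  extra_point_good    = 1,
  two_point_good      = 2,
  defensive_one_point = -1,
  defensive_two_point = -2
)

# Totals when the offense scores the touchdown, mirrored for the defense.
offense_td_values <- td + try_outcomes
expected <- sort(unique(c(offense_td_values, -offense_td_values)))

# Values reported in this issue as the only ones present in drive_pts.
observed <- c(-7, -2, 0, 3, 7)

# Touchdown-drive totals that never show up in the observed values.
missing_values <- setdiff(expected, observed)
print(missing_values)
```

Everything except -7 and 7 ends up in `missing_values`, which is the gap described above.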

akeaswaran commented 2 years ago

For anyone that picks this up, I believe drive_pts is coming from the new_drive_pts column here: https://github.com/sportsdataverse/cfbfastR/blob/4c02a4a4dbc73bf9cf4a29493a234a6fc93593f4/R/cfbd_pbp_data.R#L1407-L1445

You'd have to grab the result of the PAT play via a lead or via the play_text and adjust those values accordingly.
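Something along these lines might work for the lead approach. This is only a sketch on toy data, not the actual cfbfastR code: the `plays` data frame, the `pat_adjustment()` helper, and every play_type label other than "Defensive 2pt Conversion" are made up for illustration, and it assumes the PAT shows up as its own row right after the touchdown.

```r
library(dplyr)

# Toy plays: one touchdown per drive, followed by the try as a separate row.
# Column names and most play_type labels are assumptions for this sketch.
plays <- data.frame(
  drive_id = c(1, 1, 2, 2, 3, 3),
  play_type = c(
    "Rushing Touchdown", "Extra Point Good",
    "Passing Touchdown", "Extra Point Missed",
    "Passing Touchdown", "Defensive 2pt Conversion"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: net points to add to the 6 for the touchdown,
# based on the play that follows it. Unknown or missing plays add 0.
pat_adjustment <- function(next_play_type) {
  dplyr::case_when(
    next_play_type == "Extra Point Good" ~ 1,
    next_play_type == "Two Point Good" ~ 2,
    next_play_type == "Defensive 2pt Conversion" ~ -2,
    TRUE ~ 0
  )
}

adjusted <- plays |>
  dplyr::group_by(drive_id) |>
  # Look one play ahead within the same drive to find the try result.
  dplyr::mutate(next_play_type = dplyr::lead(play_type)) |>
  dplyr::ungroup() |>
  dplyr::filter(grepl("Touchdown", play_type, fixed = TRUE)) |>
  dplyr::mutate(adjusted_drive_pts = 6 + pat_adjustment(next_play_type)) |>
  dplyr::select(drive_id, adjusted_drive_pts)

print(adjusted)
```

On the toy data this gives 7, 6, and 4 for the three drives. Parsing play_text instead would need a different helper, and the real data may record the PAT inside the touchdown play rather than as its own row.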

Kazink36 commented 2 years ago

Looks like drive_pts is actually coming from here, based on the drives endpoint from cfbd: https://github.com/sportsdataverse/cfbfastR/blob/4c02a4a4dbc73bf9cf4a29493a234a6fc93593f4/R/cfbd_pbp_data.R#L2041-L2059

The drives endpoint also includes the starting and ending scores for both offense and defense, which could (and probably should?) be used to generate the drive points instead. I'm just thinking about what makes sense when both teams score points on the same drive (like the defensive conversion), and I think your suggestion of -5, -4, 4, and 5 makes sense as a column showing the change in score differential.
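As a rough sketch of that idea, drive points could be computed as the change in score differential from the start to the end of the drive. The `drives` data frame below is toy data, and the column names (`start_offense_score`, `end_offense_score`, `start_defense_score`, `end_defense_score`) are assumptions about what the drives endpoint returns, so they may need renaming.

```r
library(dplyr)

# Toy drives, each row independent of the others.
# Column names are assumed, not confirmed against the drives endpoint.
drives <- data.frame(
  drive_id            = 1:4,
  start_offense_score = c(0, 7, 7, 14),
  end_offense_score   = c(7, 15, 13, 20),
  start_defense_score = c(0, 7, 7, 15),
  end_defense_score   = c(0, 7, 7, 17)
)

drive_points <- drives |>
  dplyr::mutate(
    offense_pts = end_offense_score - start_offense_score,
    defense_pts = end_defense_score - start_defense_score,
    # Change in score differential from the offense's point of view.
    drive_pts = offense_pts - defense_pts
  )

print(drive_points[, c("drive_id", "offense_pts", "defense_pts", "drive_pts")])
```

The toy drives come out as 7 (touchdown plus extra point), 8 (touchdown plus two-point conversion), 6 (touchdown with a missed extra point), and 4 (touchdown followed by a defensive two-point conversion), which lines up with the values suggested in the original report.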