nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

calculate_win_probability potential issue #464

Closed marcusSasser closed 5 months ago

marcusSasser commented 5 months ago

Is there an existing issue for this?

Have you installed the latest development version of the package(s) in question?

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

4.6.1

Describe the bug

I've been using the calculate_win_probability function with my own play-by-play data imported from Excel (I'm not loading NFL games through the nflfastR package), and there seems to be an error in the win probability output. When I, for instance, give the home team the ball to start the game and have them go four and out, the win probabilities it returns are as follows: 0.5462618, 0.5280299, 0.4983190, 0.4617070

The exact numbers are not necessarily important, but the downward trend is. That makes sense in this instance, since the win probability it gives you appears to be the home team's probability to win, and with each play that gains 0 yards, you'd expect a downtick in % chance to win. When I give the away team the ball with the same clock, distance, timeouts, etc., it gives me this: 0.5497268, 0.5342795, 0.5104250, 0.4668941

Again, the exact numbers in each column would change the exact output I get, but what's important is that they go four and out with 0 yards gained, just like before. The trend is again downward, but that makes no sense here: if the function gives win probability from the perspective of the home team, then their chance to win should increase as they force the opponent to punt while giving up 0 yards. I've noticed this several times, so it is not specific to this one example of input data.

This is either 1) a formula error, where something needs to be adjusted with respect to the relationship between downs and possession, or 2) my own misunderstanding of what this result is supposed to be. I could not find a concrete explanation of what the function calculate_win_probability is supposed to return, but it does appear to give the home team's % chance to win. I have no real idea where specifically this problem would lie in the code, as that is far from my expertise, but any help would be appreciated.

Reprex

calculate_win_probability(Excel_file_name)$wp

Expected Behavior

WP would be expected to look like it does for the home team going four and out, but the reverse for the away team going four and out, something like: 0.5, 0.51, 0.52, 0.54

nflverse_sitrep

> nflverse_sitrep()
── System Info ──────────────────────────────────────────────────────────────────
• R version 4.3.2 (2023-10-31) • Running under: macOS Big Sur 11.7.10
── Package Status ───────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1 nflfastR     4.6.1 4.6.1 4.6.1.9008    dev
2 nflreadr     1.4.0 1.4.0   1.4.0.12    dev
── Package Options ──────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ─────────────────────────────────────────────────────────
• cachem      (1.0.8)    • grid       (4.3.2)    • purrr      (1.0.2)    
• cli         (3.6.2)    • hms        (1.1.3)    • R6         (2.5.1)    
• codetools   (0.2-19)   • janitor    (2.2.0)    • rappdirs   (0.3.3)    
• compiler    (4.3.2)    • jsonlite   (1.8.8)    • rlang      (1.1.3)    
• cpp11       (0.4.7)    • lattice    (0.22-5)   • snakecase  (0.11.1)   
• curl        (5.2.0)    • lifecycle  (1.0.4)    • splines    (4.3.2)    
• data.table  (1.14.10)  • listenv    (0.9.1)    • stats      (4.3.2)    
• digest      (0.6.34)   • lubridate  (1.9.3)    • stringi    (1.8.3)    
• dplyr       (1.1.4)    • magrittr   (2.0.3)    • stringr    (1.5.1)    
• fansi       (1.0.6)    • Matrix     (1.6-4)    • tibble     (3.2.1)    
• fastmap     (1.1.1)    • memoise    (2.0.1)    • tidyr      (1.3.1)    
• fastrmodels (1.0.2)    • methods    (4.3.2)    • tidyselect (1.2.0)    
• furrr       (0.3.1)    • mgcv       (1.9-1)    • timechange (0.3.0)    
• future      (1.33.1)   • nlme       (3.1-164)  • tools      (4.3.2)    
• generics    (0.1.3)    • parallel   (4.3.2)    • utf8       (1.2.4)    
• globals     (0.16.2)   • parallelly (1.36.0)   • utils      (4.3.2)    
• glue        (1.7.0)    • pillar     (1.9.0)    • vctrs      (0.6.5)    
• graphics    (4.3.2)    • pkgconfig  (2.0.3)    • withr      (3.0.0)    
• grDevices   (4.3.2)    • progressr  (0.14.0)   • xgboost    (1.7.7.1)  
── Not Installed ────────────────────────────────────────────────────────────────
• nflseedR  • nflplotR    
• nfl4th    • nflverse    
─────────────────────────────────────────────────────────────────────────────────

Screenshots

No response

Additional context

No response

mrcaseb commented 5 months ago

All of nflfastR's variables are explained in this searchable table

In this table you'll find the following definition of wp

Estimated win probability for the posteam given the current situation at the start of the given play.

The win probability calculated with the function you used is always the win probability of the possession team, which explains why the downward trend makes sense in both cases.
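If a home-team view is what's wanted, one way to get it is to flip wp whenever the home team is not the possession team. The following is only a minimal sketch in base R, not part of nflfastR: the helper name to_home_wp and the team labels "HOME" and "AWAY" are made up for illustration, and it assumes the data frame has the home_team and posteam columns alongside wp. The wp values are the away-team numbers reported above.

```r
# Win probabilities reported in the issue for the away team's four plays.
# These are from the possession team's (here: the away team's) perspective.
plays <- data.frame(
  home_team = "HOME",   # hypothetical team labels, for illustration only
  posteam   = "AWAY",
  wp        = c(0.5497268, 0.5342795, 0.5104250, 0.4668941)
)

# Hypothetical helper (not an nflfastR function): keep wp when the home team
# has the ball, otherwise take the complement.
to_home_wp <- function(wp, posteam, home_team) {
  ifelse(posteam == home_team, wp, 1 - wp)
}

plays$home_wp <- to_home_wp(plays$wp, plays$posteam, plays$home_team)
print(plays$home_wp)
```

With these inputs the home-team values rise from about 0.45 to about 0.53 over the four plays, which is the upward trend described under Expected Behavior.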

marcusSasser commented 5 months ago

Okay, thank you very much. I could not find any specific mention of what its goal was to calculate, so sorry for any inconvenience.

mrcaseb commented 5 months ago

No worries. It's probably a sign we should adjust the function documentation to make the output clearer.