nflverse / nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data
https://www.nflfastr.com/

Populate wind and temp from weather #303

Closed alecglen closed 1 year ago

alecglen commented 2 years ago

Many values in the play_by_play wind and temp columns are missing, including those for all 2021 games. As @guga31bb mentioned in https://github.com/nflverse/nflfastR-data/issues/32, it's probably just due to what the NFL provides.

However, in most of these cases, the data needed to populate the columns is readily available in the weather column. Just wanted to put it out there that these could be backfilled.

alecglen commented 2 years ago

I don't use R or I'd open a PR, but FWIW here's the algorithm/regex I'm using locally with good results.

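import re

import numpy as np


def parse_precipitation(play):
    # True when the weather text mentions rain or snow, in any letter case.
    weather = play['weather']
    return isinstance(weather, str) and bool(re.search(r'rain|snow', weather, re.IGNORECASE))


def parse_temperature(play):
    # Prefer the provided value; otherwise fall back to the weather text.
    if not np.isnan(play['temp']):
        return play['temp']
    # A missing weather value is a float NaN, so check for a real string first.
    if isinstance(play['weather'], str):
        match = re.search(r'Temp: (-?\d+)°', play['weather'])
        if match:
            return int(match.group(1))
    return None  # nothing to backfill from


def parse_wind(play):
    # Same approach as parse_temperature, for the wind speed.
    if not np.isnan(play['wind']):
        return play['wind']
    if isinstance(play['weather'], str):
        match = re.search(r'Wind:.* (\d+) ', play['weather'])
        if match:
            return int(match.group(1))
    return None  # nothing to backfill from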
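
For anyone who wants to try the backfill without pandas or NumPy, here is a minimal standard-library sketch of the same idea applied to a list of plays. The helper names (`is_missing`, `backfill_play`) and the two sample rows are made up for illustration, and the wind pattern is a slightly stricter variant that requires a trailing "mph"; it is an assumption about the string format, not something nflfastR provides.

```python
import math
import re

# Patterns are assumptions based on strings like
# "Sunny Temp: 87° F, Humidity: 52%, Wind: SW 10 mph".
TEMP_RE = re.compile(r"Temp: (-?\d+)°")
WIND_RE = re.compile(r"Wind:.*?(\d+)\s*mph", re.IGNORECASE)


def is_missing(value):
    """True for None or a float NaN, the two ways a gap might be encoded."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def backfill_play(play):
    """Return a copy of `play` with temp/wind filled from the weather text."""
    filled = dict(play)
    weather = play.get("weather")
    if not isinstance(weather, str):
        return filled
    for column, pattern in (("temp", TEMP_RE), ("wind", WIND_RE)):
        if is_missing(filled.get(column)):
            match = pattern.search(weather)
            if match:
                filled[column] = int(match.group(1))
    return filled


# Hypothetical sample rows; only the weather strings follow the real format.
plays = [
    {"game_id": "2001_01_NO_BUF", "temp": float("nan"), "wind": float("nan"),
     "weather": "Sunny Temp: 87° F, Humidity: 52%, Wind: SW 10 mph"},
    {"game_id": "2001_01_CAR_MIN", "temp": 65, "wind": None,
     "weather": "Temp: 65° F, Wind:   mph"},
]
backfilled = [backfill_play(p) for p in plays]
```

Values that are already present are left alone, and a weather string with no usable number (the second row's wind) stays missing rather than being guessed.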

mrcaseb commented 2 years ago

This is example code to extract the parts of the weather string:

df <- pbp |>
  dplyr::filter(!is.na(weather)) |>
  # keep one row per game and weather string
  dplyr::distinct(season, game_id, weather) |>
  dplyr::mutate(
    # indoor games get no temperature
    temp_f = dplyr::case_when(
      stringr::str_detect(weather, "Indoors") ~ NA_character_,
      TRUE ~ stringr::str_extract(weather, "(?<=Temp: )-?[:digit:]{1,3}")
    ),
    temp_f = as.numeric(temp_f),
    temp_c = (temp_f - 32) * 5 / 9,
    hum = stringr::str_extract(weather, "(?<=Humidity: )[:digit:]{1,3}"),
    hum = as.numeric(hum) / 100,
    # everything between "Wind: " and the last " mph"; empty matches become NA
    wind = stringr::str_extract(weather, "(?<=Wind: ).+(?= mph)") |>
      stringr::str_trim() |>
      dplyr::na_if("")
  ) |>
  # drop games where nothing could be extracted
  dplyr::filter(!(is.na(temp_f) & is.na(hum) & is.na(wind)))

df
# A tibble: 2,723 x 7
   game_id         season weather                                                                 temp_f temp_c   hum wind         
   <chr>            <int> <chr>                                                                    <dbl>  <dbl> <dbl> <chr>        
 1 2001_01_ATL_SF    2001 partly cloudy Temp: 68° F, Humidity: 63%, Wind: Southwest 12 MPH mph        68   20    0.63 Southwest 12~
 2 2001_01_CAR_MIN   2001 Temp: 65° F, Wind:   mph                                                    65   18.3 NA    NA           
 3 2001_01_CHI_BAL   2001 Mostly cloudy, highs in mid 80's Temp: 83° F, Humidity: 66%, Wind: Sou~     83   28.3  0.66 South 10     
 4 2001_01_DET_GB    2001 Rain throughout game, heavy showers possible. Temp: 60° F, Humidity: 9~     60   15.6  0.93 NW 5         
 5 2001_01_IND_NYJ   2001 Partly Sunny Temp: 81° F, Humidity: 81%, Wind: SW 6 mph mph                 81   27.2  0.81 SW 6 mph     
 6 2001_01_MIA_TEN   2001 Partly Cloudy & Windy Temp: 81° F, Humidity: 69%, Wind: From the South~     81   27.2  0.69 From the Sou~
 7 2001_01_NE_CIN    2001 Partly clooudy, poossible showers/thunderstorms Temp: 79° F, Humidity:~     79   26.1  0.87 S 8          
 8 2001_01_NO_BUF    2001 Sunny Temp: 87° F, Humidity: 52%, Wind: SW 10 mph                           87   30.6  0.52 SW 10        
 9 2001_01_NYG_DEN   2001 Clear Temp: 75° F, Humidity: 18%, Wind: SE 9 mph                            75   23.9  0.18 SE 9         
10 2001_01_OAK_KC    2001 Mostly Sunny Temp: 64° F, Humidity: 78%, Wind: Northwest 12 mph             64   17.8  0.78 Northwest 12 
# ... with 2,713 more rows
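
For Python users, the R extraction above can be roughly sketched with the standard library alone. This is an illustrative port, not part of nflfastR: the function name `parse_weather` is made up, and the patterns assume weather strings shaped like the ones in the output above.

```python
import re


def parse_weather(weather):
    """Split one weather string into temp_f, temp_c, hum and wind (None if absent)."""
    temp_f = None
    # Indoor games get no temperature, as in the R code.
    if "Indoors" not in weather:
        m = re.search(r"Temp: (-?\d{1,3})", weather)
        if m:
            temp_f = float(m.group(1))
    temp_c = None if temp_f is None else (temp_f - 32) * 5 / 9

    m = re.search(r"Humidity: (\d{1,3})", weather)
    hum = int(m.group(1)) / 100 if m else None

    # Greedy match up to the last " mph", mirroring the R lookahead.
    m = re.search(r"Wind: (.+) mph", weather)
    wind = m.group(1).strip() if m else None

    # An empty wind string is treated as missing, like na_if("") in R.
    return {"temp_f": temp_f, "temp_c": temp_c, "hum": hum, "wind": wind or None}


sunny = parse_weather("Sunny Temp: 87° F, Humidity: 52%, Wind: SW 10 mph")
sparse = parse_weather("Temp: 65° F, Wind:   mph")
doubled = parse_weather("Partly Sunny Temp: 81° F, Humidity: 81%, Wind: SW 6 mph mph")
```

Like the R version, the greedy wind match keeps a stray unit when the source string repeats it, so `doubled["wind"]` comes back as "SW 6 mph"; a further cleanup step would be needed to get a purely numeric wind speed.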

weather.csv

mrcaseb commented 1 year ago

The pbp data set is already huge, and these additional variables would inflate it unnecessarily. I have provided code to extract the information from the weather string, so I am going to close this as not planned.