nflverse / nflreadr

Efficiently download nflverse data
https://nflreadr.nflverse.com/

Fix `nflreadr::load_player_stats()` naming to be more in line with nflverse convention #237

Closed isaactpetersen closed 3 weeks ago

isaactpetersen commented 3 months ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

I'd like to merge/join variables across data sets. For many of the datasets, there is not a common ID variable to link them. This makes it challenging to merge the datasets.

Describe the solution you'd like

It would be nice for each dataset to have the (relevant) ID variables—with the same spelling—to easily link them to every other dataset. For instance, it would be helpful for every dataset that has players to have a common player_id variable (spelled the same way), and for each dataset that has games/weekly data to have a game_id variable.

This suggestion is similar/related to the following issue: https://github.com/nflverse/nflreadr/issues/31

Describe alternatives you've considered

No response

Additional context

As an example, let's say I want to know a player's age for each week of their historical stats (from load_player_stats()). To calculate their age at a given game, I would need the player's birthdate and the date of the game, and then take the difference between those dates. No single dataset has all three sets of variables (stats, birthdate, game date), so I would need to merge the datasets. For instance, I could merge the player stats dataset with the players dataset to get the player's birthdate, and then merge the result with the game schedules dataset to get the game date. This is currently challenging because the datasets lack common ID columns. For instance, the player stats dataset has a player_id column, but the players dataset has ID variables with different names (esb_id, gsis_id, gsis_it_id, and smart_id). From inspecting the values, player_id in the player stats dataset appears equivalent to gsis_id in the players dataset, but I don't see documentation of that. It would be helpful if they had the same name (if they are equivalent). In addition, although the schedules dataset has a game_id variable, the player stats dataset does not, which makes it much more challenging to merge.
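To make the target concrete, here is a minimal sketch of the age calculation once the birthdate and game date sit in the same row. It uses toy data in base R only; the column names `game_date` and `birth_date` and all values are assumptions made for illustration, not the actual nflreadr schema.

```r
# Toy data standing in for an already-merged table.
# NOTE: game_date, birth_date, and the ID values are hypothetical.
stats <- data.frame(
  player_id  = c("P1", "P1", "P2"),
  week       = c(1L, 2L, 1L),
  game_date  = as.Date(c("2023-09-10", "2023-09-17", "2023-09-10")),
  birth_date = as.Date(c("1995-09-17", "1995-09-17", "2000-01-01"))
)

# Subtracting two Date vectors gives a difftime in days.
stats$age_days <- as.numeric(stats$game_date - stats$birth_date)

# Approximate age in years (365.25 days per year is a rough convention,
# so values close to a birthday can be off by a fraction of a day).
stats$age_years <- stats$age_days / 365.25

print(stats)
```

The arithmetic itself is simple; the hard part described above is getting the three sets of variables into one table in the first place.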

Having standard ID variables for players and games across datasets would make merging the datasets much easier. Thanks very much for your work on this great package!

john-b-edwards commented 1 month ago

Seems like the primary issue here (as with #238) is that the columns in nflreadr::load_player_stats() are not named consistently with how columns are named elsewhere in the nflverse, so we'll focus on that.

We generally advise using gsis_id (or columns that are the gsis_id but may be named differently, like nflreadr::load_player_stats() |> dplyr::pull(player_id)) as the standard for joining on players.
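As a rough sketch of what that join looks like when the key columns have different names, the toy example below uses base R `merge()` with `by.x`/`by.y`. The data frames are made up and do not come from nflreadr; `display_name`, `birth_date`, `passing_yards`, and the ID values are illustrative assumptions.

```r
# Toy stand-in for the player stats table, keyed by player_id.
# NOTE: all values and non-key column names here are hypothetical.
player_stats <- data.frame(
  player_id     = c("00-0000001", "00-0000001", "00-0000002"),
  week          = c(1L, 2L, 1L),
  passing_yards = c(250, 310, 0)
)

# Toy stand-in for the players table, keyed by gsis_id.
players <- data.frame(
  gsis_id      = c("00-0000001", "00-0000002"),
  display_name = c("Example QB", "Example WR"),
  birth_date   = as.Date(c("1995-09-17", "2000-01-01"))
)

# Left join: keep every stats row and match player_id against gsis_id.
joined <- merge(
  player_stats, players,
  by.x = "player_id", by.y = "gsis_id",
  all.x = TRUE
)

print(joined)
```

With `by.x`/`by.y`, the key column in the result keeps the name from the first table (`player_id`), so no renaming is needed before the merge.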

tanho63 commented 3 weeks ago

I don't think we will be renaming columns in the near future, to preserve backwards compatibility with existing databases. Happy to take a PR updating the data dictionaries to improve the documentation around player ID columns if they are confusing.