nateschor / OBP


Perform EDA #3

Closed nateschor closed 1 year ago

nateschor commented 1 year ago

I want to include more data from the Lahman package, both to get more OBP observations and to gain more potential predictors. After that, it is time for EDA:

    • [x] download Batting data
    • [x] construct a ggridges plot to decide whether 2020 data should be excluded (the shape of its distribution is different), which year(s) to use as the validation set, and how many years of data to use
    • [x] use DataExplorer to look at the data, identify promising predictors, and decide how to handle outliers and NAs
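
For reference, OBP can be derived from the Batting columns with the standard formula (H + BB + HBP) / (AB + BB + HBP + SF). The sketch below is in Python rather than the R tooling named in this issue; the function name `obp` and the choice to return `None` for missing inputs or a zero denominator are assumptions made for illustration, not the repository's code.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage from Batting-style counts.

    Returns None when any input is missing (None) or when the
    denominator is zero. Both conventions are assumptions of this sketch.
    """
    counts = (h, bb, hbp, ab, sf)
    if any(c is None for c in counts):
        return None
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        return None
    return (h + bb + hbp) / denominator


# Made-up season line, for illustration only.
season_obp = obp(h=150, bb=60, hbp=5, ab=500, sf=5)
```

Returning `None` instead of raising keeps missing seasons visible, so they can be counted during EDA before any filtering decision is made.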
nateschor commented 1 year ago

Summary: In this issue, I grabbed the Lahman dataset and augmented it with lags 1-5 of each of the Lahman variables. I made 1-5 year lag plots and determined that, while current OBP is correlated with its lags, the lags do not seem to account for all of the variation in current OBP (see report/figures). I also determined that, although the distribution of OBP in 2020 looks different from other seasons, 2020 OBP is still correlated with 2019 OBP. Adjusting for the shortened 2020 season by plotting OBP / PA instead does not help either, and it produces small values that may not be numerically stable (potential issues with matrix inversion).
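
The lag construction described in this summary can be sketched as follows. This is a minimal Python illustration, not the code used in the repository: the helper names `add_lags` and `pearson`, the one-row-per-player-season input, and the use of calendar-year lags (season `yearID - k`, missing when the player has no row for that season) are all assumptions. The Batting table can hold several stints per player-season, so rows would need to be aggregated first.

```python
import math
from collections import defaultdict


def add_lags(rows, value_key="OBP", max_lag=5):
    """Return copies of rows with value_key_lag1..lag{max_lag} added.

    rows: list of dicts with 'playerID', 'yearID' and value_key,
    assumed to contain one row per player-season.
    """
    by_player = defaultdict(dict)
    for row in rows:
        by_player[row["playerID"]][row["yearID"]] = row[value_key]

    lagged_rows = []
    for row in rows:
        seasons = by_player[row["playerID"]]
        lagged = dict(row)
        for k in range(1, max_lag + 1):
            # None when the player has no season k years earlier.
            lagged[f"{value_key}_lag{k}"] = seasons.get(row["yearID"] - k)
        lagged_rows.append(lagged)
    return lagged_rows


def pearson(pairs):
    """Pearson correlation over (x, y) pairs, skipping pairs with a None."""
    clean = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(clean)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in clean) / n
    mean_y = sum(y for _, y in clean) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in clean)
    sxx = sum((x - mean_x) ** 2 for x, _ in clean)
    syy = sum((y - mean_y) ** 2 for _, y in clean)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Hypothetical player ID and values, for illustration only.
rows = [
    {"playerID": "player_a", "yearID": 2018, "OBP": 0.300},
    {"playerID": "player_a", "yearID": 2019, "OBP": 0.320},
    {"playerID": "player_a", "yearID": 2020, "OBP": 0.310},
]
lagged = add_lags(rows, max_lag=2)
lag1_corr = pearson([(r["OBP"], r["OBP_lag1"]) for r in lagged])
```

Keying lags on the calendar year rather than on the previous row means a skipped season yields a missing lag instead of silently pairing seasons that are more than one year apart.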

In #4, I will likely handle NAs by filtering and will also determine the bounds $x$ and $y$ for the filter $x > \mathrm{OBP} > y$.
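
A minimal Python sketch of that kind of filter is shown below. The bounds were not chosen in this issue, so `lower` and `upper` are placeholders, and the function name `filter_obp` and its `required` parameter are hypothetical.

```python
def filter_obp(rows, lower, upper, key="OBP", required=()):
    """Keep rows with lower < row[key] < upper and no missing required fields.

    rows: list of dicts; a missing value is represented as None.
    required: extra keys that must be present and not None.
    """
    kept = []
    for row in rows:
        value = row.get(key)
        if value is None:
            continue  # drop rows where OBP itself is missing
        if any(row.get(name) is None for name in required):
            continue  # drop rows with a missing required predictor
        if lower < value < upper:  # strict bounds, matching x > OBP > y
            kept.append(row)
    return kept


# Placeholder bounds and made-up rows, for illustration only.
sample = [
    {"OBP": 0.320, "OBP_lag1": 0.310},
    {"OBP": 1.000, "OBP_lag1": 0.300},
    {"OBP": None, "OBP_lag1": 0.290},
    {"OBP": 0.340, "OBP_lag1": None},
]
filtered = filter_obp(sample, lower=0.1, upper=0.6, required=("OBP_lag1",))
```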

Stable link here

Merged into main in https://github.com/nateschor/OBP/commit/f6bb1f9dd4cb262d21530a5029f1a65c10d47b64