Code and resources for the 2nd edition of "Hands-on Machine Learning with R: An applied book covering the fundamentals of machine learning with R" by Boehmke & Greenwell
Will eventually add a Data sets section in the intro chapter so we can avoid having to introduce data sets throughout the book (e.g., those used in examples); we don't have to do this for new data sets used in case studies and exercises, etc.
Great candidate for visualization chapter. Can aggregate outcome to estimate/plot proportions within unique values of other features (e.g., lead_time and deposit_type).
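The aggregation idea in the note above can be sketched in base R. This is only a rough illustration: the `bookings` data frame is a made-up toy, and the outcome column name `is_canceled` is an assumption (the note names only `lead_time` and `deposit_type`).

```r
# Toy stand-in for a bookings-style data set. All values are invented for
# illustration; `is_canceled` is an assumed name for the binary outcome.
bookings <- data.frame(
  lead_time    = c(1, 1, 1, 1, 30, 30, 30, 30, 90, 90, 90, 90),
  deposit_type = c("No Deposit", "No Deposit", "No Deposit", "Non Refund",
                   "No Deposit", "No Deposit", "Non Refund", "Non Refund",
                   "No Deposit", "Non Refund", "Non Refund", "Non Refund"),
  is_canceled  = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1)
)

# The mean of a 0/1 outcome within each unique feature value is the
# estimated proportion of positives for that value.
prop_by_lead    <- aggregate(is_canceled ~ lead_time, data = bookings, FUN = mean)
prop_by_deposit <- aggregate(is_canceled ~ deposit_type, data = bookings, FUN = mean)

print(prop_by_lead)
print(prop_by_deposit)
```

Each result is a small data frame with one row per unique feature value, so passing `prop_by_lead` to `plot()` gives a quick scatterplot of the estimated proportions.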
The 2021 GSS data are based on a probability sample and could be used for analyzing contingency tables.
Employee attrition data seem hard to come by, so we could treat the IBM HR attrition data as a sample (or take a sample thereof) for analyses (e.g., ordinal association).
Missing values:
The 2021 GSS data contain many missing values.
Binary outcomes:
pay_0 as single feature (shows good probability spread from 0-1 across predictor space)
deposit_type (try it w/ simple LR model), which could make for a good discussion about the role of SMEs in identifying potential issues with the ETL process. Some discussion on Kaggle at https://www.kaggle.com/code/marcuswingen/eda-of-bookings-and-ml-to-predict-cancelations.
resources/articles/yeh-2009-uciblood.pdf
Multinomial (i.e., polytomous) outcomes:
Ordinal outcomes:
Counts:
Inference:
Missing values: