koalaverse / homlr

Supplementary material for Hands-On Machine Learning with R, an applied book covering the fundamentals of machine learning with R.
https://koalaverse.github.io/homlr
Creative Commons Attribution Share Alike 4.0 International

Typos/Suggestions/Questions for Chapter 4/5/6/7 #25

Closed DesmondChoy closed 4 years ago

DesmondChoy commented 4 years ago

Reference date of book: 2019-12-06

Chapter 4: Linear Regression

4.2.2 Inference Notes

(Ctrl-f) "Regresion" & "Remdial"

4.7 Partial least squares

library(caret)  # provides train() and trainControl()

set.seed(123)

cv_model_pls <- train(
  Sale_Price ~ .,
  data = ames_train,
  method = "pls",
  trControl = trainControl(method = "cv", number = 10),
  preProcess = c("zv", "center", "scale"),
  tuneLength = 20
)

# model with lowest RMSE
cv_model_pls$bestTune

I'm not able to replicate m=3 with cv_model_pls$bestTune. I've tried it on two different computers, and I'm getting closer to m=19 or 20. I experimented with tuneLength = 40 and cv_model_pls$bestTune was between 19 and 21. Given the big discrepancy between m=3 and m=19, I thought I'd flag it.

After reading the line "Using PLS with m=3 principal components corresponded with the lowest cross-validated RMSE of $29,970", I was wondering how I would go about verifying the RMSE other than reading it off the ggplot graph itself.

Suggestion: Consider including the following code to aid the reader in extracting the lowest RMSE for themselves:

library(tidyverse)
# assuming $bestTune gives ncomp = 19
cv_model_pls$results %>%
  dplyr::filter(ncomp == 19)
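
To avoid hard-coding the 19, the same lookup can key off $bestTune directly; a small sketch against caret's standard train object fields:

# pull the CV metrics row for whichever ncomp wins on your machine
cv_model_pls$results %>%
  dplyr::filter(ncomp == cv_model_pls$bestTune$ncomp)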

Fig 4.10

There's a typo in the caption: The 10-fold cross "valdation" RMSE

Online supplementary material

In the online notebook (https://koalaverse.github.io/homlr/notebooks/04-linear-regression.nb.html), there's a section with repeated words: (Ctrl-f) "Prediction from a rank-deficient fit…"

Chapter 5: Logistic Regression

5.5 Assessing model accuracy

"There are 16 numeric features in our data set so the following code performs a 10-fold cross-validated PLS model while tuning the number of principal components to use from 1–16. "

Suggestion: Consider including the following code to allow the reader to extract the number of numeric features for themselves:

# keep only the numeric columns, then count them
length(attrition[sapply(attrition, is.numeric)])
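
An equivalent base R one-liner, if that reads more clearly (it just sums the logical vector, assuming attrition is a data frame):

# one TRUE per numeric column, summed to a count
sum(sapply(attrition, is.numeric))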

Suggestion: Consider including the following code to allow the reader to extract the lowest RMSE for themselves:

cv_model_pls$results %>%
  dplyr::filter(ncomp == 14)

Question: Could you elaborate on the intuition behind limiting tuneLength to the number of numeric features? Why can't we set tuneLength to the number of all features?

Chapter 6: Regularized Regression

6.2 Why regularize?

(Ctrl-f) "classicial" (Ctrl-f) bet on sparsity principal - should be "principle"

6.3 Implementation

(Ctrl-f) "Here we just peak" - should be "peek"

6.4 Tuning

Suggestion: Consider including the following code to allow the reader to extract the number of non-zero Lasso coefficients at the lowest MSE:

lasso$nzero[lasso$lambda == lasso$lambda.min] # No. of coef | Min MSE
lasso$nzero[lasso$lambda == lasso$lambda.1se] # No. of coef | 1-SE MSE
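
The matching cross-validated MSE values live on the same cv.glmnet object, so (assuming lasso is the fitted cv.glmnet result) they can be pulled the same way:

lasso$cvm[lasso$lambda == lasso$lambda.min] # CV MSE at lambda.min
lasso$cvm[lasso$lambda == lasso$lambda.1se] # CV MSE at lambda.1se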

Chapter 7: Multivariate Adaptive Regression Splines

7.5 Feature Interpretation

With the latest version of vip (0.2.1), the code below gives a deprecation warning:

# variable importance plots
> p1 <- vip(cv_mars, num_features = 40, bar = FALSE, value = "gcv") + ggtitle("GCV")

Warning message:
In vip.default(cv_mars, num_features = 40, bar = FALSE, value = "gcv") :
  The `bar` argument has been deprecated in favor of the new `geom` argument. It will be removed in version 0.3.0.
> p2 <- vip(cv_mars, num_features = 40, bar = FALSE, value = "rss") + ggtitle("RSS")

Warning message:
In vip.default(cv_mars, num_features = 40, bar = FALSE, value = "rss") :
  The `bar` argument has been deprecated in favor of the new `geom` argument. It will be removed in version 0.3.0.

Suggestion: Code tweaked below.

p1 <- vip(cv_mars, num_features = 40, geom = "point", value = "gcv") + ggtitle("GCV")

p2 <- vip(cv_mars, num_features = 40, geom = "point", value = "rss") + ggtitle("RSS")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Thank you!

bradleyboehmke commented 4 years ago

Hey @DesmondChoy, these are all great. I made updates to the online version so you will see updates in each chapter relating to your points.

Just an FYI: the reproducibility issue was due to the change in R's default sampling procedure in version 3.6.0 (http://bit.ly/35D1SW7). Because of the time it takes to produce the book, we have to cache a lot of the code output; consequently, that chapter was caching an ames train/test split created before the change in sampling procedures.
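
If you want to reproduce the book's original results on R >= 3.6.0, one option (a quick sketch, not something the book itself does) is to opt back into the old sampling algorithm before setting the seed:

# revert to the pre-3.6.0 sampling algorithm for this session
RNGkind(sample.kind = "Rounding")
set.seed(123)
# ... then re-run the train/test split and cv_model_pls from the chapter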