softwareunderground / 52things

52 Things You Should Know About Geocomputing

Simple Machine Learning - review (chapter 21, Didi Ooi) #112

Open mycarta opened 4 years ago

mycarta commented 4 years ago

Overall nicely written chapter. I like the style, the structure, and the objectives, which I think are met. However, I have a few comments on specifics of Machine Learning; see below, organized by section. I may call on others to help out. Ultimately it may need further work from the author.

1. Understand each variable independently
About determining normality: I recently had an in-depth discussion with a friend (a statistician) about this because I was confused by contradictory recommendations in this regard. He assured me there are no distributional assumptions on the predictors, only on the dependent variable, so this needs to be clarified.

2. Feature engineering
All good

3. Understand bivariate relationship
All good

4. Exploit multivariate patterns
I would not use only PCA. I would suggest multiple methods to explore multivariate relationships / variable importance, ideally a combination of model-based and non-model-based ones, and decide based on majority vote (the variables most methods agree upon).
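A minimal sketch of the majority-vote idea, in plain Python. The function name `majority_vote`, the `top_k` cutoff, the method labels, and the toy scores are all assumptions for illustration; they are not results from the chapter. In practice each score dictionary would come from a real method (for example PCA loadings, a rank correlation, or a model-based importance measure).

```python
from collections import Counter


def majority_vote(scores_by_method, top_k=3, min_votes=None):
    """Return features ranked in the top_k by at least min_votes methods.

    scores_by_method maps a method name to a {feature: score} dict.
    """
    n_methods = len(scores_by_method)
    if min_votes is None:
        min_votes = n_methods // 2 + 1  # strict majority of the methods

    votes = Counter()
    for scores in scores_by_method.values():
        # Rank by absolute score so signed measures (e.g. correlations)
        # are compared on strength rather than direction.
        ranked = sorted(scores, key=lambda f: abs(scores[f]), reverse=True)
        votes.update(ranked[:top_k])

    selected = [f for f, v in votes.items() if v >= min_votes]
    # Most-voted first, ties broken alphabetically for reproducibility.
    return sorted(selected, key=lambda f: (-votes[f], f)), votes


# Toy placeholder scores (hypothetical, for illustration only).
toy_scores = {
    "pca_loading": {"x1": 0.9, "x2": 0.7, "x3": 0.1, "x4": 0.3},
    "rank_correlation": {"x1": -0.8, "x2": 0.2, "x3": 0.1, "x4": 0.6},
    "permutation_importance": {"x1": 0.5, "x2": 0.05, "x3": 0.0, "x4": 0.3},
}

selected, votes = majority_vote(toy_scores, top_k=2)
print(selected)
```

The scores from different methods are not on a common scale, which is why the sketch votes on ranks within each method instead of averaging raw values.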

5. Train your Machine Learning model
Here we have a recommendation for an 80/20 training / validation split. This needs to be clarified on two levels:

  1. The terminology. It is unclear to me what the author means by Validation (for terminology I try to stick to Sebastian Raschka's; see the diagram below):

[Image: Screen Shot 2020-06-13 at 4 37 27 PM]

  2. If the intended meaning is just an 80/20 train/test split like in the first row of the diagram, then it may be OK, although 80/20 is seldom a good generic split. I could be wrong, but I have a sense the author may be referring to the second row, because she mentions training competitive models, in which case this approach would be incorrect. It certainly needs to be clarified.
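To make the concern concrete, here is a minimal sketch of a three-way split in plain Python, where competing models would be compared on a validation set and the test set is touched only once at the end. The function name `three_way_split`, the 60/20/20 proportions, and the seed are assumptions for illustration; they are not the chapter's method and not a recommended ratio.

```python
import random


def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle sample indices and split them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))

    test = indices[:n_test]                # held out until the very end
    val = indices[n_test:n_test + n_val]   # used to compare candidate models
    train = indices[n_test + n_val:]       # used to fit each candidate
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

If several models are instead compared on the same 20% that is later reported as the final score, that score is optimistically biased, because the hold-out data has already influenced model selection.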

6. Prediction!
All good