Potential book layout - GitHub Issues

bradleyboehmke commented 8 months ago

Fundamentals

Introduction to ML
- What is ML
- Types of ML systems
- ML in R
Before the modeling process
- Problem framing
- Planning & scoping
- Experimentation
- Production
The basic modeling process
- Data splitting
- Building models
- Making predictions
- Model evaluation
  - Understanding residuals
  - Aggregate residual metrics
  - Performance plots (e.g., ROC curve, lift chart)
Data preprocessing
- Target engineering
- Missing values
- Feature filtering
- Numeric feature engineering
- Categorical feature engineering
- Data compression (PCA)
A more robust modeling process
- Bias-variance trade-off
- Resampling
- Hyperparameter tuning
Model trust
- Ethics
- Interpretability vs. Explainability
- Global explainability
- Local explainability

Supervised Modeling

Linear regression
Logistic regression
- Add section on Multinomial problems
Regularized regression
Transitioning to non-linearity
- Polynomial
- MARS
- GAMs
KNN
Decision trees
Bagging
Random forests
Gradient boosting
Support vector machines
Stacked models

Deep Learning

Intro to DL
The DL modeling process
Transfer learning
Computer vision
Word embeddings
Language models

bradleyboehmke commented 8 months ago

@bgreenwell, I thought a lot about our recent discussions, and it made me go back and reconsider the layout. Above is a proposed new TOC layout. The middle section doesn't change a whole lot, but the first section adds some new content that I think would help set the book apart.

For example, ch 2 would talk about framing and scoping ML problems along with thinking about production concerns. This is where we can mention things around the lifecycle of an ML project (e.g., drift) while making clear that our book does not focus on this topic (we can point to other resources).

Also, notice that I remove the unsupervised section but add in a DL section. This modernizes the book; plus, I already have a lot of DL notebooks built out that I can migrate, so this isn't starting from scratch.

What are your thoughts?

bgreenwell commented 7 months ago

Lots to discuss at our next catch-up, but here are some (very) high-level thoughts:

The proposed chapter 2 makes me think about the Microsoft ML checklist, which I really like. Can we try to incorporate and/or align with that? Are there others?
In the interest of any discussion on leakage, I think preprocessing should be introduced before data splitting in chapter 3, and then we point to the later chapter on preprocessing methods (this ties in STRONGLY with leakage). Maybe this is where we introduce the leakage framework?
Unsupervised is missing?
I think we need a special chapter on additional topics up front? E.g., missing values, collinearity in general, interpretability, variable selection and ranking, "Responsible AI", ...
- I say up front because I think it's too critical to leave for the end, but also hard to discuss prior to the core content. Still pondering this.
I don't like the idea of deep learning being separated from the rest, but perhaps it's worthwhile because of its broader applications, like embeddings, etc.? But the same goes for random forests (e.g., isolation forests for anomaly detection) and many other methods. I can be persuaded here, but that's my initial thought.

koalaverse / homlr-2ed

Potential book layout #14