If our model does much better on the training set than on the test set, then we’re likely overfitting.
Overfitting occurs when the model or algorithm shows low bias but high variance (whereas underfitting occurs when the model or algorithm shows low variance but high bias).
Why?
Too powerful a model (e.g. a 100-degree polynomial, which effectively maps the input into a very high-dimensional feature space)
Not enough data: Getting more data can sometimes fix overfitting problems
Too many features: irrelevant or redundant features give the model extra freedom to fit noise
How to prevent it?
1. Cross-validation
Use your initial training data to generate multiple mini train-test splits. Use these splits to tune your model.
Keep the test data unseen; use only the training data for training.
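A minimal sketch of this idea with scikit-learn's `cross_val_score` (the toy dataset and the logistic-regression model are just placeholders for your own data and estimator):

```python
# Cross-validation sketch: mini train-test splits built from the training data only.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # toy data standing in for your training set
model = LogisticRegression(max_iter=1000)  # any scikit-learn estimator works here

# 5 folds: each fold is held out once while the model trains on the rest.
# The real test set stays untouched until the very end.
scores = cross_val_score(model, X, y, cv=5)
print("mean CV accuracy:", scores.mean(), "+/-", scores.std())
```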
2. Training with more data
It won’t work every time, but training with more data can help algorithms detect the signal better.
The more data we have, the better our model generalizes (of course, the data must be clean 😁)
3. Remove features
Some algorithms have built-in feature selection.
For those that don’t, you can manually improve their generalizability by removing irrelevant input features.
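One simple way to do this automatically is univariate feature selection; here is a small sketch with scikit-learn's `SelectKBest` (the dataset and the choice `k=2` are only illustrative):

```python
# Prune weak features: keep only the k features most related to the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# k is a knob you tune (e.g. via cross-validation); here we keep 2 of 4 features.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # (150, 4) -> (150, 2)
```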
4. Early stopping
Stop training once the model’s error (or accuracy) on a validation set has not improved, or has improved too little, for a set number of consecutive checks (the “patience”).
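A hand-rolled sketch of this loop (note that `model`, `train_one_epoch`, and `validate` are hypothetical helpers; frameworks like Keras ship a ready-made `EarlyStopping` callback for the same purpose):

```python
# Early stopping: quit when validation loss stops improving for `patience` epochs.
best_val_loss = float("inf")
patience = 5
epochs_without_improvement = 0

for epoch in range(1000):
    train_one_epoch(model)        # hypothetical training step
    val_loss = validate(model)    # hypothetical error on a held-out validation set

    if val_loss < best_val_loss - 1e-4:   # "improved too little" counts as no improvement
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= patience:
        print(f"Stopping early at epoch {epoch}")
        break
```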
5. Regularization
Techniques for artificially forcing your model to be simpler
Some famous ones: L1 (lasso) and L2 (ridge)
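A quick sketch of both penalties on a synthetic regression problem with scikit-learn (the `alpha` values are arbitrary examples, not recommendations):

```python
# L2 (Ridge) vs. L1 (Lasso) regularized linear regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# alpha controls the penalty strength: larger alpha -> simpler, more constrained model.
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can push some weights exactly to zero

print("non-zero Lasso weights:", (lasso.coef_ != 0).sum(), "of", X.shape[1])
```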
6. Ensembling
Combining predictions from multiple separate models
There are a few different methods for ensembling, but the two most common are:
Bagging attempts to reduce the chance of overfitting complex models (it trains a large number of "strong" learners in parallel and gets the final result by voting among them).
Boosting attempts to improve the predictive flexibility of simple models (it trains a large number of "weak" learners in sequence and combines them into a single strong learner).
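One common way to try both in scikit-learn is a random forest (bagging of decision trees) versus gradient boosting; a small comparison sketch, with the dataset and hyperparameters chosen only for illustration:

```python
# Bagging (random forest) vs. boosting (gradient boosting) on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = RandomForestClassifier(n_estimators=200, random_state=0)       # trees trained in parallel, majority vote
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0)  # weak trees trained sequentially

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```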
7. Dropout
Used in neural networks.
Every unit of the neural network (except those in the output layer) is given a probability p of being temporarily ignored during computation.
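A minimal PyTorch sketch, assuming a simple feed-forward classifier (layer sizes and p=0.5 are just example choices):

```python
# Dropout: hidden units are zeroed with probability p during training only.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit dropped with probability 0.5 while training
    nn.Linear(256, 10),  # output layer: no dropout applied to its outputs
)

model.train()  # dropout active during training
model.eval()   # dropout disabled at inference time
```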
TL;DR
Some notes on overfitting.
Link article
https://elitedatascience.com/overfitting-in-machine-learning#overfitting-vs-underfitting
Key takeaways
What?
Why?
How to prevent it?
1. Cross-validation
2. Training with more data
3. Remove features
4. Early stopping
5. Regularization
6. Ensembling
7. Dropout