set a loss function - a measure of model performance, e.g., mean absolute error.
set an optimizer - e.g., Adam or plain stochastic gradient descent (SGD).
set batch_size - a hyperparameter of gradient descent controlling how many training samples feed each gradient update (e.g., with batch_size = 20 and a sample size of 100, each epoch has 5 batches; see the sketch below).
set epochs - the number of complete passes through the training data (each epoch visits every batch once).
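A minimal sketch of the batch arithmetic above (variable names are illustrative only):

import math

# batches per epoch = ceil(sample size / batch size)
n_samples = 100
batch_size = 20
print(math.ceil(n_samples / batch_size))  # 5 batches, i.e. 5 gradient updates per epoch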
6. Model formula
# a minimal, runnable sketch (single Dense layer; the 'mae' loss and 'adam' optimizer follow the notes above)
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

# define model
model = keras.Sequential([keras.Input(shape=(X_train.shape[1],)), layers.Dense(1)])
# compile - loss function and optimizer; the optimizer sets the learning rate (step size)
model.compile(loss='mae', optimizer='adam')
# fit model (training and validation set, batch_size, and epochs)
history = model.fit(X_train, y_train, batch_size=20, epochs=100,
                    validation_data=(X_valid, y_valid))
# look up the loss function values (skip the first few noisy epochs)
history_df = pd.DataFrame(history.history)
history_df.loc[5:, ['loss']].plot()
7. If the model fails to converge, what are the likely causes?
Learning rates that are too large. Large learning rates can speed up training, but they don't "settle in" to a minimum as well; when the learning rate is too large, training can fail completely. (Try setting the learning rate to a large value like 0.99 to see this; a toy illustration follows the link below.)
https://www.kaggle.com/code/yuantian0/exercise-stochastic-gradient-descent/edit
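A minimal self-contained sketch of this failure mode, using a toy quadratic loss rather than a Keras model (the curvature is chosen so that a 0.99 step overshoots):

# gradient descent on the toy loss f(w) = 3 * w**2, whose gradient is 6 * w
def descend(lr, w=1.0, steps=10):
    for _ in range(steps):
        w -= lr * 6 * w  # one gradient-descent update
    return w

print(descend(lr=0.10))  # ~0.0001: settles into the minimum at w = 0
print(descend(lr=0.99))  # ~8.7 million: every step overshoots and the iterate blows up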
8. If the model performs poorly (overfitting or underfitting), what are the causes?
There are two problems that can occur when training a model: not enough signal (the part of the data that can help our model make predictions on new data) or too much noise (the part that is only true of the training data: random fluctuation).
"Not enough signal" might be applicable to imbalanced data.
9. If the model performs poorly (overfitting or underfitting), how do we diagnose it?
Underfitting the training set is when the loss is not as low as it could be because the model hasn't learned enough signal. Overfitting the training set is when the loss is not as low as it could be because the model learned too much noise. The trick to training deep learning models is finding the best balance between the two.
When we train a model we've been plotting the loss on the training set epoch by epoch. To this we'll add a plot of the validation loss too. These plots are called the learning curves. To train deep learning models effectively, we need to be able to interpret them.
Now, the training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal. (Whatever noise the model learned from the training set won't generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned.
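A short sketch of plotting both curves (this assumes the history object returned by model.fit above, trained with validation_data so that 'val_loss' is recorded):

import pandas as pd

history_df = pd.DataFrame(history.history)
# training and validation loss side by side - the gap between them is learned noise
history_df.loc[:, ['loss', 'val_loss']].plot(title='Learning curves')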
Ideally, we would create models that learn all of the signal and none of the noise. This will practically never happen. Instead we make a trade. We can get the model to learn more signal at the cost of learning more noise. So long as the trade is in our favor, the validation loss will continue to decrease. After a certain point, however, the trade can turn against us, the cost exceeds the benefit, and the validation loss begins to rise.
This is a tipping point, where the gain (signal) is outweighed by the cost (noise). One remedy is "early stopping": we keep the model from the point where the validation loss is at a minimum.
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001,  # minimum amount of change to count as an improvement
    patience=20,  # how many epochs to wait before stopping
    restore_best_weights=True,  # roll back to the weights from the best epoch
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping],  # put your callbacks in a list
    verbose=0,  # turn off the per-epoch training log
)
These parameters say: "If there hasn't been an improvement of at least 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found." It can sometimes be hard to tell whether the validation loss is rising due to overfitting or just due to random batch variation; the parameters let us set some allowance around when to stop.
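After a run with early stopping, a quick way to check where the validation loss bottomed out (again using the history object returned by fit):

import pandas as pd

history_df = pd.DataFrame(history.history)
# with restore_best_weights=True, the model's weights already match this epoch
print("Minimum validation loss: {:.4f}".format(history_df['val_loss'].min()))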
1. Why a training, validation, and test split? What about cross-validation?
2. For imbalanced data, what are good evaluation measures?
3. What is the F1 score?
4. Pick one resource for NLP. Must be hands-on, project-based material. https://www.kaggle.com/learn-guide/natural-language-processing
5. Build a deep learning model