Machine Learning Questions #7

1. Why do we split data into training, validation, and test sets? When would cross-validation be used instead?

2. For imbalanced data, what are good evaluation metrics?

3. What is the F1 score?

4. Pick one resource for NLP. It must be hands-on, project-based material. https://www.kaggle.com/learn-guide/natural-language-processing

5. How do we build a deep learning model?

6. Model-building template: define, compile, fit (a filled-in sketch follows the template below)

```python
# define the model architecture (e.g. keras.Sequential([...]))
model = keras.___
# compile: choose a loss function and an optimizer (the optimizer sets the learning rate / step size)
model.compile(loss=___, optimizer=___)
# fit the model: training and validation sets, batch_size, and epochs
history = model.fit(X_train, y_train, batch_size=___, epochs=___, validation_data=(X_valid, y_valid))
# look up the loss values epoch by epoch
history_df = pd.DataFrame(history.history)
history_df.loc[5:, ['loss']].plot()
```
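A minimal filled-in sketch of the template above, assuming a small binary-classification problem. The synthetic data, layer sizes, binary cross-entropy loss, Adam optimizer, batch size, and epoch count are placeholder choices for illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

# toy data, just to make the sketch runnable
X_train, y_train = np.random.rand(800, 20), np.random.randint(0, 2, 800)
X_valid, y_valid = np.random.rand(200, 20), np.random.randint(0, 2, 200)

# define: a small fully connected network
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

# compile: binary cross-entropy loss, Adam optimizer (the optimizer sets the learning rate / step size)
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=1e-3))

# fit: training and validation sets, batch_size, epochs
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=32,
    epochs=50,
    verbose=0,
)

# inspect the loss values epoch by epoch
history_df = pd.DataFrame(history.history)
print(history_df.tail())
```

Swapping the loss and the final activation (e.g. `'mse'` with a linear output) adapts the same skeleton to regression.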

7. If the model fails to converge, what are the likely causes?

8. If the model performs poorly (overfitting or underfitting), what are the causes?

9. If the model performs poorly (overfitting or underfitting), how do we diagnose it?

Underfitting the training set is when the loss is not as low as it could be because the model hasn't learned enough signal. Overfitting the training set is when the loss is not as low as it could be because the model learned too much noise. The trick to training deep learning models is finding the best balance between the two.

When we train a model, we plot the loss on the training set epoch by epoch. To this we add a plot of the loss on the validation data too. These plots are called the learning curves. To train deep learning models effectively, we need to be able to interpret them.

Now, the training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal. (Whatever noise the model learned from the training set won't generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned.
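As a concrete sketch of reading the curves, assuming the `history` object returned by a `model.fit` call with `validation_data`, as in the template above:

```python
import pandas as pd

# history is the object returned by model.fit(..., validation_data=...)
history_df = pd.DataFrame(history.history)

# training loss vs. validation loss, epoch by epoch: the learning curves
history_df[['loss', 'val_loss']].plot(title='Learning curves')

# a widening gap between the two curves means the model is learning noise (overfitting);
# both curves staying high and flat means it has not learned enough signal (underfitting)
```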

Ideally, we would create models that learn all of the signal and none of the noise. This will practically never happen. Instead we make a trade. We can get the model to learn more signal at the cost of learning more noise. So long as the trade is in our favor, the validation loss will continue to decrease. After a certain point, however, the trade can turn against us, the cost exceeds the benefit, and the validation loss begins to rise.
This is a tipping point where the gain in signal is outweighed by the noise being learned. A standard remedy is **early stopping**: we keep the model from the epoch where the validation loss is at its minimum.


```python
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping], # put your callbacks in a list
    verbose=0,  # turn off training log
)
```

These parameters say: "If there hasn't been at least an improvement of 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found." It can sometimes be hard to tell if the validation loss is rising due to overfitting or just due to random batch variation. The parameters allow us to set some allowances around when to stop.
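A small follow-up sketch, assuming the `history` object returned by the fit call above: with `restore_best_weights=True` the model keeps the weights from the epoch with the lowest validation loss, and the length of the history shows how many of the 500 allowed epochs actually ran before the callback fired.

```python
import pandas as pd

history_df = pd.DataFrame(history.history)

# how many epochs actually ran before early stopping fired (out of the 500 allowed)
print("Epochs run:", len(history_df))

# the lowest validation loss and the epoch where it occurred;
# restore_best_weights=True means the model's weights come from that epoch
print("Best val_loss: {:.4f} at epoch {}".format(
    history_df['val_loss'].min(), history_df['val_loss'].idxmin()))
```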