google-research / tuning_playbook

A playbook for systematically maximizing the performance of deep learning models.

Question about training steps #61

Closed JohnHerry closed 5 months ago

JohnHerry commented 1 year ago

I am studying the section on how to choose the batch_size and how to determine "max_training_steps", and I have two questions.

1. How should I tune when the best checkpoint falls within the "warmup" stage? While the learning_rate is still slowly growing toward its target value, the model already reaches its best result. I suspect some hyperparameter is wrong, but I have no idea where the problem might be.

2. How should I handle training on a "small dataset"? I read that batch_size only affects training speed, but when the training dataset is relatively small, a smaller batch_size exposes the model to more distinct sample groupings under shuffling, while a bigger batch_size yields fewer. Does that really make no difference to the final training result?

varungodbole commented 5 months ago

> 1. How should I tune when the best checkpoint falls within the "warmup" stage? While the learning_rate is still slowly growing toward its target value, the model already reaches its best result. I suspect some hyperparameter is wrong, but I have no idea where the problem might be.

It's kind of strange that the best performance on your validation set occurs during the warmup stage. This might indicate overfitting to your validation set. How long is your warmup, as a percentage of the number of training steps for a given trial in your study?
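
For concreteness, here is one way to express warmup as a fraction of the total training steps using optax (a minimal sketch; the 10% fraction and the peak learning rate are placeholder values, not recommendations):

```python
import optax

# Hypothetical numbers for illustration only.
total_steps = 10_000
warmup_fraction = 0.10                       # warmup as a fraction of total steps
warmup_steps = int(warmup_fraction * total_steps)

# Linear warmup from 0 to the peak learning rate, then cosine decay to 0.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,                         # placeholder peak learning rate
    warmup_steps=warmup_steps,
    decay_steps=total_steps,                 # includes the warmup steps
)

# If the best checkpoint lands before `warmup_steps`, the model peaked while
# the learning rate was still ramping up, which is the situation described above.
print(schedule(0), schedule(warmup_steps), schedule(total_steps))
```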

> 2. How should I handle training on a "small dataset"? I read that batch_size only affects training speed, but when the training dataset is relatively small, a smaller batch_size exposes the model to more distinct sample groupings under shuffling, while a bigger batch_size yields fewer. Does that really make no difference to the final training result?

It's hard to know the "absolute best thing" to do in such situations. Some believe that smaller batches have regularizing properties due to the minibatch noise you describe. But in practice, we've been able to tune around that with the techniques listed below (a minimal sketch of the first suggestion follows the list).

  1. Increase the regularization (e.g. weight decay).
  2. Use a better optimizer (e.g. https://github.com/google-research/google-research/blob/master/scalable_shampoo/optax/distributed_shampoo.py)
  3. Use data augmentation to introduce "new" examples to the dataset.
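
As a minimal sketch of the first suggestion, decoupled weight decay is available in optax via adamw; the learning rate and decay strength below are placeholders that would need tuning (the Shampoo optimizer linked above has its own constructor in that file):

```python
import optax

# AdamW applies decoupled weight decay; 1e-4 is a placeholder strength.
optimizer = optax.adamw(learning_rate=1e-3, weight_decay=1e-4)

# Typical optax usage: initialize the optimizer state from the model
# parameters, then transform the gradients at each training step.
# opt_state = optimizer.init(params)
# updates, opt_state = optimizer.update(grads, opt_state, params)
# params = optax.apply_updates(params, updates)
```
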
JohnHerry commented 5 months ago

> 1. How should I tune when the best checkpoint falls within the "warmup" stage? While the learning_rate is still slowly growing toward its target value, the model already reaches its best result. I suspect some hyperparameter is wrong, but I have no idea where the problem might be.
>
> It's kind of strange that the best performance on your validation set occurs during the warmup stage. This might indicate overfitting to your validation set. How long is your warmup, as a percentage of the number of training steps for a given trial in your study?

We are training a simple text classifier with a pretrained BERT encoder. Our dataset has hundreds of thousands of examples, and warmup is 400-1000 steps, less than one fourth of an epoch.
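
As a back-of-the-envelope check (the dataset size and batch size below are assumed for illustration, not taken from the thread), that warmup fraction works out to:

```python
# Hypothetical figures; substitute the real values.
dataset_size = 300_000                        # "hundreds of thousands" of examples
batch_size = 32
steps_per_epoch = dataset_size // batch_size  # 9,375 steps

for warmup_steps in (400, 1000):
    # Roughly 0.04 to 0.11 of an epoch, i.e. well under one fourth.
    print(warmup_steps / steps_per_epoch)
```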

> 2. How should I handle training on a "small dataset"? I read that batch_size only affects training speed, but when the training dataset is relatively small, a smaller batch_size exposes the model to more distinct sample groupings under shuffling, while a bigger batch_size yields fewer. Does that really make no difference to the final training result?
>
> It's hard to know the "absolute best thing" to do in such situations. Some believe that smaller batches have regularizing properties due to the minibatch noise you describe. But in practice, we've been able to tune around that with the techniques listed below.
>
>   1. Increase the regularization (e.g. weight decay).
>   2. Use a better optimizer (e.g. https://github.com/google-research/google-research/blob/master/scalable_shampoo/optax/distributed_shampoo.py)
>   3. Use data augmentation to introduce "new" examples to the dataset.

Thank you for sharing that.