Closed lfoppiano closed 1 year ago
Hi Luca!
I realize that the default learning rate should depend on the architecture, and that was not well handled: low learning rates like 0.0001 or lower are typical for BERT, but RNNs need a much higher value. So the older default (0.001) was too high for BERT, I think, but the new default (0.0001) in this PR is now too low for RNNs.
It's definitely useful to add it as a command-line parameter, but I think we should set the default learning rate in the configure() functions of the different applications, depending on the selected architecture.
OK, I double-checked: for both sequence labeling and text classification, the learning rate for all transformer architectures is hard-coded at init_lr=2e-5 in the decay optimizer (the usual value); the config value is not used.
Only RNN models were using the config learning rate value, and the default (0.001) was set for them.
So these were my assumptions when I added the decay optimizers:
1) For transformers we always use 2e-5 as the learning rate, because everybody uses that value and we don't want to change it (I vaguely remember having tested 1e-5, but it was very slightly worse, and higher values are not recommended because they make the model more prone to "forgetting" some training examples).
2) For RNN models, changing the learning rate is more usual, so they use the config value.
Thanks for the clarification. I think having a configurable parameter could be useful, for example to lower the learning rate for incremental training. I propose the following:
We can still set the value in the application, but at least we don't risk running with the wrong default value.
Let me know if this makes sense.
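To make the proposal concrete, here is a minimal sketch of an architecture-dependent default, assuming hypothetical names (the real configure() functions and architecture labels in DeLFT differ):

```python
# Hypothetical sketch (names illustrative): pick a default learning rate
# per architecture when none is given on the command line.
DEFAULT_LEARNING_RATES = {
    "transformer": 2e-5,  # usual fine-tuning value for BERT-like models
    "rnn": 0.001,         # older default, suited to RNN models
}

def resolve_learning_rate(architecture, cli_value=None):
    """Return the CLI value if given, else the architecture default."""
    if cli_value is not None:
        return cli_value
    key = "transformer" if "BERT" in architecture else "rnn"
    return DEFAULT_LEARNING_RATES[key]
```

This way a `--learning-rate` flag always wins, and the default silently follows the selected architecture.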
I've fixed the default values (also in the classification trainer).
I've added a callback that prints the LR decayed at each epoch, however I have the following:
---
max_epoch: 60
early_stop: True
patience: 5
batch_size (training): 80
max_sequence_length: 30
model_name: grobid-date-BERT
learning_rate: 2e-05
use_ELMo: False
---
[...]
__________________________________________________________________________________________________
Epoch 1/60
8/8 [==============================] - ETA: 0s - loss: 2.0593 f1 (micro): 47.24
8/8 [==============================] - 69s 8s/step - loss: 2.0593 - f1: 0.4724 - learning_rate: 3.8095e-06
Epoch 2/60
8/8 [==============================] - ETA: 0s - loss: 1.2964 f1 (micro): 82.82
8/8 [==============================] - 43s 4s/step - loss: 1.2964 - f1: 0.8282 - learning_rate: 7.6190e-06
Epoch 3/60
8/8 [==============================] - ETA: 0s - loss: 0.6858 f1 (micro): 87.61
8/8 [==============================] - 29s 4s/step - loss: 0.6858 - f1: 0.8761 - learning_rate: 1.1429e-05
Epoch 4/60
8/8 [==============================] - ETA: 0s - loss: 0.3628 f1 (micro): 92.73
8/8 [==============================] - 29s 4s/step - loss: 0.3628 - f1: 0.9273 - learning_rate: 1.5238e-05
Epoch 5/60
8/8 [==============================] - ETA: 0s - loss: 0.1840 f1 (micro): 94.89
8/8 [==============================] - 15s 2s/step - loss: 0.1840 - f1: 0.9489 - learning_rate: 1.9048e-05
Epoch 6/60
8/8 [==============================] - ETA: 0s - loss: 0.1167 f1 (micro): 94.61
8/8 [==============================] - 25s 3s/step - loss: 0.1167 - f1: 0.9461 - learning_rate: 1.9683e-05
Epoch 7/60
8/8 [==============================] - ETA: 0s - loss: 0.0769 f1 (micro): 94.89
8/8 [==============================] - 23s 3s/step - loss: 0.0769 - f1: 0.9489 - learning_rate: 1.9259e-05
Epoch 8/60
8/8 [==============================] - ETA: 0s - loss: 0.0656 f1 (micro): 95.50
8/8 [==============================] - 23s 3s/step - loss: 0.0656 - f1: 0.9550 - learning_rate: 1.8836e-05
Epoch 9/60
8/8 [==============================] - ETA: 0s - loss: 0.0562 f1 (micro): 95.50
8/8 [==============================] - 36s 5s/step - loss: 0.0562 - f1: 0.9550 - learning_rate: 1.8413e-05
Epoch 10/60
8/8 [==============================] - ETA: 0s - loss: 0.0514 f1 (micro): 96.10
8/8 [==============================] - 21s 3s/step - loss: 0.0514 - f1: 0.9610 - learning_rate: 1.7989e-05
Epoch 11/60
8/8 [==============================] - ETA: 0s - loss: 0.0424 f1 (micro): 96.70
8/8 [==============================] - 19s 2s/step - loss: 0.0424 - f1: 0.9670 - learning_rate: 1.7566e-05
Epoch 12/60
8/8 [==============================] - ETA: 0s - loss: 0.0348 f1 (micro): 96.70
8/8 [==============================] - 17s 2s/step - loss: 0.0348 - f1: 0.9670 - learning_rate: 1.7143e-05
Epoch 13/60
8/8 [==============================] - ETA: 0s - loss: 0.0348 f1 (micro): 96.70
8/8 [==============================] - 38s 5s/step - loss: 0.0348 - f1: 0.9670 - learning_rate: 1.6720e-05
Epoch 14/60
8/8 [==============================] - ETA: 0s - loss: 0.0294 f1 (micro): 96.70
8/8 [==============================] - 36s 5s/step - loss: 0.0294 - f1: 0.9670 - learning_rate: 1.6296e-05
Epoch 15/60
8/8 [==============================] - ETA: 0s - loss: 0.0244 f1 (micro): 96.10
8/8 [==============================] - 19s 2s/step - loss: 0.0244 - f1: 0.9610 - learning_rate: 1.5873e-05
Epoch 16/60
8/8 [==============================] - ETA: 0s - loss: 0.0251 f1 (micro): 96.10
8/8 [==============================] - 12s 1s/step - loss: 0.0251 - f1: 0.9610 - learning_rate: 1.5450e-05
training runtime: 457.32 seconds
model config file saved
preprocessor saved
transformer config saved
transformer tokenizer saved
model saved
The initial learning rate is 2e-05 (0.00002), but 🤔 🤔 it seems to float up and down during the first epochs, before decreasing after epoch 6. Is this normal? AFAIK it should only go down.
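For what it's worth, the warmup length would explain the rise. Assuming nb_train_steps is batches per epoch times max_epoch (my guess at how the wrapper counts steps, not verified against the code), `num_warmup_steps=0.1 * nb_train_steps` covers roughly the first 6 epochs, which is where the LR peaks in the log above:

```python
# Back-of-the-envelope check, assuming nb_train_steps = batches_per_epoch
# * max_epoch (my assumption; the wrapper may count steps differently).
steps_per_epoch = 8
max_epoch = 60
nb_train_steps = steps_per_epoch * max_epoch        # 480
num_warmup_steps = 0.1 * nb_train_steps             # 48.0
warmup_epochs = num_warmup_steps / steps_per_epoch  # 6.0
print(warmup_epochs)  # 6.0 -> LR climbs until roughly epoch 6, then decays
```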
Trying to figure that out, I noticed that the wrapper has a parameter lr_decay=0.9, but I'm not able to see it used anywhere. For example, for the transformers we have:
```python
optimizer, lr_schedule = create_optimizer(
    init_lr=self.training_config.learning_rate,
    num_train_steps=nb_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0.1 * nb_train_steps,
)
```
or, for non-transformers, with the Adam optimizer:

```python
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=self.training_config.learning_rate,
    decay_steps=nb_train_steps,
    decay_rate=0.1)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```
For this case, should I assume decay_rate = 1 - lr_decay?
What about the transformers?
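Side note on the semantics: per the TF docs, ExponentialDecay computes `lr(step) = initial_learning_rate * decay_rate ** (step / decay_steps)`, so `decay_rate=0.1` means the LR shrinks to 10% of its initial value over the full `decay_steps`, which is not the same thing as a per-epoch `1 - lr_decay` factor. A pure-Python mirror of the formula (the 0.001 / 480-step numbers are just illustrative):

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    # Mirrors tf.keras.optimizers.schedules.ExponentialDecay (non-staircase):
    # lr(step) = initial_lr * decay_rate ** (step / decay_steps)
    return initial_lr * decay_rate ** (step / decay_steps)

print(exponential_decay(0.001, 0.1, 480, 0))    # initial LR at step 0
print(exponential_decay(0.001, 0.1, 480, 480))  # ~1e-4: initial * decay_rate
```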
By removing the warmup steps, the learning rate does not float around. Are we sure the warmup steps are necessary for fine-tuning? 🤔
Yes, normally warm-up is important when fine-tuning with transformers and ELMo (if I remember well, warmup matters even more than the learning rate decay!).
The create_optimizer method that manages learning rate and warmup comes directly from the transformers library, and the up-and-down behavior might be working as expected. The warmup applies lower learning rates in the first epochs to avoid sudden overfitting at the very beginning of training. So with warmup, the LR starts lower than init_lr, and only after the warmup phase is done does the LR reach the init_lr value.
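The resulting shape (linear warmup up to init_lr, then decay toward zero, which is what the observed log suggests here) can be mirrored in a few lines. This is a sketch of the schedule shape only, not the library code itself:

```python
def warmup_then_linear_decay(step, init_lr, num_train_steps, num_warmup_steps):
    """LR rises linearly to init_lr during warmup, then decays linearly to 0.
    Pure-Python sketch of the schedule shape; not the transformers code."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = num_train_steps - step
    return init_lr * remaining / (num_train_steps - num_warmup_steps)

# With init_lr=2e-5, 480 total steps and 48 warmup steps:
#   step 24  -> 1e-5  (still climbing)
#   step 48  -> 2e-5  (peak, end of warmup)
#   step 264 -> 1e-5  (halfway through the decay)
```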
In line with the use of incremental training, knowing the final learning rate of the previous training, and being able to set it manually, can be helpful.
This PR (updated list):
- adds a --learning-rate parameter to override the default learning rate value in the *Tagging applications
- sets the default learning rate per architecture: transformers (2e-5) and RNN (0.0001)