kermitt2 / delft

a Deep Learning Framework for Text https://delft.readthedocs.io/
Apache License 2.0

add learning rate visualisation and manual parameter #161

Closed lfoppiano closed 1 year ago

lfoppiano commented 1 year ago

In line with the use of incremental training, knowing the final learning rate of the "previous training" and being able to set it manually can be helpful.

This PR (updated list):

kermitt2 commented 1 year ago

Hi Luca !

I realize that the default learning rate should depend on the architecture, and that was not well done: low learning rates like 0.0001 or lower are typical for BERT, but RNNs need a much higher value. So the older default (0.001) was too high for BERT I think, but the new default (0.0001) in this PR is now too low for RNNs.

It's definitely useful to add it as a command line parameter, but I think we should set the default learning rate in the configure() functions of the different applications, depending on the selected architecture.

kermitt2 commented 1 year ago

OK, I double-checked: for both sequence labeling and text classification, the learning rate for all transformer architectures is hard-coded at init_lr=2e-5 in the decay optimizer (this is the usual value). It's not using the config value.

Only RNN models were using the config learning rate value, and the default (0.001) was set for this.

kermitt2 commented 1 year ago

So these were my assumptions when I added the decay optimizers:

1) for transformers, we always use 2e-5 as learning rate because everybody uses that value and we don't want to change it (I vaguely remember having tested 1e-5, but it was very slightly worse, and higher values are not recommended because they make the model more prone to "forgetting" some training examples).

2) for RNN models, changing the learning rate is more usual, so they use the config value.

lfoppiano commented 1 year ago

Thanks for the clarification. I think having a configurable parameter could be useful, for example to lower the learning rate for incremental training. I propose the following:

We can also set the value in the application, but at least we don't risk running it with the wrong default value.
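
For illustration, here is a minimal sketch of what an architecture-dependent default could look like; the helper name, the architecture check and the sample values are assumptions for this example, not the actual DeLFT code:

# hypothetical helper: pick a default learning rate from the selected architecture
# when none is provided on the command line
def default_learning_rate(architecture: str) -> float:
    # transformer fine-tuning conventionally uses 2e-5; RNN models need a higher value
    if "BERT" in architecture:
        return 2e-5
    return 1e-3

for arch in ("BidLSTM_CRF", "BERT", "BERT_CRF"):
    print(arch, default_learning_rate(arch))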

Let me know if this makes sense

lfoppiano commented 1 year ago

I've fixed the default values (also in the classification trainer).

I've added a callback that prints the decayed LR at each epoch (a sketch of the idea is shown after the log below); however, I get the following:

---
max_epoch: 60
early_stop: True
patience: 5
batch_size (training): 80
max_sequence_length: 30
model_name: grobid-date-BERT
learning_rate:  2e-05
use_ELMo:  False
---
[...]
__________________________________________________________________________________________________
Epoch 1/60
8/8 [==============================] - ETA: 0s - loss: 2.0593   f1 (micro): 47.24
8/8 [==============================] - 69s 8s/step - loss: 2.0593 - f1: 0.4724 - learning_rate: 3.8095e-06
Epoch 2/60
8/8 [==============================] - ETA: 0s - loss: 1.2964   f1 (micro): 82.82
8/8 [==============================] - 43s 4s/step - loss: 1.2964 - f1: 0.8282 - learning_rate: 7.6190e-06
Epoch 3/60
8/8 [==============================] - ETA: 0s - loss: 0.6858   f1 (micro): 87.61
8/8 [==============================] - 29s 4s/step - loss: 0.6858 - f1: 0.8761 - learning_rate: 1.1429e-05
Epoch 4/60
8/8 [==============================] - ETA: 0s - loss: 0.3628   f1 (micro): 92.73
8/8 [==============================] - 29s 4s/step - loss: 0.3628 - f1: 0.9273 - learning_rate: 1.5238e-05
Epoch 5/60
8/8 [==============================] - ETA: 0s - loss: 0.1840   f1 (micro): 94.89
8/8 [==============================] - 15s 2s/step - loss: 0.1840 - f1: 0.9489 - learning_rate: 1.9048e-05
Epoch 6/60
8/8 [==============================] - ETA: 0s - loss: 0.1167   f1 (micro): 94.61
8/8 [==============================] - 25s 3s/step - loss: 0.1167 - f1: 0.9461 - learning_rate: 1.9683e-05
Epoch 7/60
8/8 [==============================] - ETA: 0s - loss: 0.0769   f1 (micro): 94.89
8/8 [==============================] - 23s 3s/step - loss: 0.0769 - f1: 0.9489 - learning_rate: 1.9259e-05
Epoch 8/60
8/8 [==============================] - ETA: 0s - loss: 0.0656   f1 (micro): 95.50
8/8 [==============================] - 23s 3s/step - loss: 0.0656 - f1: 0.9550 - learning_rate: 1.8836e-05
Epoch 9/60
8/8 [==============================] - ETA: 0s - loss: 0.0562   f1 (micro): 95.50
8/8 [==============================] - 36s 5s/step - loss: 0.0562 - f1: 0.9550 - learning_rate: 1.8413e-05
Epoch 10/60
8/8 [==============================] - ETA: 0s - loss: 0.0514   f1 (micro): 96.10
8/8 [==============================] - 21s 3s/step - loss: 0.0514 - f1: 0.9610 - learning_rate: 1.7989e-05
Epoch 11/60
8/8 [==============================] - ETA: 0s - loss: 0.0424   f1 (micro): 96.70
8/8 [==============================] - 19s 2s/step - loss: 0.0424 - f1: 0.9670 - learning_rate: 1.7566e-05
Epoch 12/60
8/8 [==============================] - ETA: 0s - loss: 0.0348   f1 (micro): 96.70
8/8 [==============================] - 17s 2s/step - loss: 0.0348 - f1: 0.9670 - learning_rate: 1.7143e-05
Epoch 13/60
8/8 [==============================] - ETA: 0s - loss: 0.0348   f1 (micro): 96.70
8/8 [==============================] - 38s 5s/step - loss: 0.0348 - f1: 0.9670 - learning_rate: 1.6720e-05
Epoch 14/60
8/8 [==============================] - ETA: 0s - loss: 0.0294   f1 (micro): 96.70
8/8 [==============================] - 36s 5s/step - loss: 0.0294 - f1: 0.9670 - learning_rate: 1.6296e-05
Epoch 15/60
8/8 [==============================] - ETA: 0s - loss: 0.0244   f1 (micro): 96.10
8/8 [==============================] - 19s 2s/step - loss: 0.0244 - f1: 0.9610 - learning_rate: 1.5873e-05
Epoch 16/60
8/8 [==============================] - ETA: 0s - loss: 0.0251   f1 (micro): 96.10
8/8 [==============================] - 12s 1s/step - loss: 0.0251 - f1: 0.9610 - learning_rate: 1.5450e-05
training runtime: 457.32 seconds 
model config file saved
preprocessor saved
transformer config saved
transformer tokenizer saved
model saved
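
As mentioned above, the per-epoch learning_rate values in this log come from the logging callback. A minimal sketch of the idea, assuming a plain tf.keras callback (not necessarily the exact code added in this PR):

import tensorflow as tf

class LearningRateLogger(tf.keras.callbacks.Callback):
    # add the optimizer's current learning rate to the epoch logs,
    # so that it shows up in the progress bar output as above
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.learning_rate
        # when a LearningRateSchedule is used, evaluate it at the current step
        if isinstance(lr, tf.keras.optimizers.schedules.LearningRateSchedule):
            lr = lr(self.model.optimizer.iterations)
        if logs is not None:
            logs["learning_rate"] = float(tf.keras.backend.get_value(lr))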

The initial learning rate is 2e-05 (0.00002), but 🤔 🤔 it seems that in the first epochs it goes up and down before decreasing after epoch 6. Is this normal? AFAIK it should only go down. Trying to figure this out, I noticed that in the wrapper there is a parameter lr_decay=0.9, but I'm not able to see it used anywhere. So, for example, for the transformers we have:

optimizer, lr_schedule = create_optimizer(
    init_lr=self.training_config.learning_rate,
    num_train_steps=nb_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0.1 * nb_train_steps,
)

or, for non-transformers, with the Adam optimizer:

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=self.training_config.learning_rate,
    decay_steps=nb_train_steps,
    decay_rate=0.1)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

In this case, should I assume decay_rate = 1 - lr_decay? And what about the transformers?
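
For what it's worth, in tf.keras ExponentialDecay the decay_rate is the multiplicative factor applied once every decay_steps steps, so if lr_decay was meant as a per-epoch multiplier it would map to decay_rate itself (with decay_steps set to the steps per epoch) rather than to 1 - lr_decay; that is only my guess about the original intent, though. A small self-contained check with illustrative values:

import tensorflow as tf

# lr(step) = initial_learning_rate * decay_rate ** (step / decay_steps)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,   # stands in for nb_train_steps
    decay_rate=0.1)

print(float(schedule(0)))      # 0.001
print(float(schedule(500)))    # ~0.000316  (1e-3 * 0.1 ** 0.5)
print(float(schedule(1000)))   # 0.0001     (1e-3 * 0.1)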

lfoppiano commented 1 year ago

If I remove the warmup steps, the learning rate no longer goes up and down. Are we sure the warmup steps are necessary for fine-tuning? 🤔

kermitt2 commented 1 year ago

Yes, normally warm-up is important when fine-tuning with transformers and ELMo (if I remember correctly, warmup matters even more than the learning rate decay!).

The create_optimizer method that manages the learning rate and warmup comes directly from the transformers library, and the up and down you see is probably it working as expected. The warmup applies during the first epochs with lower learning rates to avoid sudden overfitting at the very beginning of training. So with warmup, the LR should start lower than init_lr and only reach the init_lr value once the warmup phase is done.
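
The up-then-down shape can be reproduced directly from the schedule returned by create_optimizer; here is a quick sketch with illustrative step counts (not the exact numbers of the run logged above):

from transformers import create_optimizer

nb_train_steps = 480   # e.g. 8 steps/epoch * 60 epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=nb_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=int(0.1 * nb_train_steps))

# the LR ramps up linearly from 0 to init_lr over the 48 warmup steps,
# then decays linearly towards 0 over the remaining steps
for step in (0, 8, 24, 48, 100, 480):
    print(step, float(lr_schedule(step)))

This matches the log above: the learning rate climbs during the warmup portion (the first few epochs) and only starts decreasing once the warmup phase is over.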