learnercat opened this issue 3 years ago
What kind of hardware are you using? Do you get the same issue if you set `fp16=False`?
@msaroufim Thank you very much, it worked for me on Colab. The warning about calling `lr_scheduler.step()` before `optimizer.step()` also disappeared. But why should I set `fp16=False` even though I have an A100 GPU on Colab?
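For what it's worth, the usual explanation is that T5-family checkpoints, including mT5, were pre-trained in bfloat16; under fp16 autocast some activations overflow to `inf`, which then shows up as `NaN` loss, so disabling fp16 sidesteps the overflow. On an Ampere GPU such as the A100, bf16 mixed precision is a common alternative, since bfloat16 has the same exponent range as fp32. A minimal sketch, assuming a transformers release where `Seq2SeqTrainingArguments` exposes the `bf16` flag:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical variant of the training arguments quoted below: keep mixed
# precision, but in bfloat16 instead of float16. bf16 shares fp32's
# exponent range, so mT5 activations that overflow under fp16 stay finite.
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,  # fp16 autocast is what produces the NaN losses with mT5
    bf16=True,   # needs an Ampere-or-newer GPU (e.g. A100) and a
                 # transformers version that exposes the bf16 flag
)
```

If the installed transformers version predates the `bf16` flag, plain `fp16=False` (full fp32) remains the safe, if slower, fallback.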
I tried to fine-tune mT5 for English->Myanmar translation on the Tatoeba-Challenge dataset, following this notebook example of en-ro translation, with `model_checkpoint` set to `"google/mt5-small"`. I tested training for 1~4 epochs. The training parameters are below; I reduced the batch size to 4.
```python
batch_size = 4
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)
```
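For context, these arguments feed into a `Seq2SeqTrainer` roughly as in the en-ro notebook; the sketch below is an assumption of that wiring, with `tokenized_datasets` standing in for the preprocessed Tatoeba data built earlier in the notebook:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Pads inputs and labels dynamically per batch; label pad positions
# become -100 so the cross-entropy loss ignores them.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,  # the Seq2SeqTrainingArguments defined above
    train_dataset=tokenized_datasets["train"],      # placeholder
    eval_dataset=tokenized_datasets["validation"],  # placeholder
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```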
I got "NaN" in training loss and validation loss as below:
Can you please help me how do I do it? Thanks in advance.