huggingface / notebooks

Notebooks using the Hugging Face libraries 🤗

mT5 fine-tune for en-my got "NaN" in training loss and validation loss #31

Open · learnercat opened this issue 3 years ago

learnercat commented 3 years ago

I tried to fine-tune mT5 for English->Myanmar translation on the Tatoeba-Challenge dataset. I followed this notebook example for en-ro translation and used "google/mt5-small" as the model_checkpoint. I tested training for 1 to 4 epochs. The training arguments are below; I reduced the batch_size to 4.

    batch_size = 4
    args = Seq2SeqTrainingArguments(
        "mt5-translate-en-my",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        weight_decay=0.01,
        save_total_limit=3,
        num_train_epochs=1,
        predict_with_generate=True,
        fp16=True,
    )
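For context, the rest of my setup follows the en-ro notebook; roughly like the sketch below (the tokenized_datasets variable is just a placeholder for my tokenized Tatoeba en-my splits, and compute_metrics is omitted):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Dynamically pads inputs and labels to the longest sequence in each batch.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,  # the Seq2SeqTrainingArguments shown above
    train_dataset=tokenized_datasets["train"],      # placeholder: tokenized en-my pairs
    eval_dataset=tokenized_datasets["validation"],  # placeholder: tokenized en-my pairs
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```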

I got "NaN" in training loss and validation loss as below:

[screenshot "mt5_error": training log with training and validation loss shown as NaN]

Can you please help me figure out how to fix this? Thanks in advance.

msaroufim commented 3 years ago

What kind of hardware are you using? Do you get the same issue if you set fp16=False?
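Concretely, the only change to try would be flipping the fp16 flag in your Seq2SeqTrainingArguments, keeping everything else as in your snippet, e.g.:

```python
from transformers import Seq2SeqTrainingArguments

batch_size = 4
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,  # only change: disable fp16 mixed precision to rule it out as the NaN source
)
```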

Majdoddin commented 1 year ago

@msaroufim Thank you very much, it worked for me on Colab. Also, the warning about calling lr_scheduler.step() before optimizer.step() disappeared. But why should I set fp16=False even when I have an A100 GPU on Colab?