huggingface / transformers


DeBERTa pretraining using MLM: model gradients become NAN #9881

Closed. mansimane closed this issue 3 years ago.

mansimane commented 3 years ago

Environment info

Who can help

@BigBird01 @NielsRogge

Models: DeBERTa Base

Information

I am using the DeBERTa base model and training it on the masked language modeling (MLM) task with a single file from a Wikipedia text dataset. For the first step the loss is around 11, and after the backward pass the gradients become NaN and the gradient norm goes to infinity. I reduced the learning rate from 1e-4 to 5e-10, but the issue persists. The batch size per GPU is 32, so with 8 GPUs the total batch size is 256. The hyperparameters, configured according to the paper, are below.

To reproduce

Steps to reproduce the behavior:


from transformers import (
    DebertaConfig,
    DebertaTokenizer,
    DebertaForMaskedLM,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)

tokenizer = DebertaTokenizer.from_pretrained('microsoft/deberta-base')

train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/data/wikidemo/wiki_01",
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
config = DebertaConfig()

model = DebertaForMaskedLM(config=config)

training_args = TrainingArguments(
    output_dir="./deberta",
    overwrite_output_dir=True,

    num_train_epochs=1000,
    per_gpu_train_batch_size=2,
    learning_rate=5e-10,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e06,  # 1e-6
    max_grad_norm=1.0,
    save_steps=10_000,
    save_total_limit=2,
    logging_first_step=False,
    logging_steps=1,
    max_steps=10000,
    gradient_accumulation_steps=10,

)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

print("Starting training")
trainer.train()
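
One way to localize where the gradients first become non-finite is to run a single manual forward/backward pass outside the Trainer and check each parameter. The sketch below is not part of the original report; it reuses the objects defined above, and the batch of 8 examples is arbitrary.

import torch

# Build one padded, masked batch from the first few examples (batch size is arbitrary).
batch = data_collator([train_dataset[i] for i in range(8)])

# One forward/backward pass, then report every parameter whose gradient is not finite.
outputs = model(**batch)
outputs.loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient: {name}")
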
patil-suraj commented 3 years ago

hi @mansimane

In your code, in TrainingArguments, adam_epsilon is set to 1e06, which is quite a large value. I believe it's a typo and should be 1e-6, as mentioned in the comment. This could be the reason for the NaN gradients.
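
Epsilon sits in the denominator of Adam's update, so the two values behave very differently. A quick illustration with made-up moment estimates (not library code; m_hat and v_hat are arbitrary):

import math

# Adam update is roughly lr * m_hat / (sqrt(v_hat) + eps); values below are invented.
lr, m_hat, v_hat = 1e-4, 0.01, 1e-4
for eps in (1e-6, 1e06):
    print(eps, lr * m_hat / (math.sqrt(v_hat) + eps))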

mansimane commented 3 years ago

Thanks for the catch, @patil-suraj. I fixed the Adam epsilon, but some gradients still become infinite or NaN after the first backward pass. Below is the config I tried:

training_args = TrainingArguments(
    output_dir="./deberta",
    overwrite_output_dir=True,

    num_train_epochs=1000,
    per_gpu_train_batch_size=32,
    learning_rate=1e-10,

    warmup_steps=10000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    max_grad_norm=1.0,
    save_steps=10_000,
    save_total_limit=2,
    logging_first_step=False,
    logging_steps=1,
    max_steps=10000,
    gradient_accumulation_steps=1,

)

NielsRogge commented 3 years ago

Hi,

sorry for the late reply. I tested MLM with DebertaForMaskedLM using the run_mlm.py script and everything seems to work fine, so this looks like a hyperparameter issue (I would suggest using the same hyperparameter values as that script). Your learning rate, for example, seems way too low.

My Google colab to reproduce: https://colab.research.google.com/drive/1Rk5JoBTzK0I8J3FjG2R4J9HCeOrUpRTt?usp=sharing
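
For comparison, a configuration sketch in the range the example script uses by default. The values below are illustrative and not taken from the notebook; 5e-5 is the TrainingArguments default learning rate.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./deberta",
    overwrite_output_dir=True,
    per_device_train_batch_size=32,
    learning_rate=5e-5,      # the TrainingArguments default; far above 1e-10
    warmup_steps=500,        # illustrative; the original post used 10,000
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
    max_grad_norm=1.0,
    max_steps=10_000,
    logging_steps=100,
    save_steps=10_000,
    save_total_limit=2,
)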

gaceladri commented 3 years ago

I am having the same issue, but with MobileBERT after loading a pretrained model. I trained an LM from scratch for 23,000 steps, then reloaded the model with mobilebert.from_pretrained() to keep training. Now when I try to keep training, the loss is NaN. I have removed everything related to the learning rate from the training args and the NaNs keep appearing.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./mobile_linear_att_4Heads_8L_128_512_03layerdrop_shared_all_dataset_1",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=95,
    save_steps=50,
    save_total_limit=2,
    logging_first_step=True,
    logging_steps=50,
    gradient_accumulation_steps=8,
    fp16=True,
    dataloader_num_workers=19,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=big_dataset,
    tokenizer=tokenizer)

trainer.train()

EDIT: After some debugging I looked into "trainer_state.json" and saw that I already had NaNs in the model before the last training run finished, so at this point it is not related to the learning rate or anything like that.

{
      "cuda max_memory_reserved": 23460839424,
      "cuda memory cached": 23460839424,
      "cuda memory consumption": 111139328,
      "epoch": 0.99,
      "learning_rate": 0.0004937288135593219,
      "loss": 4.5816,
      "num_parameters": 5920442,
      "step": 22900
    },
    {
      "cuda max_memory_reserved": 23460839424,
      "cuda memory cached": 23460839424,
      "cuda memory consumption": 111139328,
      "epoch": 0.99,
      "learning_rate": 0.0004934745762711864,
      "loss": NaN,
      "num_parameters": 5920442,
      "step": 22950
    },
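
One way to confirm that the saved weights themselves already contain NaNs is to scan the checkpoint directly. A sketch, with the checkpoint path purely illustrative (point it at whichever checkpoint was saved around step 22,950):

import torch

# Path is illustrative; adjust to the actual checkpoint directory.
state_dict = torch.load("checkpoint-22950/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if tensor.is_floating_point() and not torch.isfinite(tensor).all():
        print(f"non-finite values in {name}")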

EDIT2: I think my issue is related to the learning rate scheduler. I am training on 20% of the dataset at a time, and I think the scheduler calculates the learning rate based on the epoch and not on the current step, so I hardcoded it in:

self.lr_scheduler = get_scheduler(
                self.args.lr_scheduler_type,
                self.optimizer,
                num_warmup_steps=self.args.warmup_steps,
                num_training_steps=num_training_steps, # <- here I hardcoded the calculated final (20%+20%+20%...) training steps
            )

So when I was approaching the end of training on the first 20%, it did something weird.
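
An alternative to editing the Trainer source is to build the optimizer and scheduler for the full run yourself and hand them to the Trainer via its optimizers argument. A sketch under that assumption; total_training_steps stands for the precomputed step count across all shards, and the warmup value is illustrative:

from torch.optim import AdamW
from transformers import Trainer, get_scheduler

total_training_steps = 115_000  # hypothetical: precomputed over all 20% shards

optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=10_000,          # illustrative
    num_training_steps=total_training_steps,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=big_dataset,
    tokenizer=tokenizer,
    optimizers=(optimizer, lr_scheduler),  # the same schedule spans every shard
)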

gaceladri commented 3 years ago

It's a pain to train on shards of (bookcorpus + wikipedia + openwebtext). I am processing 20% of each one because I don't have more than 1 TB of disk, but I am fighting with the learning rate scheduler, because I have to do extra engineering to train on the whole dataset.

mansimane commented 3 years ago

Thank you @NielsRogge. I was able to train DeBERTa with the run_mlm.py script. I am not sure what the issue was in my code; it still gave NaN even after trying the learning rate that you used.

jaideep11061982 commented 1 year ago

@mansimane are you using fp16 or fp32?