
Trainer bug? Loss and logits are “nan” when fine-tuning NLI model (both RoBERTa/BART) #9160

Closed MoritzLaurer closed 3 years ago

MoritzLaurer commented 3 years ago

Environment info

Who can help

@sgugger

Information

The problem arises when using:

The tasks I am working on are:

Description: I’m trying to fine-tune a pre-trained NLI model (ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli) on a dataset of around 276,000 hypothesis-premise pairs. I’m following the instructions from the docs here and here. When I run the training, it seems like the fine-tuning works (it does the training and saves the checkpoints), but trainer.train() and trainer.evaluate() return "nan" as the loss value.

What I've tried:

=> I would be very thankful for any help on this! (I've been trying to solve this for two days now.) Thanks a lot in advance.

To reproduce

Here is my code:

### load model & tokenize
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

max_length = 256
hg_model_hub_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
# also tried: hg_model_hub_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(hg_model_hub_name)
model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)
model.config

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
  model = model.half()
model.to(device)
model.train();

Running a test inference with the model at this point works fine:

test_enc = tokenizer(nli_train[0]["premise"], nli_train[0]["hypothesis"], return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=True, padding=True)
model.eval();
test_output_loss = model(test_enc["input_ids"].to(device), attention_mask=test_enc["attention_mask"].to(device), token_type_ids=test_enc["token_type_ids"].to(device), labels=torch.tensor(2).to(device))
print(test_output_loss)
#output: SequenceClassifierOutput(loss=tensor(2.2168, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>), logits=tensor([[ 0.4075,  0.8511, -0.7549]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

Then I continue with preprocessing and training:

... some data preprocessing

encodings_train = tokenizer(premise_train, hypothesis_train, return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=False, padding=True)
encodings_val = tokenizer(premise_val, hypothesis_val, return_tensors="pt", max_length=max_length,
                          return_token_type_ids=True, truncation=False, padding=True)
encodings_test = tokenizer(premise_test, hypothesis_test, return_tensors="pt", max_length=max_length,
                           return_token_type_ids=True, truncation=False, padding=True)

### create pytorch dataset object
class XDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        #item = {key: torch.as_tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        #item['labels'] = self.labels[idx]
        return item
    def __len__(self):
        return len(self.labels)

dataset_train = XDataset(encodings_train, label_train)
dataset_val = XDataset(encodings_val, label_val)
dataset_test = XDataset(encodings_test, label_test)

# compute metrics with trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def compute_metrics(pred):
    labels = pred.label_ids
    print(labels)
    preds = pred.predictions.argmax(-1)
    print(preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary', pos_label=0)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

## training
from transformers import Trainer, TrainingArguments

# https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_val             # evaluation dataset
)

trainer.train()
# output: TrainOutput(global_step=181, training_loss=nan)
trainer.evaluate()
# output: 
[2 2 2 0 0 2 2 2 0 2 0 0 2 2 2 2 0 2 0 2 2 2 2 0 2 0 2 0 0 2 0 0 2 0 0 0 2
 0 2 0 0 0 0 0 2 0 0 2 2 2 0 2 2 2 2 2 0 0 0 0 2 0 0 0 2 2 0 0 0 2 0 0 0 2
 2 0 2 0 0 2 2 2 0 2 2 0 0 0 0 0 0 0 2 0 0 0 0 2 0 2 2 0 2 0 0 2 2 2 2 2 2
 2 0 0 0 0 2 0 0 2 0 0 0 0 2 2 2 0 0 0 0 0 2 0 0 2 0 2 0 2 0 2 0 0 2 2 0 0
 2 2 2 2 2 2 0 0 2 2 2 2 0 2 0 0 2 2 2 0 0 2 0 2 0 2 0 0 0 0 0 0 2 0 0 2 2
 0 2 2 2 0 2 2 0 2 2 2 2 2 2 0 0 2 0 0 2 2 0 0 0 2 0 2 2 2 0 0 0 0 0 0 0 0
 2 0 2 2 2 0 2 0 0 2 0 2 2 0 0 0 0 2 2 2 0 0 0 2 2 2 2 0 2 0 2 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

{'epoch': 1.0,
 'eval_accuracy': 0.5137254901960784,
 'eval_f1': 0.6787564766839378,
 'eval_loss': nan,
 'eval_precision': 0.5137254901960784,
 'eval_recall': 1.0}

Running a test inference with the model again after training returns tensor([[nan, nan, nan]]) for some reason:

test_enc = tokenizer(nli_train[0]["premise"], nli_train[0]["hypothesis"], return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=True, padding=True)
model.eval();
test_output_loss = model(test_enc["input_ids"].to(device), attention_mask=test_enc["attention_mask"].to(device), token_type_ids=test_enc["token_type_ids"].to(device), labels=torch.tensor(2).to(device))
print(test_output_loss)
#output: SequenceClassifierOutput(loss=tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>), logits=tensor([[nan, nan, nan]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
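
A quick way to check whether the weights themselves have turned into nan after training (just an illustrative snippet, not part of the code above):

# Sanity check: do any model parameters contain nan after training?
any_nan = any(torch.isnan(p).any().item() for p in model.parameters())
print(f"nan in model parameters: {any_nan}")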

Expected behavior

The model should not return "nan" for the logits and should return a proper loss value.

MoritzLaurer commented 3 years ago

Update: I reran the training in native PyTorch with the following code and did not get the same issue. Does this mean that there is some issue with the Trainer?

import torch

class XDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        #item = {key: torch.as_tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        #item['labels'] = torch.LongTensor(self.labels[idx])
        #item['labels'] = self.labels[idx]
        return item
    def __len__(self):
        return len(self.labels)

dataset_train = XDataset(encodings_train, label_train)
dataset_val = XDataset(encodings_val, label_val)
dataset_test = XDataset(encodings_test, label_test)

from torch.utils.data import DataLoader
from transformers import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()

train_loader = DataLoader(dataset_train, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        print(labels)
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)
        loss = outputs[0] # outputs.loss
        print(loss)
        loss.backward()
        optim.step()
# Output: it prints the labels and the loss correctly!
#tensor([2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 2], device='cuda:0')
#tensor(0.6895, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>) ....

When I rerun the model for a test inference after this native PyTorch training step, it also returns the logits and loss as expected (no "nan").

sgugger commented 3 years ago

In the first code snippet you convert your whole model to FP16 with model.half() (this is not in your second snippet). This is not how mixed-precision training works; you should instead pass the flag fp16=True to your TrainingArguments.
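
For illustration, here is roughly what happens under the hood with mixed precision (a minimal sketch using torch.cuda.amp and the variable names from your second snippet, not the Trainer's actual code): the weights stay in FP32, only the forward pass is autocast to FP16, and the loss is scaled to avoid underflow.

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model.to(device)    # weights stay in FP32 -- no model.half()
model.train()

for batch in train_loader:
    optim.zero_grad()
    batch = {k: v.to(device) for k, v in batch.items()}
    with autocast():                   # forward pass runs in FP16 where safe
        loss = model(**batch).loss
    scaler.scale(loss).backward()      # scale the loss to avoid FP16 underflow
    scaler.step(optim)                 # unscales gradients (needs FP32 params), then steps
    scaler.update()

The Trainer does the equivalent of this internally when you set fp16=True.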

MoritzLaurer commented 3 years ago

Thanks, I don't know much about mixed-precision training (the only reason I added model.half() is that I understood it reduces memory usage). Now, when I add fp16=True, I get the error ValueError: Attempting to unscale FP16 gradients. when running trainer.train():

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=30,
    fp16=True
)

MoritzLaurer commented 3 years ago

Cool, but when I remove the model.half(), it does return the loss, that's great!

sgugger commented 3 years ago

Yes you have to remove that line, that's what I was saying :-)
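
The ValueError you saw with fp16=True came from keeping model.half(): the gradient scaler cannot unscale gradients of FP16 parameters, so the weights have to stay in FP32. For reference, a minimal sketch of the working setup, reusing the names from your first snippet:

model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)  # FP32 weights, no model.half()

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=30,
    fp16=True,          # Trainer handles the FP16 casting and loss scaling itself
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
)
trainer.train()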

MoritzLaurer commented 3 years ago

Great, so I understand that I can use mixed precision training by simply passing the flag fp16=True without manual modifications to the model. Is there actually any good reason not to pass "fp16=True"? The articles on mixed precision training I've found seem to be very positive about it.

In any case, thanks for solving my issue! :)

sgugger commented 3 years ago

There is no reason not to use it, no. Sometimes for debugging purposes, or there may be one of the exotic models that don't support FP16, but in general it's a good way to speed up training and save GPU memory.

Closing the issue since it's solved!

artmatsak commented 2 years ago

> there may be one of the exotic models that don't support FP16

That was my case with ltgoslo/norbert producing the nan loss with FP16. Setting fp16 to False solved the issue, thanks!