
Trainer bug? Loss and logits are “nan” when fine-tuning NLI model (both RoBERTa/BART) #9160

Closed MoritzLaurer closed 3 years ago

MoritzLaurer commented 3 years ago

Environment info

Who can help

@sgugger

Information

The problem arises when using:

The tasks I am working on are:

Description: I’m trying to fine-tune a pre-trained NLI model (ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli) on a dataset of around 276,000 hypothesis-premise pairs. I’m following the instructions from the docs here and here. When I run the training, it seems like the fine-tuning works (it does the training and saves the checkpoints), but trainer.train() and trainer.evaluate() return "nan" as the loss value.

What I've tried:

=> I would be very thankful for any help on this! (I've been trying to solve this for two days now.) Thanks a lot in advance.

To reproduce

Here is my code:

### load model & tokenize
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

max_length = 256
hg_model_hub_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
# also tried: hg_model_hub_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(hg_model_hub_name)
model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)
model.config

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
  model = model.half()
model.to(device)
model.train();

Running a test inference with the model at this point works fine:

test_enc = tokenizer(nli_train[0]["premise"], nli_train[0]["hypothesis"], return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=True, padding=True)
model.eval();
test_output_loss = model(test_enc["input_ids"].to(device), attention_mask=test_enc["attention_mask"].to(device), token_type_ids=test_enc["token_type_ids"].to(device), labels=torch.tensor(2).to(device))
print(test_output_loss)
#output: SequenceClassifierOutput(loss=tensor(2.2168, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>), logits=tensor([[ 0.4075,  0.8511, -0.7549]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

Then I continue with preprocessing and training:

... some data preprocessing

encodings_train = tokenizer(premise_train, hypothesis_train, return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=False, padding=True)
encodings_val = tokenizer(premise_val, hypothesis_val, return_tensors="pt", max_length=max_length,
                          return_token_type_ids=True, truncation=False, padding=True)
encodings_test = tokenizer(premise_test, hypothesis_test, return_tensors="pt", max_length=max_length,
                           return_token_type_ids=True, truncation=False, padding=True)

### create pytorch dataset object
class XDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        #item = {key: torch.as_tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        #item['labels'] = self.labels[idx]
        return item
    def __len__(self):
        return len(self.labels)

dataset_train = XDataset(encodings_train, label_train)
dataset_val = XDataset(encodings_val, label_val)
dataset_test = XDataset(encodings_test, label_test)

# compute metrics with trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def compute_metrics(pred):
    labels = pred.label_ids
    print(labels)
    preds = pred.predictions.argmax(-1)
    print(preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary', pos_label=0)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

## training
from transformers import Trainer, TrainingArguments

# https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_val             # evaluation dataset
)

trainer.train()
# output: TrainOutput(global_step=181, training_loss=nan)
trainer.evaluate()
# output: 
[2 2 2 0 0 2 2 2 0 2 0 0 2 2 2 2 0 2 0 2 2 2 2 0 2 0 2 0 0 2 0 0 2 0 0 0 2
 0 2 0 0 0 0 0 2 0 0 2 2 2 0 2 2 2 2 2 0 0 0 0 2 0 0 0 2 2 0 0 0 2 0 0 0 2
 2 0 2 0 0 2 2 2 0 2 2 0 0 0 0 0 0 0 2 0 0 0 0 2 0 2 2 0 2 0 0 2 2 2 2 2 2
 2 0 0 0 0 2 0 0 2 0 0 0 0 2 2 2 0 0 0 0 0 2 0 0 2 0 2 0 2 0 2 0 0 2 2 0 0
 2 2 2 2 2 2 0 0 2 2 2 2 0 2 0 0 2 2 2 0 0 2 0 2 0 2 0 0 0 0 0 0 2 0 0 2 2
 0 2 2 2 0 2 2 0 2 2 2 2 2 2 0 0 2 0 0 2 2 0 0 0 2 0 2 2 2 0 0 0 0 0 0 0 0
 2 0 2 2 2 0 2 0 0 2 0 2 2 0 0 0 0 2 2 2 0 0 0 2 2 2 2 0 2 0 2 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

{'epoch': 1.0,
 'eval_accuracy': 0.5137254901960784,
 'eval_f1': 0.6787564766839378,
 'eval_loss': nan,
 'eval_precision': 0.5137254901960784,
 'eval_recall': 1.0}

Running a test inference with the model again after training returns tensor([[nan, nan, nan]]) for some reason:

test_enc = tokenizer(nli_train[0]["premise"], nli_train[0]["hypothesis"], return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=True, padding=True)
model.eval();
test_output_loss = model(test_enc["input_ids"].to(device), attention_mask=test_enc["attention_mask"].to(device), token_type_ids=test_enc["token_type_ids"].to(device), labels=torch.tensor(2).to(device))
print(test_output_loss)
#output: SequenceClassifierOutput(loss=tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>), logits=tensor([[nan, nan, nan]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
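
A quick way to check whether the weights themselves have turned into nan after training (just an illustrative snippet, not part of the code above):

# Sanity check: do any model parameters contain nan after training?
any_nan = any(torch.isnan(p).any().item() for p in model.parameters())
print(f"nan in model parameters: {any_nan}")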

Expected behavior

The model should not return "nan" for the logits and should return a proper loss value.

MoritzLaurer commented 3 years ago

Update: I reran the training in native PyTorch with the following code and did not get the same issue. Does this mean that there is some issue with the Trainer?

import torch

class XDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        #item = {key: torch.as_tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        #item['labels'] = torch.LongTensor(self.labels[idx])
        #item['labels'] = self.labels[idx]
        return item
    def __len__(self):
        return len(self.labels)

dataset_train = XDataset(encodings_train, label_train)
dataset_val = XDataset(encodings_val, label_val)
dataset_test = XDataset(encodings_test, label_test)

from torch.utils.data import DataLoader
from transformers import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()

train_loader = DataLoader(dataset_train, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        print(labels)
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)
        loss = outputs[0] # outputs.loss
        print(loss)
        loss.backward()
        optim.step()
# Output: it prints the labels and the loss correctly!
#tensor([2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 2], device='cuda:0')
#tensor(0.6895, device='cuda:0', dtype=torch.float16, grad_fn=<NllLossBackward>) ....

When I rerun the model for a test inference after this native PyTorch training step, it also returns the logits and loss as expected (no "nan").

sgugger commented 3 years ago

In the first code snippet you convert your whole model to FP16 with model.half() (this is not in your second snippet). This is not how mixed-precision training works; you should instead pass the flag fp16=True to your TrainingArguments.
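
For illustration, here is roughly what happens under the hood with mixed precision (a minimal sketch using torch.cuda.amp and the variable names from your second snippet, not the Trainer's actual code): the weights stay in FP32, only the forward pass is autocast to FP16, and the loss is scaled to avoid underflow.

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model.to(device)    # weights stay in FP32 -- no model.half()
model.train()

for batch in train_loader:
    optim.zero_grad()
    batch = {k: v.to(device) for k, v in batch.items()}
    with autocast():                   # forward pass runs in FP16 where safe
        loss = model(**batch).loss
    scaler.scale(loss).backward()      # scale the loss to avoid FP16 underflow
    scaler.step(optim)                 # unscales gradients (needs FP32 params), then steps
    scaler.update()

The Trainer does the equivalent of this internally when you set fp16=True.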

MoritzLaurer commented 3 years ago

Thanks, I don't know much about mixed-precision training (the only reason I added model.half() is that I understood it reduces memory usage). Now, when I add fp16=True, I get the error ValueError: Attempting to unscale FP16 gradients. when running trainer.train():

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=30,
    fp16=True
)

MoritzLaurer commented 3 years ago

Cool, but when I remove the model.half(), it does return the loss, that's great!

sgugger commented 3 years ago

Yes you have to remove that line, that's what I was saying :-)
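
The ValueError you saw with fp16=True came from keeping model.half(): the gradient scaler cannot unscale gradients of FP16 parameters, so the weights have to stay in FP32. For reference, a minimal sketch of the working setup, reusing the names from your first snippet:

model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)  # FP32 weights, no model.half()

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=30,
    fp16=True,          # Trainer handles the FP16 casting and loss scaling itself
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
)
trainer.train()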

MoritzLaurer commented 3 years ago

Great, so I understand that I can use mixed precision training by simply passing the flag fp16=True without manual modifications to the model. Is there actually any good reason not to pass "fp16=True"? The articles on mixed precision training I've found seem to be very positive about it.

In any case, thanks for solving my issue! :)

sgugger commented 3 years ago

There is no reason not to use it, no. Sometimes for debugging purposes, or there may be one of the exotic models that don't support FP16, but in general it's a good way to speed up training and save GPU memory.

Closing the issue since it's solved!

artmatsak commented 2 years ago

> there may be one of the exotic models that don't support FP16

That was my case with ltgoslo/norbert producing the nan loss with FP16. Setting fp16 to False solved the issue, thanks!