huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Error during validation Trainer step #8946

Closed Javier-Jimenez99 closed 3 years ago

Javier-Jimenez99 commented 3 years ago

Environment info

@sgugger

Information

I'm using BERT for sequence classification. I have built my own PyTorch dataset with my data. Training runs without problems, but when evaluation starts it fails with the following error:

/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in train(self, model_path, trial)
    801 
    802             self.control = self.callback_handler.on_epoch_end(self.args, self.state, self.control)
--> 803             self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
    804 
    805             if self.args.tpu_metrics_debug or self.args.debug:

/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch)
    863         metrics = None
    864         if self.control.should_evaluate:
--> 865             metrics = self.evaluate()
    866             self._report_to_hp_search(trial, epoch, metrics)
    867 

/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys)
   1278             # self.args.prediction_loss_only
   1279             prediction_loss_only=True if self.compute_metrics is None else None,
-> 1280             ignore_keys=ignore_keys,
   1281         )
   1282 

/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in prediction_loop(self, dataloader, description, prediction_loss_only, ignore_keys)
   1387                 losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
   1388             if logits is not None:
-> 1389                 preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
   1390             if labels is not None:
   1391                 labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100)

/usr/local/lib/python3.6/dist-packages/transformers/trainer_pt_utils.py in nested_concat(tensors, new_tensors, padding_index)
     82     ), f"Expected `tensors` and `new_tensors` to have the same type but found {type(tensors)} and {type(new_tensors)}."
     83     if isinstance(tensors, (list, tuple)):
---> 84         return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
     85     elif isinstance(tensors, torch.Tensor):
     86         return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

/usr/local/lib/python3.6/dist-packages/transformers/trainer_pt_utils.py in <genexpr>(.0)
     82     ), f"Expected `tensors` and `new_tensors` to have the same type but found {type(tensors)} and {type(new_tensors)}."
     83     if isinstance(tensors, (list, tuple)):
---> 84         return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
     85     elif isinstance(tensors, torch.Tensor):
     86         return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

/usr/local/lib/python3.6/dist-packages/transformers/trainer_pt_utils.py in nested_concat(tensors, new_tensors, padding_index)
     84         return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
     85     elif isinstance(tensors, torch.Tensor):
---> 86         return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
     87     elif isinstance(tensors, np.ndarray):
     88         return numpy_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

/usr/local/lib/python3.6/dist-packages/transformers/trainer_pt_utils.py in torch_pad_and_concatenate(tensor1, tensor2, padding_index)
     45 def torch_pad_and_concatenate(tensor1, tensor2, padding_index=-100):
     46     """Concatenates `tensor1` and `tensor2` on first axis, applying padding on the second if necessary."""
---> 47     if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]:
     48         return torch.cat((tensor1, tensor2), dim=0)
     49 

IndexError: tuple index out of range

To reproduce

Here is the code I used:

args = TrainingArguments("/content/drive/MyDrive/SNOMED/TrainingLog",
                         learning_rate = 0.0003,
                         num_train_epochs = 10,
                         per_device_train_batch_size = 32,
                         per_device_eval_batch_size = 32,
                         evaluation_strategy = "epoch",
                         label_names = labels,
                         disable_tqdm = False,
                         dataloader_num_workers = 6,
                         load_best_model_at_end = True,
                         metric_for_best_model = "accuracy",
                         greater_is_better = True)

print("\nDEVICE:",args.device)

callbacks = [EarlyStoppingCallback(2,0.8)]

trainer = Trainer(model,
                  args = args,
                  train_dataset = trainDataset, 
                  eval_dataset = validationDataset,
                  tokenizer = tokenizer, 
                  callbacks = callbacks,
                  compute_metrics = accuracy)

trainer.train()

Both datasets have the same structure: each item is the BatchEncoding.data dict returned by the tokenizer, with a 'label' field added.
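For illustration, a single item from either dataset looks roughly like this (a sketch; the actual keys, token ids, and sequence lengths depend on the tokenizer and the input text):

item = {
    'input_ids': [4, 1031, 2257, 5],   # token ids produced by the tokenizer
    'token_type_ids': [0, 0, 0, 0],    # segment ids (BERT-style)
    'attention_mask': [1, 1, 1, 1],    # 1 for real tokens, 0 for padding
    'label': 12,                       # index into the label list
}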

Expected behavior

The evaluation step should run correctly.

sgugger commented 3 years ago

Hi there! The code is incomplete, as we have no idea what your dataset and model are. From the error message it looks like the problem is in the logits, so we would need the model to be able to reproduce the error.

Javier-Jimenez99 commented 3 years ago

Here is the full code:

import torch 
from transformers import AutoTokenizer, AutoModelForSequenceClassification,Trainer, TrainingArguments
import json
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from transformers.trainer_callback import EarlyStoppingCallback

class dataset(Dataset):
    def __init__(self, data, labels, tokenizer):
        self.data = data
        self.labels = labels
        self.tokenizer = tokenizer

    def processText(self, text):
        return self.tokenizer(text, truncation=True)

    def __len__(self):
        return len(self.data.index)

    def __getitem__(self, i):
        x = self.processText(self.data.iloc[i]['x']).data

        # Classes outside the label list are mapped to the last entry ("OTHER")
        try:
            y = self.labels.index(self.data.iloc[i]['y'])
        except ValueError:
            y = len(self.labels) - 1

        x['label'] = y
        return x

def getLabels(data, nLabels):
    # Keep the (nLabels - 1) most frequent classes and add a catch-all "OTHER"
    serie = data.pivot_table(index=['y'], aggfunc='size')
    labelsList = serie.sort_values(ascending=False).index.values.tolist()

    return labelsList[0:nLabels - 1] + ["OTHER"]

def accuracy(evalPrediction):
    # predictions are logits of shape (n_samples, n_labels); take the argmax
    yPred = evalPrediction.predictions.argmax(axis=-1)
    yTrue = evalPrediction.label_ids

    return {'accuracy': (yPred == yTrue).mean()}

df = pd.read_csv("/content/drive/MyDrive/SNOMED/Biopsias_HUPM_2010-2018_mor_codes-v1.csv",low_memory=False)
df = df[["Diagnostico", "CodOrgano"]]

data = df.rename(columns = {'Diagnostico':'x','CodOrgano':'y'})
data = data.dropna().reset_index(drop=True)

#df = df.iloc[:1000,:]

index = df.index
N = len(index)
P = 0.7
limit = round(N*P)

trainData = data.iloc[:limit,:]
validationData = data.iloc[limit:,:]

nLabels = 51

labels = getLabels(data,nLabels)

model = AutoModelForSequenceClassification.from_pretrained('dccuchile/bert-base-spanish-wwm-uncased',num_labels = nLabels)
tokenizer = AutoTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-uncased',model_max_length = 128, use_fast=True)

trainDataset = dataset(trainData,labels,tokenizer)
validationDataset = dataset(validationData,labels,tokenizer)

args = TrainingArguments("/content/drive/MyDrive/SNOMED/TrainingLog",
                         learning_rate = 0.0003,
                         num_train_epochs = 10,
                         per_device_train_batch_size = 32,
                         per_device_eval_batch_size = 32,
                         evaluation_strategy = "epoch",
                         label_names = labels,
                         disable_tqdm = False,
                         dataloader_num_workers = 6,
                         load_best_model_at_end = True,
                         metric_for_best_model = "accuracy",
                         greater_is_better = True)

print("\nDEVICE:",args.device)

callbacks = [EarlyStoppingCallback(2,0.8)]

trainer = Trainer(model,
                  args = args,
                  train_dataset = trainDataset, 
                  eval_dataset = validationDataset,
                  tokenizer = tokenizer, 
                  callbacks = callbacks,
                  compute_metrics = accuracy)

trainer.train()

Here is the notebook where it can be checked easily: https://colab.research.google.com/drive/1VCacM-CDl2xrIFfwsrkmEh-D0IswK61D?usp=sharing

I'm not sure, but does the model need return_dict = True?

sgugger commented 3 years ago

One thing that may be linked to this is the label_names = labels in your training arguments. label_names is the name(s) of the field(s) containing your labels. In this case the default (which is ["labels"]) is what you want, so you should leave it unset.
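A minimal sketch of the suggested change (an illustration, not code from this thread): drop label_names so the Trainer falls back to its default ["labels"], and have the dataset expose the target under that key.

args = TrainingArguments("/content/drive/MyDrive/SNOMED/TrainingLog",
                         learning_rate = 0.0003,
                         num_train_epochs = 10,
                         per_device_train_batch_size = 32,
                         per_device_eval_batch_size = 32,
                         evaluation_strategy = "epoch",
                         # label_names removed: the default ["labels"] matches the dataset key
                         disable_tqdm = False,
                         dataloader_num_workers = 6,
                         load_best_model_at_end = True,
                         metric_for_best_model = "accuracy",
                         greater_is_better = True)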

Javier-Jimenez99 commented 3 years ago

I changed my dataset to store the label under "labels" and it worked. It was a really silly problem, thank you so much!!
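For completeness, a sketch of what that dataset change looks like (the edited code is not shown in the thread, so this is an assumption based on the comments above):

    def __getitem__(self, i):
        x = self.processText(self.data.iloc[i]['x']).data

        try:
            y = self.labels.index(self.data.iloc[i]['y'])
        except ValueError:
            y = len(self.labels) - 1

        x['labels'] = y  # "labels" (plural) matches the Trainer's default label_names
        return x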

yanyc428 commented 2 years ago

The same silly problem happened to me, thanks a lot! 😵‍💫