Hi there! The code is incomplete, as we have no idea what your dataset and model are. From the error message it looks like the problem is in the logits, so we would need the model to be able to reproduce the error.
Here is the full code:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import json
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from transformers.trainer_callback import EarlyStoppingCallback


class dataset(Dataset):
    def __init__(self, data, labels, tokenizer):
        self.data = data
        self.labels = labels
        self.tokenizer = tokenizer

    def processText(self, text):
        return self.tokenizer(text, truncation=True)

    def __len__(self):
        return len(self.data.index)

    def __getitem__(self, i):
        row = self.data.iloc[i]
        x = self.processText(row['x']).data
        try:
            y = self.labels.index(row['y'])
        except ValueError:
            # Classes outside the top-N list fall into the catch-all "OTHER" class
            y = len(self.labels) - 1
        x['label'] = y
        return x


def getLabels(data, nLabels):
    # Keep the (nLabels - 1) most frequent classes and append "OTHER" as the last one
    serie = data.pivot_table(index=['y'], aggfunc='size')
    labelsList = serie.sort_values(ascending=False).index.values.tolist()
    return labelsList[0:nLabels - 1] + ["OTHER"]


def accuracy(evalPrediction):
    # Note: predictions here are the raw logits returned by the model
    yPred = evalPrediction.predictions
    yTrue = evalPrediction.label_ids
    return {'accuracy': (yPred == yTrue).mean()}


df = pd.read_csv("/content/drive/MyDrive/SNOMED/Biopsias_HUPM_2010-2018_mor_codes-v1.csv", low_memory=False)
df = df[["Diagnostico", "CodOrgano"]]
data = df.rename(columns={'Diagnostico': 'x', 'CodOrgano': 'y'})
data = data.dropna().reset_index(drop=True)
# df = df.iloc[:1000, :]

# 70/30 train/validation split
index = df.index
N = len(index)
P = 0.7
limit = round(N * P)
trainData = data.iloc[:limit, :]
validationData = data.iloc[limit:, :]

nLabels = 51
labels = getLabels(data, nLabels)

model = AutoModelForSequenceClassification.from_pretrained('dccuchile/bert-base-spanish-wwm-uncased', num_labels=nLabels)
tokenizer = AutoTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-uncased', model_max_length=128, use_fast=True)

trainDataset = dataset(trainData, labels, tokenizer)
validationDataset = dataset(validationData, labels, tokenizer)

args = TrainingArguments("/content/drive/MyDrive/SNOMED/TrainingLog",
                         learning_rate=0.0003,
                         num_train_epochs=10,
                         per_device_train_batch_size=32,
                         per_device_eval_batch_size=32,
                         evaluation_strategy="epoch",
                         label_names=labels,
                         disable_tqdm=False,
                         dataloader_num_workers=6,
                         load_best_model_at_end=True,
                         metric_for_best_model="accuracy",
                         greater_is_better=True)
print("\nDEVICE:", args.device)

callbacks = [EarlyStoppingCallback(2, 0.8)]
trainer = Trainer(model,
                  args=args,
                  train_dataset=trainDataset,
                  eval_dataset=validationDataset,
                  tokenizer=tokenizer,
                  callbacks=callbacks,
                  compute_metrics=accuracy)
trainer.train()
Here is the notebook where it can be checked easily: https://colab.research.google.com/drive/1VCacM-CDl2xrIFfwsrkmEh-D0IswK61D?usp=sharing
I'm not sure, but does the model need return_dict=True?
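For reference, return_dict can be enabled when loading the model; a minimal sketch, assuming the same checkpoint as in the script above (the answer below suggests this was not the actual cause):

# Sketch: return_dict=True makes the model return a ModelOutput
# (e.g. SequenceClassifierOutput) instead of a plain tuple
model = AutoModelForSequenceClassification.from_pretrained(
    'dccuchile/bert-base-spanish-wwm-uncased',
    num_labels=nLabels,
    return_dict=True,
)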
One thing that may be linked to this is the label_names = labels in your training arguments. label_names is the name(s) of the field containing your labels. In this case, the default (which is ["labels"]) is what you want, so you should leave it as is.
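A minimal sketch of the corrected arguments, assuming everything else from the script above is unchanged — simply drop label_names so the Trainer falls back to its default of ["labels"]:

args = TrainingArguments("/content/drive/MyDrive/SNOMED/TrainingLog",
                         learning_rate=0.0003,
                         num_train_epochs=10,
                         per_device_train_batch_size=32,
                         per_device_eval_batch_size=32,
                         evaluation_strategy="epoch",
                         # label_names removed: the default ["labels"] is correct here
                         disable_tqdm=False,
                         dataloader_num_workers=6,
                         load_best_model_at_end=True,
                         metric_for_best_model="accuracy",
                         greater_is_better=True)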
I changed my dataset to save the label under "labels" and it worked. It was a really silly problem, thank you so much!!
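A sketch of that dataset change, assuming the rest of the class above is unchanged — the target is stored under 'labels' instead of 'label':

def __getitem__(self, i):
    row = self.data.iloc[i]
    x = self.processText(row['x']).data
    try:
        y = self.labels.index(row['y'])
    except ValueError:
        y = len(self.labels) - 1
    x['labels'] = y  # renamed from 'label' to match the Trainer's default label_names
    return x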
The same silly problem happened to me, thanks a lot! 😵‍💫
Environment info
transformers version: 4.1.0.dev0 (cc @sgugger)
Information
I'm using BERT for sequence classification. I have built my own PyTorch dataset with my data. During training there is no problem, but when evaluation starts it raises an error with the following message:
To reproduce
Here is the code I used (posted in full above):
Both datasets have the same structure: each item is the BatchEncoding.data dict, with a field 'label' added.
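For illustration, a single item would then look roughly like this (the token ids are made up, and the exact keys depend on the tokenizer):

{
    'input_ids': [4, 1277, 2938, 5],       # hypothetical token ids
    'token_type_ids': [0, 0, 0, 0],
    'attention_mask': [1, 1, 1, 1],
    'label': 2,                             # index into the labels list
}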
Expected behavior
It should do the evaluation step correctly.