huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Obfuscated text classification error when using CANINE Transformers #13736

santhoshcameo closed this issue 3 years ago

santhoshcameo commented 3 years ago

Who can help

@NielsRogge @patrickvonplaten

Information

The model I am using is CANINE.

I have a set of obfuscated documents consisting of around 30,000 sentences, each of which is labeled (11 labels in total), so this is a multi-class classification problem. The data has been obfuscated, but the patterns in it are preserved.
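Since the number of classes determines `num_labels`, it can help to derive it from the label file rather than hard-code it. The following is a minimal sketch, assuming the labels are stored as integer strings, one per line (as the `int(...)` cast in the code below suggests). The helper name `build_label_mapping` and the sample values are hypothetical and not part of the original post.

```python
from typing import Dict, List, Tuple


def build_label_mapping(raw_labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map raw label strings to contiguous class ids 0..K-1 (hypothetical helper)."""
    # Sort numerically so the id order follows the original label order.
    classes = sorted(set(raw_labels), key=int)
    label_to_id = {label: idx for idx, label in enumerate(classes)}
    return [label_to_id[label] for label in raw_labels], label_to_id


# Made-up labels for illustration only; the real ones come from ytrain.txt.
raw = ["3", "0", "11", "3", "7"]
ids, mapping = build_label_mapping(raw)
num_labels = len(mapping)  # value to pass as num_labels to the model

print(ids, num_labels)
```

A cross-entropy loss expects class ids in the range 0 to `num_labels - 1`, so a contiguous mapping like this avoids out-of-range targets when the raw labels have gaps.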

A single record looks like this:

satwamuluhqgulamlrmvezuhqvkrpmletwulcitwskuhlemvtwamuluhiwiwenuhlrvimvqvkruhulenamuluhqgqvtwvimviwuhtwamuluhulqvkrenamcitwuhvipmpmqvuhskiwkrpmdfuhlrvimvskvikrpmqvuhskmvgzenleuhqvmvamuluhulenamuluhqvletwtwvipmpmgzleenamuhtwamuluhtwletwdfuhiwkrxeleentwxeuhpmqvuhtwiwmvamdfuhpkeztwamuluhvimvuhqvtwmkpmpmlelruhgztwtwskuhtwlrkrpmlruhpmuluhqvenuhtwyplepmxeuhenuhamypkrqvuhamulmvdfuhqvskentwamletwlrlrpmiwuhtwamul

So I decided to try CANINE, since it operates directly on characters. But I am facing some issues; I have attached the code and the exception below.
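As a rough illustration of the character-level idea: CANINE does not use a learned vocabulary, and its tokenizer maps each character to its Unicode code point. The sketch below mimics only that mapping in plain Python; it is not the `CanineTokenizer` implementation, and it leaves out the special tokens that the real tokenizer adds around the sequence. The function name `to_code_points` is hypothetical.

```python
from typing import List


def to_code_points(text: str) -> List[int]:
    """Return one integer id per character (hypothetical helper, no special tokens)."""
    return [ord(ch) for ch in text]


sample = "satwamuluhqg"  # first characters of the record shown above
ids = to_code_points(sample)

# The mapping is lossless, so the text can be recovered from the ids.
decoded = "".join(chr(i) for i in ids)

print(len(ids), decoded == sample)
```

Because every character becomes one position, a record of several hundred characters produces a sequence of several hundred ids, which is worth keeping in mind when choosing `model_max_length` and the batch size.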

with open('xtrain_obfuscated.txt') as f:
    x = f.read().splitlines()
with open('ytrain.txt') as f:
    y = f.read().splitlines()

import torch
from sklearn.model_selection import train_test_split
from transformers import (
    CanineForSequenceClassification,
    CanineTokenizer,
    Trainer,
    TrainingArguments,
)

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2)
tokenizer = CanineTokenizer(model_max_length=512)

tokens_train = tokenizer(x_train, padding='longest', return_tensors='pt')
tokens_val = tokenizer(x_val, padding='longest', return_tensors='pt')

class NovelClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))

        return item

    def __len__(self):
        #print(len(self.labels))
        return len(self.labels)

train_dataset = NovelClassificationDataset(tokens_train, y_train)
val_dataset = NovelClassificationDataset(tokens_val, y_val)
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=12, problem_type="multi_label_classification")

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=10,              # total number of training epochs
    per_device_train_batch_size=13,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

The exception is:

~/opt/anaconda3/envs/task/lib/python3.8/site-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   2578 
   2579     if not (target.size() == input.size()):
-> 2580         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
   2581 
   2582     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

ValueError: Target size (torch.Size([13])) must be the same as input size (torch.Size([13, 12]))

Expected behavior

The expected output is a multi-class classification model based on CANINE, from which I can get predictions on the (obfuscated) test data set.

Please advise.

NielsRogge commented 3 years ago

You're initializing CanineModel, which doesn't accept a labels argument.

You probably want to use CanineForSequenceClassification, which is CANINE with a sequence classification head on top.

Also, please use the forum for training-related questions.
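For readers who hit the same traceback with the code as posted (which already instantiates `CanineForSequenceClassification`): the error is raised from `binary_cross_entropy_with_logits`, which is consistent with `problem_type="multi_label_classification"`. That loss expects a float target of shape `[batch_size, num_labels]`, while the dataset yields one integer class id per example. The sketch below reproduces the shape mismatch with plain PyTorch, using random stand-in tensors with the sizes from the traceback; it is an illustration of the two loss functions, not the library's internal code.

```python
import torch
import torch.nn as nn

batch_size, num_labels = 13, 12  # sizes taken from the traceback

# Random stand-ins for the model output and the dataset labels.
logits = torch.randn(batch_size, num_labels)
labels = torch.randint(0, num_labels, (batch_size,))  # one class id per example

# Multi-label loss: the target must have the same shape as the logits,
# so a target of shape [13] is rejected with a ValueError.
try:
    nn.BCEWithLogitsLoss()(logits, labels.float())
    bce_error = None
except ValueError as exc:
    bce_error = str(exc)

# Single-label (multi-class) loss: accepts integer class ids of shape [13].
ce_loss = nn.CrossEntropyLoss()(logits, labels)

print(bce_error)
print(ce_loss.dim())
```

For a multi-class problem with exactly one label per example, passing `problem_type="single_label_classification"` (or leaving `problem_type` unset while supplying integer labels) selects the cross-entropy loss in the sequence classification head, so the `[13]`-shaped labels are accepted.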