huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

_tokenizer.decode TypeError: 'list' object cannot be interpreted as an integer #16355

Closed luisgg98 closed 2 years ago

luisgg98 commented 2 years ago

Information

Model I am using (Bert, XLNet ...): BERT2BERT fine-tuned for Spanish paraphrasing: https://huggingface.co/mrm8488/bert2bert_shared-spanish-finetuned-paus-x-paraphrasing

Error

(Screenshot of the error; the traceback is reproduced below.)

  File "bert2bert_paraphraser.py", line 204, in <module>
    paraphraser.train(True)
  File "bert2bert_paraphraser.py", line 107, in train
    trainer.train()
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/trainer.py", line 1399, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/trainer.py", line 1521, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/trainer_seq2seq.py", line 70, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/trainer.py", line 2165, in evaluate
    metric_key_prefix=metric_key_prefix,
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/trainer.py", line 2401, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "bert2bert_paraphraser.py", line 159, in compute_metrics
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 3208, in batch_decode
    for seq in sequences
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 3208, in <listcomp>
    for seq in sequences
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 3244, in decode
    **kwargs,
  File "/data/anaconda3/envs/motoria_paraphrase/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 531, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
TypeError: 'list' object cannot be interpreted as an integer
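
For context on why this fails: Seq2SeqTrainer only passes generated token ids to compute_metrics when predict_with_generate=True; otherwise EvalPrediction.predictions holds the raw logits (one list of vocabulary scores per position), so tokenizer.batch_decode receives nested lists instead of integers. A minimal sketch of that failure mode, using a generic BERT tokenizer and made-up values rather than the setup from this issue:

from transformers import AutoTokenizer

# Any fast tokenizer reproduces this; "bert-base-cased" is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# A 2-D batch of token ids is what batch_decode expects:
print(tokenizer.batch_decode([[101, 7592, 102]], skip_special_tokens=True))

# A 3-D, logits-like input (one list of scores per position) fails as in the traceback above:
logits_like = [[[0.1, 0.9], [0.8, 0.2]]]
tokenizer.batch_decode(logits_like, skip_special_tokens=True)
# TypeError: 'list' object cannot be interpreted as an integer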

To reproduce

import nltk  # compute_metrics below uses nltk.sent_tokenize
import numpy as np
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback
from transformers import EncoderDecoderModel
from paraphraser import *
import sys
import os
APP_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, APP_ROOT)
sys.stdout.flush()
# STATIC VALUES
PATH_FILE_CSV_DATASETS =
PRETRAINED_MODEL = 
TRAIN_EPOCHS = 5
MAX_LEN = 512

encoder_max_length = MAX_LEN
decoder_max_length = MAX_LEN

# TODO: change to the correct dataset
# load rouge for validation
print("Load metrics, models and tokenizer")
model = EncoderDecoderModel.from_pretrained(PRETRAINED_MODEL)
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL)
# 2 - Prepare TrainingArguments
# Arguments for training

class Bert2BertParaphrase(Paraphraser):
    #PATH, local_files_only=True

    def __init__(self,
                 batch_size=2,
                 model_pretrained="bert2bert",
                 epoch_size=10,  # change to 16 for full training
                 number_of_steps=5000,
                 cuda_id=0
                 ):

        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = str(cuda_id)
        # MODEL THAT IS GOING TO BE USED
        # I think it was mrm8488/bert2bert_shared-spanish-finetuned-paus-x-paraphrasing
        self.output_dir = str(model_pretrained + '-motoria-paraphrasing')

        print("Preparing arguments for training...")
        self.args = Seq2SeqTrainingArguments(
            output_dir=self.output_dir,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            #predict_with_generate=True,
            # evaluate_during_training=True,
            evaluation_strategy='steps',
            do_train=True,
            # do_train (bool, optional, defaults to False) — Whether to run training or not.
            #  This argument is not directly used by Trainer, it’s intended to be used by your training/evaluation scripts instead.
            #  See the example scripts for more details.
            do_eval=True,
            # do_eval (bool, optional) — Whether to run evaluation on the validation set or not.
            # Will be set to True if evaluation_strategy is different from "no".
            # This argument is not directly used by Trainer, it’s intended to be used by your training/evaluation scripts instead.
            # See the example scripts for more details.
            save_steps=number_of_steps,
            # max_steps=1500, # delete for full training
            overwrite_output_dir=True,
            save_total_limit=10,
            fp16=True,
            num_train_epochs=epoch_size,
            # fp16 (bool, optional, defaults to False) — Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
            load_best_model_at_end=True,
            push_to_hub=False,
            #metric_for_best_model="bleu",
            #eval_accumulation_steps=1,
            eval_steps=number_of_steps
        )

    def train(self, bool_local, dataset_path=PATH_FILE_CSV_DATASETS):
        print("Lets the training begin!")
        # 1 - Load BERT AS TOKENIZER
        # Loading the BERT Tokenizer
        # change to 16 for full training
        print("Preprocessing file...")
        tokenized_datasets = self.preprocess_datasets(dataset_path, bool_local)
        print("Preprocess done!")
        # The data collator puts together all the examples inside a batch
        data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
        print("Preparing arguments")
        trainer = Seq2SeqTrainer(
            model,
            self.args,
            train_dataset=tokenized_datasets["train"],
            eval_dataset=tokenized_datasets["test"],
            data_collator=data_collator,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
        )
        print("It is time to process!")
        trainer.train()
        print("Training done")

        trainer.save_model(self.output_dir)

# 3 - PROCESS DATA AS BERT REQUIRES
    def preprocess_function(self, batch):
        # Tokenize the input and target data
        """
        Parameters according to:
        https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/encoder-decoder#transformers.EncoderDecoderModel
        """
        inputs = tokenizer(batch["source"], padding="max_length",
                           truncation=True, max_length=encoder_max_length)
        outputs = tokenizer(batch["target"], padding="max_length",
                            truncation=True, max_length=decoder_max_length)

        batch["input_ids"] = inputs.input_ids
        # input_ids (torch.LongTensor of shape (batch_size, sequence_length))
        # Indices of input sequence tokens in the vocabulary.
        batch["attention_mask"] = inputs.attention_mask
        # attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional)
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
        batch["decoder_input_ids"] = outputs.input_ids
        # decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional)
        # Indices of decoder input sequence tokens in the vocabulary.
        batch["decoder_attention_mask"] = outputs.attention_mask
        # decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional)
        # Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
        batch["labels"] = outputs.input_ids.copy()
        # labels (torch.LongTensor of shape (batch_size, sequence_length), optional)
        # Labels for computing the masked language modeling loss for the decoder.
        # Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring)
        # Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
        batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels]
                           for labels in batch["labels"]]

        return batch

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds_bleu, decoded_labels_bleu = postprocess_text(
        decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds_bleu,
                            references=decoded_labels_bleu)
    meteor_result = meteor.compute(
        predictions=decoded_preds_bleu, references=decoded_labels_bleu)
    prediction_lens = [np.count_nonzero(
        pred != tokenizer.pad_token_id) for pred in preds]

    result = {'bleu': result['score']}

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                     for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip()))
                      for label in decoded_labels]
    result_rouge = rouge.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result["rouge"] = result_rouge['rougeL'].mid.fmeasure
    result["gen_len"] = np.mean(prediction_lens)
    result["meteor"] = meteor_result["meteor"]
    result = {k: round(v, 4) for k, v in result.items()}
    return result

if __name__ == "__main__":
    print(f"Arguments count: {len(sys.argv)}")
    for i, arg in enumerate(sys.argv):
        print(f"Argument {i:>3}: {arg}")

    paraphraser = Bert2BertParaphrase(
        batch_size=int(sys.argv[1]),
        epoch_size=int(sys.argv[2]),
        model_pretrained="BERT2BERT-model-"+str(sys.argv[1])+"-"+str(sys.argv[2]),
        cuda_id=0
    )
    paraphraser.train(True)
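
The script calls metric.compute, meteor.compute, and rouge.compute inside compute_metrics without ever loading those objects, and nltk.sent_tokenize additionally needs the punkt models; that setup presumably lives in an omitted part of the file. A sketch of what it might look like, assuming the datasets.load_metric API that was current at the time (the specific metric names are inferred from how the results are indexed above):

from datasets import load_metric
import nltk

nltk.download("punkt", quiet=True)  # nltk.sent_tokenize needs the punkt models

metric = load_metric("sacrebleu")   # compute(...)["score"] matches result['score'] above
meteor = load_metric("meteor")      # compute(...)["meteor"]
rouge = load_metric("rouge")        # compute(...)["rougeL"].mid.fmeasure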

This function is defined in another file (it relies on pandas and the Hugging Face datasets library).

    def preprocess_datasets(self, dataset_path, bool_local):
        """
        Reads the CSV file which contains all the data and transforms
        it into a dataset split into a train set and a test set.
        """
        if bool_local:
            dataset_path = PAWS_X_CSV
            dataset_content = pd.read_csv(
                dataset_path, engine='python', sep=r"\;\;")
            # columns: "sentence1", "sentence2", "label"
            dataset_paws_x = dataset_content[["sentence1", "sentence2", "label"]]
            dataset_paws_x.columns = ['source', 'target', 'label']

            dataset_paws_x = dataset_paws_x.dropna(axis=0, how='any')
            dataset_paws_x = dataset_paws_x.loc[dataset_paws_x['label'] != "0"]

            df2 = dataset_paws_x[['source', 'target']]

            dataset_path = FORTEC_DATASET
            dataset_content = pd.read_csv(
                dataset_path, engine='python', header=None, sep=r"\;\;")
            created_dataset = dataset_content[[0, 1]]
            created_dataset.columns = ['source', 'target']
            created_dataset = created_dataset.dropna(axis=0, how='any')
            df1 = created_dataset.loc[created_dataset['target'].notna()]

            frames = [df1, df2]

            dataset = pd.concat(frames)
            dataset = dataset[['source', 'target']]

            dataset = Dataset.from_pandas(dataset)
            dataset = dataset.remove_columns('__index_level_0__')

            dataset = dataset.filter(
                lambda example: example['target'] != None)

            dataset = dataset.filter(
                lambda example: example['source'] != None)

            print(dataset)

            train_testvalid = dataset.train_test_split(test_size=0.1)
            print(train_testvalid)

            tokenized_dataset = train_testvalid.map(
                self.preprocess_function, batched=True)
        else:
            dataset = load_dataset(dataset_path, 'labeled_final')
            dataset_paraphrase = dataset.filter(
                lambda example: example['label'] > 0)
            dataset_paraphrase = dataset_paraphrase.remove_columns("label")
            dataset_paraphrase = dataset_paraphrase.remove_columns("id")
            # shuffle returns a new dataset and does not accept buffer_size here
            dataset_paraphrase = dataset_paraphrase.shuffle(seed=42)
            tokenized_dataset = dataset_paraphrase.map(
                self.preprocess_function, batched=True)

        return tokenized_dataset

Problem Description

Hello! Sorry, I've only recently started working with language models, so I apologize if I've forgotten to include something that would help you help me with this issue. I was fine-tuning the model https://huggingface.co/mrm8488/bert2bert_shared-spanish-finetuned-paus-x-paraphrasing when, during evaluation, I got the error shown in the screenshot. I don't understand why this is happening, because I built the compute_metrics function according to this tutorial: https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best

Sorry if it is a naive mistake; I don't mean to bother you, but I don't know how to solve it. Thank you so much!

LysandreJik commented 2 years ago

Hello, thanks for opening an issue! We try to keep the GitHub issues for bugs/feature requests. Could you ask your question on the forum for additional chances of getting an answer?

Thanks!

luisgg98 commented 2 years ago

@LysandreJik Sorry, I didn't know the forum existed; I will write a post there as well. Thank you!

luisgg98 commented 2 years ago

Reading through the Hugging Face community forum I was able to find the solution here. What was wrong with my code is that I completely forgot to set the value predict_with_generate to True:

predict_with_generate=True

Now it is training perfectly, thank you!
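
For reference, the flag belongs in the Seq2SeqTrainingArguments shown (commented out) in the script above; with it enabled, the trainer runs generate() during evaluation, so compute_metrics receives decodable token ids instead of logits. A minimal sketch with illustrative values, not the exact ones from this issue:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-paraphrasing",  # illustrative output path
    predict_with_generate=True,           # pass generated token ids, not logits, to compute_metrics
    evaluation_strategy="steps",
    eval_steps=5000,
    per_device_eval_batch_size=2,
)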