huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[run_translation.py] out of range integral type conversion attempted #22634

Closed: stas00 closed this issue 1 year ago

stas00 commented 1 year ago

Splitting off from https://github.com/huggingface/transformers/issues/22571 as it was a secondary problem reported there:

Reproduction

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-base --do_train --do_eval --source_lang en \
--target_lang de --source_prefix 'translate English to German: ' \
--dataset_name stas/wmt14-en-de-pre-processed --output_dir \
/tmp/tst-translation --num_train_epochs 1 --per_device_train_batch_size=1 \
--max_train_samples 10 --overwrite_output_dir --seed 1137 \
--per_device_eval_batch_size 1 --predict_with_generate --fp16 \
--max_eval_samples 10

fails inside eval:

[INFO|trainer.py:3126] 2023-04-04 09:28:07,548 >> ***** Running Evaluation *****
[INFO|trainer.py:3128] 2023-04-04 09:28:07,548 >>   Num examples = 10
[INFO|trainer.py:3131] 2023-04-04 09:28:07,548 >>   Batch size = 1
[INFO|configuration_utils.py:575] 2023-04-04 09:28:07,552 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.28.0.dev0"
}

100%|██████████| 10/10 [00:02<00:00,  3.72it/s]
Traceback (most recent call last):
  File "examples/pytorch/translation/run_translation.py", line 664, in <module>
    main()
  File "examples/pytorch/translation/run_translation.py", line 605, in main
    metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer_seq2seq.py", line 159, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 2990, in evaluate
    output = eval_loop(
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 3278, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "examples/pytorch/translation/run_translation.py", line 546, in compute_metrics
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 3445, in batch_decode
    return [
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 3446, in <listcomp>
    self.decode(
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 3485, in decode
    return self._decode(
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/tokenization_utils_fast.py", line 549, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted

@sgugger

sgugger commented 1 year ago

cc @ArthurZucker Seems like the model and tokenizer have mismatched lengths

ArthurZucker commented 1 year ago

Yeah, but:

File ~/Work/transformers/src/transformers/tokenization_utils_base.py:3485, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3482 # Convert inputs to python lists
   3483 token_ids = to_py_obj(token_ids)
-> 3485 return self._decode(
   3486     token_ids=token_ids,
   3487     skip_special_tokens=skip_special_tokens,
   3488     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3489     **kwargs,
   3490 )

File ~/Work/transformers/src/transformers/tokenization_utils_fast.py:549, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    547 if isinstance(token_ids, int):
    548     token_ids = [token_ids]
--> 549 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    551 clean_up_tokenization_spaces = (
    552     clean_up_tokenization_spaces
    553     if clean_up_tokenization_spaces is not None
    554     else self.clean_up_tokenization_spaces
    555 )
    556 if clean_up_tokenization_spaces:

OverflowError: out of range integral type conversion attempted

That means you are just giving a huge number to decode. Is there a reason?

stas00 commented 1 year ago

Please note I've only relayed the errors reported on the PyTorch issue by a user trying to use torch.compile.

stefan-it commented 1 year ago

Hi guys,

I have the same problem with the run_seq2seq_qa.py script, and it turns out that preds are passed to the decode function with the following content:

[[     0 250099   1013 ...   -100   -100   -100]  
 [     0 250099   1013 ...   -100   -100   -100]                          
 [     0 250099   1013 ...   -100   -100   -100]                          
 ...                                                                                     
 [     0 250099    260 ...   -100   -100   -100]              
 [     0 250099    442 ...   -100   -100   -100]
 [     0 250099   3883 ...   -100   -100   -100]]

So the problematic thing here is -100 I guess, because I can reproduce the error with:

>>> tokenizer.decode(-100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/transformers/src/transformers/tokenization_utils_base.py", line 3485, in decode
    return self._decode(
  File "/home/ubuntu/transformers/src/transformers/tokenization_utils_fast.py", line 549, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
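
A minimal sketch of the usual workaround (assuming preds is the array above and tokenizer is the loaded tokenizer): map -100 back to the pad token before decoding.

import numpy as np

# -100 is only a loss-masking value; the tokenizer cannot decode it,
# so replace it with the real pad token id first
preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
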
ArthurZucker commented 1 year ago

Awesome, thanks for providing this! Indeed, these should be converted to padding.

stas00 commented 1 year ago

Could it be similar to this fix? https://github.com/huggingface/transformers/pull/18592 The hardcoded -100 doesn't seem to always do the right thing.

stas00 commented 1 year ago

I tried with another model arch and it breaks too, but in a different way. So eval is quite broken in many ways.

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python examples/pytorch/translation/run_translation.py \
--model_name_or_path 'facebook/wmt19-en-ru' --do_train --do_eval --source_lang en \
--target_lang de --source_prefix 'translate English to German: ' \
--dataset_name stas/wmt14-en-de-pre-processed --output_dir /tmp/tst-translation \
--num_train_epochs 1 --per_device_train_batch_size=1 --max_train_samples 10 \
--overwrite_output_dir --seed 1137 --per_device_eval_batch_size 1 \
--predict_with_generate --fp16 --max_eval_samples 10

Traceback (most recent call last):
  File "examples/pytorch/translation/run_translation.py", line 664, in <module>
    main()
  File "examples/pytorch/translation/run_translation.py", line 605, in main
    metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer_seq2seq.py", line 159, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 2993, in evaluate
    output = eval_loop(
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 3174, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer_seq2seq.py", line 290, in prediction_step
    outputs = model(**inputs)
  File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/fsmt/modeling_fsmt.py", line 1251, in forward
    masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.tgt_vocab_size), labels.view(-1))
  File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (56) to match target batch_size (48).
akshara-krishnam commented 1 year ago

@stas00 I am facing the same issue while fine-tuning t5-small using examples/pytorch/summarization/run_summarization.py, and I can see that preds contains -100, so decode fails with the error below:

Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 751, in <module> main()
  File "examples/pytorch/summarization/run_summarization.py", line 705, in main
    predict_results = trainer.predict(predict_dataset, metric_key_prefix="predict")
  File "src/transformers/trainer_seq2seq.py", line 216, in predict
    return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "src/transformers/trainer.py", line 3069, in predict
    output = eval_loop(
  File "src/transformers/trainer.py", line 3281, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "examples/pytorch/summarization/run_summarization.py", line 635, in compute_metrics
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
  File "src/transformers/tokenization_utils_base.py", line 3446, in batch_decode
    return [
  File "src//transformers/tokenization_utils_base.py", line 3447, in <listcomp>
    self.decode(
  File "src/transformers/tokenization_utils_base.py", line 3486, in decode
    return self._decode(
  File "src/transformers/tokenization_utils_fast.py", line 549, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
sgugger commented 1 year ago

The first issue is addressed in #22693

The second issue with FSMT is due to this line added by @gante. The decoder_input_ids not being passed to generate results in generations that have the same length as the inputs instead of the targets.

akshara-krishnam commented 1 year ago

@sgugger thanks for the fix. I can see the same issue in line 718 https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization.py#L718

Possible fix:

preds = np.where(predict_results.predictions != -100, predict_results.predictions, tokenizer.pad_token_id)
predictions = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
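
For completeness, the label side usually needs the same treatment before decoding; a sketch under the same assumptions (predict_results and tokenizer as above, with numpy imported as np):

labels = np.where(predict_results.label_ids != -100, predict_results.label_ids, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=True)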

sgugger commented 1 year ago

Good catch, adding this too in the PR.

sgugger commented 1 year ago

Thinking about it more, I think this is also a result of the recent changes in generate: it used to be the one padding the results with tokenizer.pad_token_id, and now it is the Trainer padding them with -100. cc @gante
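
A toy illustration of the difference (hypothetical tensors, not the actual Trainer code):

import torch

# two generated batches of different lengths
batch_1 = torch.tensor([[0, 42, 43, 1]])
batch_2 = torch.tensor([[0, 44, 1]])

# before: generate padded the returned sequences with tokenizer.pad_token_id
# now: the Trainer gathers the batches and pads the shorter ones with -100,
# which batch_decode cannot handle
batch_2_padded = torch.nn.functional.pad(batch_2, (0, 1), value=-100)
all_preds = torch.cat([batch_1, batch_2_padded], dim=0)
print(all_preds)  # tensor([[  0,  42,  43,   1], [  0,  44,   1, -100]])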

gante commented 1 year ago

Hey everyone -- the last issues should be gone with #22772, but feel free to comment/reopen if any related problem persists!

GabHoo commented 1 year ago

Hi! Since a couple of weeks ago I have also been stumbling on this error; it was working just fine before. I am pretty sure I have transformers installed from source, so the PR with the fix should be in there as well. I am using Bart-large and the Trainer class. I first define ROUGE as the training evaluation function:

    def compute_rouge(pred): 
        predictions, labels = pred
        #decode the predictions
        decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        #decode labels
        decode_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        #compute results
        res = rouge.compute(predictions=decode_predictions, references=decode_labels, use_stemmer=True)
        #get %
        return res

And pass it to the trainer:

    trainer = Seq2SeqTrainer(
        model,
        args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['valid'],
        data_collator=collator,
        tokenizer=tokenizer,
        compute_metrics=compute_rouge
    )

Then the script breaks in Trainer.train, while decoding for dev set evaluation:

Traceback (most recent call last):
  File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/finetunemodel.py", line 226, in <module>
    main(args)
  File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/finetunemodel.py", line 149, in main
    trainer.train()
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 2022, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 2288, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 159, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 2994, in evaluate
    output = eval_loop(
             ^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 3283, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/finetunemodel.py", line 103, in compute_rouge
    decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3456, in batch_decode
    return [
           ^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3457, in <listcomp>
    self.decode(
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3496, in decode
    return self._decode(
            ^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 549, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: out of range integral type conversion attempted

Interestingly enough, on a similarly formatted dataset (but with longer text), while using Longformer (LED) I get the same error, but this time at prediction time, so the training completes successfully:

Traceback (most recent call last):
  File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/LED_4_DWIE.py", line 236, in <module>
    main(args)
  File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/LED_4_DWIE.py", line 161, in main
    preds, labels, metrics = trainer.predict(tokenized_dataset['test'], num_beams=5, min_length=50, max_length=max_target, no_repeat_ngram_size=2, early_stopping=True)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 216, in predict
    return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 3070, in predict
    output = eval_loop(
             ^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 3283, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/LED_4_DWIE.py", line 103, in compute_rouge
    decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3456, in batch_decode
    return [
           ^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3457, in <listcomp>
    self.decode(
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3496, in decode
    return self._decode(
           ^^^^^^^^^^^^^
  File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 549, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: out of range integral type conversion attempted
gante commented 1 year ago

Hey @GabHoo -- could you share with us a short stand-alone script to reproduce the issue? :)

GabHoo commented 1 year ago

Thank you for your time. Here is a standalone version of the script; I hope it reproduces the issue.


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import os
from datasets import load_dataset
import numpy as np
from utils import *
import torch
import evaluate
import sys
import json
import time
import argparse

def tokenize_for_evaluation(tokenizer,preds,labels):

    predicted_text = []
    golden_labels = []

    for pred, label in zip(preds, labels):

        gen = tokenizer.decode(pred, skip_special_tokens=True)
        gen = str(gen)
        predicted_text.append(gen)

        gold = tokenizer.decode(label, skip_special_tokens=True)
        gold = str(gold)
        golden_labels.append(gold)

    return predicted_text, golden_labels

def process_data_BART(data_to_process,tokenizer,max_input,max_target,typeKG ):

    #get the dialogue text
    inputs = [graph for graph in data_to_process[f'{typeKG}']]
    #tokenize text
    model_inputs = tokenizer(inputs,  max_length=max_input, padding='max_length', truncation=True)

    #tokenize labels
    #with tokenizer.as_target_tokenizer():
    targets = [target for target in data_to_process['story']]
    model_targets = tokenizer(targets, max_length=max_target, padding='max_length', truncation=True)

    #returns input_ids, attention_mask, labels

    data_to_process["input_ids"] = model_inputs.input_ids
    data_to_process["attention_mask"] = model_inputs.attention_mask
    data_to_process["labels"] = model_targets.input_ids

    return data_to_process

datapath = '/datapath'
dataprefix ='pop'
typeKG = 'Instances_KG'
model_checkpoint="facebook/bart-base"
experiment_name = 'exp'
learning_rate =1e-4
batch_size = 1
epochs =3
save_model = False
max_target = 512
max_input = 512

train_file = datapath +'/' + dataprefix + '_train' + '.json'
dev_file = datapath +'/'+ dataprefix + '_dev' + '.json'
test_file = datapath +'/' + dataprefix + '_test'+ '.json'

print("Loading dataset from ",datapath)
dataset = load_dataset('json', data_files={'train': train_file, 'valid': dev_file, 'test': test_file})

todrop=list(set(dataset['test'].column_names)-set([typeKG,'story'])) #This line returns a list of all the columns to drop (all columns minus the ones we need (input typeKG and story))

print("Loading tokenizer")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,add_eos_token=True)

print("\nProcessing Dataset")
#the processing of the data is done in batches to make it faster, with 4 processes
tokenized_dataset = dataset.map(lambda example: process_data_BART(example, tokenizer,max_input,max_target,typeKG), batched=True, num_proc=4,remove_columns=todrop)

print("\nLoading MODEL")
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
#model.to(device)

print("Collator for batches")
collator = DataCollatorForSeq2Seq(tokenizer, model=model) #this is necessary for dividing into batches for training

print('Loading rouge')
rouge = evaluate.load('rouge')

def compute_rouge(pred): 
    predictions, labels = pred
    #decode the predictions
    decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    #decode labels
    decode_labels = tokenizer.batch_decode(labels, skip_special_tokens=True,clean_up_tokenization_spaces=True)

    #compute results
    res = rouge.compute(predictions=decode_predictions, references=decode_labels, use_stemmer=True)
    #get %
    return res

print("\nPREPARING FOR TRAINING...")

#defining training arguments
args = Seq2SeqTrainingArguments(
    experiment_name,
    evaluation_strategy='epoch',
    learning_rate=learning_rate, 
    per_device_train_batch_size= batch_size,
    per_device_eval_batch_size= batch_size,
    gradient_accumulation_steps=3, #compute gradient on n examples KG story 
    weight_decay=0.01, #regularization
    save_total_limit=1, #this is the max number of checkpoints saved, after which previous checkpoints are removed
    num_train_epochs=epochs, #number of epochs
    predict_with_generate=True, 
    generation_max_length = 512, #max number of tokens per generation 
    generation_num_beams=5, #decoding strategy! greedy search, beam search 
    eval_accumulation_steps=1, #move the eval predictions to the CPU every step to save GPU memory
    fp16=True, #memory management
    disable_tqdm=True)
#only CUDA available -> fp16=True

### almost training time
trainer = Seq2SeqTrainer(
    model, 
    args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['valid'],
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_rouge
)

trainer.train()

if save_model:
    print("Saving model")
    trainer.save_model(experiment_name+"/saved_model")

print("\nPREDICTING..")
preds, labels, metrics = trainer.predict(tokenized_dataset['test'], num_beams=5, min_length=50, max_length=512, no_repeat_ngram_size=2, early_stopping=True)

predicted_text,golden_labels=tokenize_for_evaluation(tokenizer,preds,labels)

#by this point the error has already occurred
print("\nRESULT SCORES:")

scores = metrics.items()
print(f'Results: {scores}')

The data looks like the following; substitute the folder in datapath:

    {
        "story": "Baymax is a character from the film Big Hero 6 starring Scott Adsit. He was created by Steven T Seagle and the American, Duncan Rouleau.",
        "Types_KG": "[CORE] Baymax is a character from the film Big Hero 6 [TRIPLES] Duncan Rouleau - nationality - Americans | Baymax - creators - Duncan Rouleau | Baymax - creator - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit | Baymax - creator - Duncan Rouleau | Duncan Rouleau - nationality - Americans | Baymax - creators - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit | Scott Adsit - type - person | Americans - type - ethnic group | Steven T. Seagle - type - person | Duncan Rouleau - type - person | Big Hero 6 (film) - type - person",
        "Instances_KG": "[CORE] Baymax is a character from the film Big Hero 6 [TRIPLES] Duncan Rouleau - nationality - Americans | Baymax - creators - Duncan Rouleau | Baymax - creator - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit | Baymax - creator - Duncan Rouleau | Duncan Rouleau - nationality - Americans | Baymax - creators - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit",
        "
gante commented 1 year ago

@GabHoo I'm afraid you will have to share a complete data example or another script; the current instructions fail at data loading time if I create a file as specified (ArrowInvalid: JSON parse error: Missing a name for object member. in row 0).

Pavloveuge commented 1 year ago

@GabHoo Hello, I had the same problem and I think the problem is in DataCollatorForSeq2Seq, more specifically in label_pad_token_id. The collator uses label_pad_token_id = -100, but your tokenizer uses a different value (tokenizer.pad_token_id = 1). Can you try collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=tokenizer.pad_token_id)?

Pavloveuge commented 1 year ago

Hey @gante, I think the behavior of DataCollatorForSeq2Seq is really unexpected. Why does it require label_pad_token_id if it can use tokenizer.pad_token_id, as it does with padding_side?

gante commented 1 year ago

Hey @Pavloveuge -- the label padding triggers a different behavior at train time (if my memory does not fail me, the loss is ignored for that token)
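
A minimal sketch of why the -100 label padding matters at train time (assuming the standard PyTorch cross-entropy setup that transformers models use for their loss):

import torch

logits = torch.randn(2, 4, 32)              # (batch, seq_len, vocab_size)
labels = torch.tensor([[5, 7, 1, -100],     # -100 marks padded label positions
                       [3, 1, -100, -100]])

# CrossEntropyLoss ignores positions labelled -100 by default (ignore_index=-100),
# so padded label positions contribute nothing to the loss; padding the labels with
# the real pad token id would instead include them in the loss
loss_fct = torch.nn.CrossEntropyLoss()
loss = loss_fct(logits.view(-1, 32), labels.view(-1))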

Pavloveuge commented 1 year ago

Oh yeah, you're right, but this behavior still results in an error, and it doesn't matter which version of the tokenizer I use (fast or not). With use_fast=False: TypeError: sequence item 9: expected str instance, NoneType found; with use_fast=True: OverflowError: out of range integral type conversion attempted.

gante commented 1 year ago

@Pavloveuge that sounds like a bug indeed :) Would you be able to share a short stand-alone script to reproduce the issue?

Pavloveuge commented 1 year ago

@gante Should I open new issue or reopen this?

gante commented 1 year ago

@Pavloveuge A new issue would be preferable πŸ‘