cc @ArthurZucker Seems like the model and tokenizer have mismatched lengths.
Yeah, but this comes down to the unk_token: when you go above 32099, the fast tokenizer simply outputs '' while the slow one outputs '<extra_id_-29>' (which is a bit strange, I'll give you that).
Snippet:
>>> from transformers import T5Tokenizer, T5TokenizerFast
>>> tokenizer_slow = T5Tokenizer.from_pretrained("t5-base")
>>> tokenizer_slow.decode(32140) # above vocab size
'<extra_id_-3167901>'
>>> tokenizer_fast = T5TokenizerFast.from_pretrained("t5-base")
>>> tokenizer_fast.decode(32140) # above vocab size
''
The issue is different. This is an integer overflow in Rust:
>>> tokenizer_fast.decode(3200000000000)
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[29], line 1
----> 1 tokenizer_fast.decode(3200000000000)
File ~/Work/transformers/src/transformers/tokenization_utils_base.py:3485, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3482 # Convert inputs to python lists
   3483 token_ids = to_py_obj(token_ids)
-> 3485 return self._decode(
   3486     token_ids=token_ids,
   3487     skip_special_tokens=skip_special_tokens,
   3488     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3489     **kwargs,
   3490 )
File ~/Work/transformers/src/transformers/tokenization_utils_fast.py:549, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    547 if isinstance(token_ids, int):
    548     token_ids = [token_ids]
--> 549 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    551 clean_up_tokenization_spaces = (
    552     clean_up_tokenization_spaces
    553     if clean_up_tokenization_spaces is not None
    554     else self.clean_up_tokenization_spaces
    555 )
    556 if clean_up_tokenization_spaces:
OverflowError: out of range integral type conversion attempted
That means you are just giving a huge number to decode. Is there a reason?
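For illustration, a minimal sketch of a guard one could put in front of decode (this is not something the library does for you, just a hypothetical check; t5-base is assumed):

from transformers import T5TokenizerFast

tokenizer_fast = T5TokenizerFast.from_pretrained("t5-base")
token_ids = [0, 32099, 3200000000000]
# the Rust backend only accepts non-negative ids it can represent, so drop anything
# outside the tokenizer's known range before decoding
safe_ids = [i for i in token_ids if 0 <= i < len(tokenizer_fast)]
# decodes without raising; the remaining ids here are special tokens, so the output is just ''
print(tokenizer_fast.decode(safe_ids, skip_special_tokens=True))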
Please note I've only relayed the errors reported on the PyTorch issue by a user trying to use torch.compile.
Hi guys,
I have the same problem with the run_seq2seq_qa.py script, and it turns out that preds are passed to the decode function with the following content:
[[ 0 250099 1013 ... -100 -100 -100]
[ 0 250099 1013 ... -100 -100 -100]
[ 0 250099 1013 ... -100 -100 -100]
...
[ 0 250099 260 ... -100 -100 -100]
[ 0 250099 442 ... -100 -100 -100]
[ 0 250099 3883 ... -100 -100 -100]]
So the problematic thing here is -100, I guess, because I can reproduce the error with:
>>> tokenizer.decode(-100)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/transformers/src/transformers/tokenization_utils_base.py", line 3485, in decode
return self._decode(
File "/home/ubuntu/transformers/src/transformers/tokenization_utils_fast.py", line 549, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Awesome, thanks for providing this! Indeed, these should be converted to padding.
Could it be similar to this fix? https://github.com/huggingface/transformers/pull/18592 The hardcoded -100 doesn't seem to always do the right thing.
I tried with another model arch and it breaks too, but in a different way. So eval is quite broken in many ways.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python examples/pytorch/translation/run_translation.py --model_name_or_path 'facebook/wmt19-en-ru' --do_train --do_eval --source_lang en --target_lang de --source_prefix 'translate English to German: ' --dataset_name stas/wmt14-en-de-pre-processed --output_dir /tmp/tst-translation --num_train_epochs 1 --per_device_train_batch_size=1 --max_train_samples 10 --overwrite_output_dir --seed 1137 --per_device_eval_batch_size 1 --predict_with_generate --fp16 --max_eval_samples 10
Traceback (most recent call last):
File "examples/pytorch/translation/run_translation.py", line 664, in <module>
main()
File "examples/pytorch/translation/run_translation.py", line 605, in main
metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer_seq2seq.py", line 159, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 2993, in evaluate
output = eval_loop(
File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 3174, in evaluation_loop
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer_seq2seq.py", line 290, in prediction_step
outputs = model(**inputs)
File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/fsmt/modeling_fsmt.py", line 1251, in forward
masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.tgt_vocab_size), labels.view(-1))
File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/stas/anaconda3/envs/py38-pt20/lib/python3.8/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (56) to match target batch_size (48).
@stas00 I am facing the same issue while fine-tuning t5-small using examples/pytorch/summarization/run_summarization.py. I can see preds contains -100, and so decode fails with the below error:
Traceback (most recent call last):
File "examples/pytorch/summarization/run_summarization.py", line 751, in <module> main()
File "examples/pytorch/summarization/run_summarization.py", line 705, in main
predict_results = trainer.predict(predict_dataset, metric_key_prefix="predict")
File "src/transformers/trainer_seq2seq.py", line 216, in predict
return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "src/transformers/trainer.py", line 3069, in predict
output = eval_loop(
File "src/transformers/trainer.py", line 3281, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "examples/pytorch/summarization/run_summarization.py", line 635, in compute_metrics
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
File "src/transformers/tokenization_utils_base.py", line 3446, in batch_decode
return [
File "src//transformers/tokenization_utils_base.py", line 3447, in <listcomp>
self.decode(
File "src/transformers/tokenization_utils_base.py", line 3486, in decode
return self._decode(
File "src/transformers/tokenization_utils_fast.py", line 549, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
The first issue is addressed in #22693
The second issue with FSMT is due to this line added by @gante. The decoder_input_ids not being passed to generate results in generations that have the same length as the inputs and not the targets.
@sgugger thanks for the fix. I can see the same issue in line 718 https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization.py#L718
possible fix:
preds = np.where(predict_results.predictions != -100, predict_results.predictions, tokenizer.pad_token_id)
predictions = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
Good catch, adding this too in the PR.
Thinking more, I think this is also a result of the recent changes in generate, which used to be the one padding the result with tokenizer.pad_token_id, and it's now the Trainer padding them with -100. cc @gante
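In the meantime, the masking pattern from the possible fix above can be applied to both predictions and labels inside the metric function before decoding. A rough sketch (it assumes a tokenizer and a loaded metric object are in scope; the function and variable names are illustrative, not taken from the example scripts):

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # the Trainer may pad generations and labels with -100, which the tokenizer cannot decode,
    # so swap those positions for the pad token id first
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return metric.compute(predictions=decoded_preds, references=decoded_labels)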
Hey everyone -- the last issues should be gone with #22772, but feel free to comment/reopen if any related problem persists!
Hi! Since a couple of weeks ago I have also been stumbling on this error. It was working just fine before. I am pretty sure I have transformers installed from source, so the PR with the fix is there as well. I am using Bart-large and the Trainer class. I first define ROUGE as the training evaluation function:
def compute_rouge(pred):
    predictions, labels = pred
    # decode the predictions
    decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # decode labels
    decode_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # compute results
    res = rouge.compute(predictions=decode_predictions, references=decode_labels, use_stemmer=True)
    # get %
    return res
And give it to the trainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['valid'],
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_rouge
)
Then the script breaks in Trainer.train, while decoding for dev set evaluation:
Traceback (most recent call last):
File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/finetunemodel.py", line 226, in <module>
main(args)
File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/finetunemodel.py", line 149, in main
trainer.train()
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 2022, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 2288, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 159, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 2994, in evaluate
output = eval_loop(
^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 3283, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/finetunemodel.py", line 103, in compute_rouge
decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3456, in batch_decode
return [
^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3457, in <listcomp>
self.decode(
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3496, in decode
return self._decode(
^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 549, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: out of range integral type conversion attempted
Interestingly enough, on a similarly formatted dataset (but with longer text), while using Longformer (LED), I get the same error but this time at prediction time, so the training completes successfully:
Traceback (most recent call last):
File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/LED_4_DWIE.py", line 236, in <module>
main(args)
File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/LED_4_DWIE.py", line 161, in main
preds, labels, metrics = trainer.predict(tokenized_dataset['test'], num_beams=5, min_length=50, max_length=max_target, no_repeat_ngram_size=2, early_stopping=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 216, in predict
return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 3070, in predict
output = eval_loop(
^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/trainer.py", line 3283, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/kg2Narrative/KGNarrative2/script4trainingLLM/LED_4_DWIE.py", line 103, in compute_rouge
decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3456, in batch_decode
return [
^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3457, in <listcomp>
self.decode(
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3496, in decode
return self._decode(
^^^^^^^^^^^^^
File "/home/ghoogerw/.conda/envs/kg2Narrative/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 549, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: out of range integral type conversion attempted
Hey @GabHoo -- could you share with us a short stand-alone script to reproduce the issue? :)
Thank you for your time. Here is a standalone version of the script. I hope it is what you need:
from transformers import AutoTokenizer,AutoModelForSeq2SeqLM,DataCollatorForSeq2Seq,Seq2SeqTrainingArguments,Seq2SeqTrainer
import os
from datasets import load_dataset
import numpy as np
from utils import *
import torch
import evaluate
import sys
import json
import time
import argparse
def tokenize_for_evaluation(tokenizer, preds, labels):
    predicted_text = []
    golden_labels = []
    for pred, label in zip(preds, labels):
        gen = tokenizer.decode(pred, skip_special_tokens=True)
        gen = str(gen)
        predicted_text.append(gen)
        gold = tokenizer.decode(label, skip_special_tokens=True)
        gold = str(gold)
        golden_labels.append(gold)
    return predicted_text, golden_labels
def process_data_BART(data_to_process, tokenizer, max_input, max_target, typeKG):
    # get the input text
    inputs = [graph for graph in data_to_process[f'{typeKG}']]
    # tokenize text
    model_inputs = tokenizer(inputs, max_length=max_input, padding='max_length', truncation=True)
    # tokenize labels
    # with tokenizer.as_target_tokenizer():
    targets = [target for target in data_to_process['story']]
    model_targets = tokenizer(targets, max_length=max_target, padding='max_length', truncation=True)
    # returns input_ids, attention_mask, labels
    data_to_process["input_ids"] = model_inputs.input_ids
    data_to_process["attention_mask"] = model_inputs.attention_mask
    data_to_process["labels"] = model_targets.input_ids
    return data_to_process
datapath = '/datapath'
dataprefix = 'pop'
typeKG = 'Instances_KG'
model_checkpoint = "facebook/bart-base"
experiment_name = 'exp'
learning_rate = 1e-4
batch_size = 1
epochs = 3
save_model = False
max_target = 512
max_input = 512
train_file = datapath +'/' + dataprefix + '_train' + '.json'
dev_file = datapath +'/'+ dataprefix + '_dev' + '.json'
test_file = datapath +'/' + dataprefix + '_test'+ '.json'
print("Loading dataset from ",datapath)
dataset = load_dataset('json', data_files={'train': train_file, 'valid': dev_file, 'test': test_file})
todrop=list(set(dataset['test'].column_names)-set([typeKG,'story'])) #This line returns a list of all the columns to drop (all columns minus the ones we need (input typeKG and story))
print("Loading tokenizer")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,add_eos_token=True)
print("\nProcessing Dataset")
#the processing of the data is done in batches to make it faster, with 4 processes
tokenized_dataset = dataset.map(lambda example: process_data_BART(example, tokenizer,max_input,max_target,typeKG), batched=True, num_proc=4,remove_columns=todrop)
print("\nLoading MODEL")
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
#model.to(device)
print("Collator for batches")
collator = DataCollatorForSeq2Seq(tokenizer, model=model) #this is necessary for dividing into batches for training
print('Loading rouge')
rouge = evaluate.load('rouge')
def compute_rouge(pred):
    predictions, labels = pred
    # decode the predictions
    decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # decode labels
    decode_labels = tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    # compute results
    res = rouge.compute(predictions=decode_predictions, references=decode_labels, use_stemmer=True)
    # get %
    return res
print("\nPREPARING FOR TRAINING...")
#defining training arguments
args = Seq2SeqTrainingArguments(
    experiment_name,
    evaluation_strategy='epoch',
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=3,  # compute gradient on n examples KG story
    weight_decay=0.01,  # regularization
    save_total_limit=1,  # max number of checkpoints kept, after which previous checkpoints are removed
    num_train_epochs=epochs,  # number of epochs
    predict_with_generate=True,
    generation_max_length=512,  # max number of tokens per generation
    generation_num_beams=5,  # decoding strategy: beam search
    eval_accumulation_steps=1,  # move eval predictions to the CPU every step
    fp16=True,  # memory management
    disable_tqdm=True)
# fp16=True only if CUDA is available
### almost training time
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['valid'],
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_rouge
)
trainer.train()
if save_model:
    print("Saving model")
    trainer.save_model(experiment_name + "/saved_model")
print("\nPREDICTING..")
preds, labels, metrics = trainer.predict(tokenized_dataset['test'], num_beams=5, min_length=50, max_length=512, no_repeat_ngram_size=2, early_stopping=True)
predicted_text,golden_labels=tokenize_for_evaluation(tokenizer,preds,labels)
#by this point the script has already hit the error
print("\nRESULT SCORES:")
scores = metrics.items()
print(f'Results: {scores}')
The data looks like the following; substitute the folder in datapath:
{
"story": "Baymax is a character from the film Big Hero 6 starring Scott Adsit. He was created by Steven T Seagle and the American, Duncan Rouleau.",
"Types_KG": "[CORE] Baymax is a character from the film Big Hero 6 [TRIPLES] Duncan Rouleau - nationality - Americans | Baymax - creators - Duncan Rouleau | Baymax - creator - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit | Baymax - creator - Duncan Rouleau | Duncan Rouleau - nationality - Americans | Baymax - creators - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit | Scott Adsit - type - person | Americans - type - ethnic group | Steven T. Seagle - type - person | Duncan Rouleau - type - person | Big Hero 6 (film) - type - person",
"Instances_KG": "[CORE] Baymax is a character from the film Big Hero 6 [TRIPLES] Duncan Rouleau - nationality - Americans | Baymax - creators - Duncan Rouleau | Baymax - creator - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit | Baymax - creator - Duncan Rouleau | Duncan Rouleau - nationality - Americans | Baymax - creators - Steven T. Seagle | Baymax - series - Big Hero 6 (film) | Big Hero 6 (film) - starring - Scott Adsit",
"
@GabHoo I'm afraid you will have to share a complete data example or another script; the current instructions fail at data loading time if I create a file as specified (ArrowInvalid: JSON parse error: Missing a name for object member. in row 0).
@GabHoo Hello, I had the same problem and I think the problem is in DataCollatorForSeq2Seq, more specifically in label_pad_token_id. The collator uses label_pad_token_id = -100, but your tokenizer uses a different value (tokenizer.pad_token_id = 1).
Can you try?
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=tokenizer.pad_token_id)
Hey @gante, I think the behavior of DataCollatorForSeq2Seq is really unexpected. Why does it require label_pad_token_id if it could use tokenizer.pad_token_id, as it does with padding_side?
Hey @Pavloveuge -- the label padding triggers a different behavior at train time (if my memory does not fail me, the loss is ignored for that token)
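For reference, a tiny sketch of the mechanism being described, using plain PyTorch rather than the Trainer: the cross-entropy loss ignores positions labelled -100 by default, which is why the labels are padded with -100 rather than with the tokenizer's pad token id.

import torch

loss_fct = torch.nn.CrossEntropyLoss()  # ignore_index defaults to -100
logits = torch.randn(3, 10)             # 3 positions, vocabulary of 10
labels = torch.tensor([4, 7, -100])     # the last position is padding and contributes nothing to the loss
print(loss_fct(logits, labels))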
Oh yeah, you're right, but this behavior still results in an error. And it doesn't matter which version of the tokenizer I use (fast or not).
With use_fast=False:
TypeError: sequence item 9: expected str instance, NoneType found
With use_fast=True:
OverflowError: out of range integral type conversion attempted
@Pavloveuge that sounds like a bug indeed :) Would you be able to share a short stand-alone script to reproduce the issue?
@gante Should I open new issue or reopen this?
@Pavloveuge A new issue would be preferable.
Splitting off from https://github.com/huggingface/transformers/issues/22571, as it was a secondary problem reported there:
Reproduction
fails inside eval:
@sgugger