huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Facing issues when trying to fine-tune T5 #28111

Closed wolfassi123 closed 8 months ago

wolfassi123 commented 9 months ago

System Info

Who can help?

@ArthurZucker @youne

Information

Tasks

Reproduction

I am trying to fine-tune a T5-base model but have been facing issues, despite following the step-by-step guide found on the Hugging Face Hub here.

So far, this is my code:

import transformers

transformers.logging.set_verbosity_error()

from datasets import load_dataset

canard_train_augm = load_dataset("gaussalgo/Canard_Wiki-augmented", split="train")
canard_test_augm = load_dataset("gaussalgo/Canard_Wiki-augmented", split="test")

from transformers import AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_function(examples):
    combined_input = examples["Question"] + ": " + examples["true_contexts"]
    return tokenizer(combined_input, examples["Rewrite"], max_length=512, padding="max_length", truncation=True, return_tensors="pt")

tokenized_train = canard_train_augm.map(preprocess_function)
tokenized_test = canard_test_augm.map(preprocess_function)

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)

import evaluate

metric = evaluate.load("sacrebleu")

import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="wtf",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

I tried several variants, including my own customized Trainer subclass, but always ended up with the same issue, even when running the exact code from the step-by-step guide provided by Hugging Face.

The error happens when calling trainer.train(), which raises the following: ValueError: too many values to unpack (expected 2)

I followed the exact same format as the documentation. I believe something is going wrong when the loss is computed, but I was unable to put my finger on it. If anyone can help, that would be great.

Expected behavior

Being able to fine-tune the T5 model on the above dataset, by identifying and eliminating the cause of the error.

ArthurZucker commented 9 months ago

Hey, can you share the exact traceback to debug this?

wolfassi123 commented 9 months ago

> Hey, can you share the exact traceback to debug this?

Sure thing!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-17-cb38b76c7066>](https://localhost:8080/#) in <cell line: 24>()
     22 )
     23 
---> 24 trainer.train()

12 frames
[/usr/local/lib/python3.10/dist-packages/transformers/trainer.py](https://localhost:8080/#) in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1553                 hf_hub_utils.enable_progress_bars()
   1554         else:
-> 1555             return inner_training_loop(
   1556                 args=args,
   1557                 resume_from_checkpoint=resume_from_checkpoint,

[/usr/local/lib/python3.10/dist-packages/transformers/trainer.py](https://localhost:8080/#) in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1858 
   1859                 with self.accelerator.accumulate(model):
-> 1860                     tr_loss_step = self.training_step(model, inputs)
   1861 
   1862                 if (

[/usr/local/lib/python3.10/dist-packages/transformers/trainer.py](https://localhost:8080/#) in training_step(self, model, inputs)
   2723 
   2724         with self.compute_loss_context_manager():
-> 2725             loss = self.compute_loss(model, inputs)
   2726 
   2727         if self.args.n_gpu > 1:

[/usr/local/lib/python3.10/dist-packages/transformers/trainer.py](https://localhost:8080/#) in compute_loss(self, model, inputs, return_outputs)
   2746         else:
   2747             labels = None
-> 2748         outputs = model(**inputs)
   2749         # Save past state if it exists
   2750         # TODO: this needs to be fixed and made cleaner later.

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1519 
   1520     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1528 
   1529         try:

[/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py](https://localhost:8080/#) in forward(*args, **kwargs)
    678 
    679     def forward(*args, **kwargs):
--> 680         return model_forward(*args, **kwargs)
    681 
    682     # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`

[/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py](https://localhost:8080/#) in __call__(self, *args, **kwargs)
    666 
    667     def __call__(self, *args, **kwargs):
--> 668         return convert_to_fp32(self.model_forward(*args, **kwargs))
    669 
    670     def __getstate__(self):

[/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py](https://localhost:8080/#) in decorate_autocast(*args, **kwargs)
     14     def decorate_autocast(*args, **kwargs):
     15         with autocast_instance:
---> 16             return func(*args, **kwargs)
     17 
     18     decorate_autocast.__script_unsupported = "@autocast() decorator is not supported in script mode"  # type: ignore[attr-defined]

[/usr/local/lib/python3.10/dist-packages/transformers/models/t5/modeling_t5.py](https://localhost:8080/#) in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1707         if encoder_outputs is None:
   1708             # Convert encoder inputs in embeddings if needed
-> 1709             encoder_outputs = self.encoder(
   1710                 input_ids=input_ids,
   1711                 attention_mask=attention_mask,

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1519 
   1520     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1528 
   1529         try:

[/usr/local/lib/python3.10/dist-packages/transformers/models/t5/modeling_t5.py](https://localhost:8080/#) in forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1016             inputs_embeds = self.embed_tokens(input_ids)
   1017 
-> 1018         batch_size, seq_length = input_shape
   1019 
   1020         # required mask seq length can be calculated via length of past

ValueError: too many values to unpack (expected 2)

ArthurZucker commented 9 months ago

I placed a breakpoint in your code; there is an issue with the inputs:

inputs["input_ids"].shape
torch.Size([16, 1, 512])

There is an extra dimension, which probably comes from the way the dataset is processed / the data collator!
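
For illustration, a minimal sketch (not from the original thread) of where that dimension comes from: calling the tokenizer with return_tensors="pt" on a single example already returns a 2-D tensor of shape (1, 512), and once datasets.map stores one such tensor per row, the collator stacks 16 of them into (16, 1, 512).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# A single (non-batched) example, tokenized the way preprocess_function does it:
enc = tokenizer(
    "Who wrote Hamlet?",
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 512]) -- the stray leading dim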

ArthurZucker commented 9 months ago

The following code fixed it:

def preprocess_function(examples):
    combined_input = examples["Question"] + ": " + examples["true_contexts"]
    model_inputs = tokenizer(combined_input, max_length=512, padding="max_length", truncation=True)

    labels = tokenizer(text_target=examples["Rewrite"], max_length=512, padding="max_length", truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
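
If it helps, a usage sketch under the same setup (assuming the same dataset objects and column names as in the original code): re-map the datasets with the corrected function and drop the raw text columns so that only token ids reach the data collator.

tokenized_train = canard_train_augm.map(
    preprocess_function, remove_columns=canard_train_augm.column_names
)
tokenized_test = canard_test_augm.map(
    preprocess_function, remove_columns=canard_test_augm.column_names
)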

kylefoley76 commented 9 months ago

Hi Wolfassi, actually I'm writing to talk to you about your work in Arabic OCR. I've been trying to do some Arabic OCR but so far I can only get about a 95% accuracy rate. Have you been able to do any better than that, and if so, how?

wolfassi123 commented 8 months ago

> Hi Wolfassi, actually I'm writing to talk to you about your work in Arabic OCR. I've been trying to do some Arabic OCR but so far I can only get about a 95% accuracy rate. Have you been able to do any better than that, and if so, how?

Hello there. Yes, I have previously worked on Arabic OCR, and no, to be honest, I did not achieve an accuracy that high. I believe an accuracy of 95% is just too high a target. I tested both EasyOCR and Tesseract. Tesseract seemed to perform best once you fine-tune the model for the specific font you are using. I would also suggest trying out PaddlePaddle.
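
In case it is useful as a starting point, a minimal sketch of running Tesseract on Arabic text through the pytesseract bindings (this assumes the Arabic traineddata is installed; the file name is hypothetical):

from PIL import Image
import pytesseract

# lang="ara" selects the Arabic language pack installed alongside Tesseract.
text = pytesseract.image_to_string(Image.open("page.png"), lang="ara")
print(text)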

Sauce16 commented 6 months ago

Hi @wolfassi123, I am facing the below error while fine-tuning; could you please help?

code:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./roberta-retrained",
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=cyberspecificmodel,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./roberta_cybersecurity")

**ERROR**
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-8-a5adcccd61c2>](https://localhost:8080/#) in <cell line: 12>()
     10 
     11 # Initialize the Trainer
---> 12 trainer = Trainer(
     13     model=cyberspecificmodel,
     14     args=training_args,

1 frames
[/usr/local/lib/python3.10/dist-packages/transformers/trainer.py](https://localhost:8080/#) in create_accelerator_and_postprocess(self)
   4125                 accelerator_kwargs = AcceleratorConfig(**accelerator_kwargs).to_dict()
   4126 
-> 4127         self.accelerator = Accelerator(
   4128             deepspeed_plugin=self.args.deepspeed_plugin,
   4129             gradient_accumulation_plugin=gradient_accumulation_plugin,

TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'

ArthurZucker commented 6 months ago

@Sauce16 this seems like an unrelated issue. Please open a new issue with a reproducer if you want help, and make sure it is still failing with the latest versions!
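
For reference, a TypeError like the one above usually means the installed accelerate predates the transformers release that started passing use_seedable_sampler to Accelerator. A quick way to check the installed versions (a minimal sketch):

import transformers, accelerate

print(transformers.__version__, accelerate.__version__)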