huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

OverflowError: out of range integral type conversion attempted #31110

Open · manandey opened this issue 1 month ago

manandey commented 1 month ago

System Info

Transformers version: 4.41.1
Python version: 3.10

Who can help?

@younesbelkada @ArthurZucker

Reproduction

When I execute the code in the Colab notebook, I get the error: `OverflowError: out of range integral type conversion attempted`.

This happens when I add the line `model.generation_config.max_new_tokens = max_target_length`.

Otherwise I get the warning: `UserWarning: Using the model-agnostic default max_length (=20) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.`

https://colab.research.google.com/drive/11R2MMK9nq0oe7xUXSQaW8tahhgR9cjNc?usp=sharing
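For context, a minimal sketch of the setup that triggers this; the checkpoint name and target length are placeholders, not taken from the notebook:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder seq2seq checkpoint; the notebook may use a different model.
checkpoint = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

max_target_length = 128  # illustrative value

# The line that appears to trigger the OverflowError during evaluation:
model.generation_config.max_new_tokens = max_target_length
```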

cc. @younesbelkada @ArthurZucker

Expected behavior

The code should run without raising the OverflowError.

LysandreJik commented 1 month ago

Hey @manandey, could you please put the full error message here? We're only getting the final line and it's not sufficient to debug this efficiently. Thanks.

manandey commented 1 month ago

> Hey @manandey, could you please put the full error message here? We're only getting the final line and it's not sufficient to debug this efficiently. Thanks.

Sure, here it is, @LysandreJik.


```
OverflowError                             Traceback (most recent call last)
Cell In[15], line 1
----> 1 trainer.train()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1780, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1778         hf_hub_utils.enable_progress_bars()
   1779 else:
-> 1780     return inner_training_loop(
   1781         args=args,
   1782         resume_from_checkpoint=resume_from_checkpoint,
   1783         trial=trial,
   1784         ignore_keys_for_eval=ignore_keys_for_eval,
   1785     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2193, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2190     self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch
   2191     self.control = self.callback_handler.on_step_end(args, self.state, self.control)
-> 2193     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
   2194 else:
   2195     self.control = self.callback_handler.on_substep_end(args, self.state, self.control)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2577, in Trainer._maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
   2575 metrics = None
   2576 if self.control.should_evaluate:
-> 2577     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
   2578     self._report_to_hp_search(trial, self.state.global_step, metrics)
   2580     # Run delayed LR scheduler now that metrics are populated

File /opt/conda/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:180, in Seq2SeqTrainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, **gen_kwargs)
    178 self.gather_function = self.accelerator.gather
    179 self._gen_kwargs = gen_kwargs
--> 180 return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:3365, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   3362 start_time = time.time()
   3364 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3365 output = eval_loop(
   3366     eval_dataloader,
   3367     description="Evaluation",
   3368     # No point gathering the predictions if there are no metrics, otherwise we defer to
   3369     # self.args.prediction_loss_only
   3370     prediction_loss_only=True if self.compute_metrics is None else None,
   3371     ignore_keys=ignore_keys,
   3372     metric_key_prefix=metric_key_prefix,
   3373 )
   3375 total_batch_size = self.args.eval_batch_size * self.args.world_size
   3376 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:3656, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   3652         metrics = self.compute_metrics(
   3653             EvalPrediction(predictions=all_preds, label_ids=all_labels, inputs=all_inputs)
   3654         )
   3655     else:
-> 3656         metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
   3657 else:
   3658     metrics = {}

Cell In[13], line 9, in compute_metrics(pred)
      6 labels_ids = pred.label_ids
      7 pred_ids = pred.predictions
----> 9 pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
     10 labels_ids[labels_ids == -100] = tokenizer.pad_token_id
     11 label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3785, in PreTrainedTokenizerBase.batch_decode(self, sequences, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3761 def batch_decode(
   3762     self,
   3763     sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
   (...)
   3766     **kwargs,
   3767 ) -> List[str]:
   3768     """
   3769     Convert a list of lists of token ids into a list of strings by calling decode.
   3770 
   (...)
   3783         `List[str]`: The list of decoded sentences.
   3784     """
-> 3785     return [
   3786         self.decode(
   3787             seq,
   3788             skip_special_tokens=skip_special_tokens,
   3789             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3790             **kwargs,
   3791         )
   3792         for seq in sequences
   3793     ]

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3786, in <listcomp>(.0)
   3761 def batch_decode(
   3762     self,
   3763     sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
   (...)
   3766     **kwargs,
   3767 ) -> List[str]:
   3768     """
   3769     Convert a list of lists of token ids into a list of strings by calling decode.
   3770 
   (...)
   3783         `List[str]`: The list of decoded sentences.
   3784     """
   3785     return [
-> 3786         self.decode(
   3787             seq,
   3788             skip_special_tokens=skip_special_tokens,
   3789             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3790             **kwargs,
   3791         )
   3792         for seq in sequences
   3793     ]

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3825, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3822 # Convert inputs to python lists
   3823 token_ids = to_py_obj(token_ids)
-> 3825 return self._decode(
   3826     token_ids=token_ids,
   3827     skip_special_tokens=skip_special_tokens,
   3828     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3829     **kwargs,
   3830 )

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:625, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    623 if isinstance(token_ids, int):
    624     token_ids = [token_ids]
--> 625 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    627 clean_up_tokenization_spaces = (
    628     clean_up_tokenization_spaces
    629     if clean_up_tokenization_spaces is not None
    630     else self.clean_up_tokenization_spaces
    631 )
    632 if clean_up_tokenization_spaces:

OverflowError: out of range integral type conversion attempted
```
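For what it's worth, the final frame can also be reproduced in isolation; a minimal sketch, assuming any fast (Rust-backed) tokenizer such as `t5-small`:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any fast tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# The fast tokenizer converts token ids to unsigned integers, so a negative id
# such as the -100 used to mask labels cannot be converted and raises:
# OverflowError: out of range integral type conversion attempted
tokenizer.decode([-100])
```
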
younesbelkada commented 1 month ago

Hi @manandey, this is likely because `pred_ids` may contain negative values due to token masking. Can you print what is inside `pred_ids`? You might need to do something close to:

```diff
  pred_ids = pred.predictions
+ pred_ids[pred_ids == -100] = tokenizer.pad_token_id
  pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
  labels_ids[labels_ids == -100] = tokenizer.pad_token_id
```

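For reference, a complete `compute_metrics` along these lines might look like the sketch below. The use of `evaluate.load("rouge")` is an illustrative assumption (the notebook's metric may differ), and `tokenizer` is expected to be defined in the enclosing scope, as in the original function:

```python
import numpy as np
import evaluate  # assumption: the `evaluate` library is installed

rouge = evaluate.load("rouge")  # illustrative metric choice

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Replace the -100 used for masking with the pad token id so the fast
    # tokenizer never receives negative ids (the cause of the OverflowError).
    pred_ids = np.where(pred_ids == -100, tokenizer.pad_token_id, pred_ids)
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    labels_ids = np.where(labels_ids == -100, tokenizer.pad_token_id, labels_ids)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    return rouge.compute(predictions=pred_str, references=label_str)
```
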
manandey commented 1 month ago

Thanks @younesbelkada!

manandey commented 1 week ago

@younesbelkada I am trying to update the script to work on a Colab TPU, but it does not seem to be working. Could you kindly take a look and suggest whether I am doing anything wrong? Thanks! https://colab.research.google.com/drive/16UYvGbMkX5laJZVwujz-GP4EnIWBskQD?usp=sharing

manandey commented 1 week ago

@younesbelkada it would be great if you could kindly help a bit with this. Thanks!

amyeroberts commented 1 week ago

Hi @manandey, is the same error happening as before?

manandey commented 1 week ago

Hi @amyeroberts, the script worked fine when run on a GPU. I am trying to make it work on a Colab TPU, but it's not working as expected. Could you have a high-level look and suggest whether I am doing something wrong, or point me to some example scripts for running the Hugging Face Trainer on a TPU? Thanks!

amyeroberts commented 5 days ago

Hi @manandey, looking at the notebook, it seems you are running into a different problem from the original one in this issue. In that case, please open a new issue, as this ensures we can properly track open and resolved bugs.

There are guides available on running on TPU with Accelerate here: https://huggingface.co/docs/accelerate/en/concept_guides/training_tpu
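For example, a minimal sketch of launching a training function on a Colab TPU with Accelerate's `notebook_launcher`; the function body and process count are illustrative:

```python
from accelerate import notebook_launcher

def training_function():
    # Build the tokenizer, model, datasets and Seq2SeqTrainer here,
    # then call trainer.train().
    ...

# On a Colab TPU runtime this launches the training function across the TPU
# cores (8 is the usual core count on Colab TPUs; adjust if needed).
notebook_launcher(training_function, num_processes=8)
```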