huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Tokenizer regression in 4.43-dev affecting Aya/Command-r 35B models #32081

Open Qubitium opened 1 month ago

Qubitium commented 1 month ago

System Info

Ubuntu 22.04

Who can help?

@ArthurZucker

Reproduction

4.42.4 does not have this issue. The regression (crash) only happens on the tip of transformers main.

Traceback (most recent call last):
  File "/root/projects/go/python/ai/train/sft_trainer.py", line 673, in <module>
    trainer = SFTTrainer(
              ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 373, in __init__
    train_dataset = self._prepare_dataset(
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 519, in _prepare_dataset
    return self._prepare_non_packed_dataloader(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 587, in _prepare_non_packed_dataloader
    tokenized_dataset = dataset.map(
                        ^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3161, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/root/miniconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
    batch = apply_function_on_filtered_inputs(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3421, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 557, in tokenize
    outputs = tokenizer(
              ^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2945, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3032, in _call_one
    return self.batch_encode_plus(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3228, in batch_encode_plus
    return self._batch_encode_plus(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/cohere/tokenization_cohere_fast.py", line 174, in _batch_encode_plus
    return super()._batch_encode_plus(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 561, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
               ~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
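
A note on reading the last frame: the `~`/`^` markers sit under `tokens_and_encodings` and the first `[0]`, so the outer list itself appears to have been empty when it was indexed. That would be consistent with the tokenizer receiving an empty batch, though the thread does not confirm the cause. The sketch below only illustrates that failure pattern in plain Python; `collect_keys` and `collect_keys_guarded` are hypothetical helpers written for this illustration, not functions from transformers.

```python
# Hypothetical stand-ins for the failing expression in the traceback.
# These are NOT the library's functions; they only mirror the indexing pattern.

def collect_keys(tokens_and_encodings):
    # Same shape as the failing line: the first [0] raises IndexError
    # when the outer list is empty.
    return list(tokens_and_encodings[0][0].keys())


def collect_keys_guarded(tokens_and_encodings):
    # One possible defensive variant (an assumption, not the upstream fix):
    # return no keys for an empty batch instead of indexing into it.
    if not tokens_and_encodings:
        return []
    return list(tokens_and_encodings[0][0].keys())


# An empty batch reproduces the same exception type as the report.
try:
    collect_keys([])
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# A non-empty batch of (tokens, encoding) pairs works as expected.
sample_batch = [({"input_ids": [1, 2], "attention_mask": [1, 1]}, None)]
sample_keys = collect_keys(sample_batch)
```

If this reading is right, checking whether any batch passed to `dataset.map` ends up empty would be a quick way to narrow down a reproduction.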

Expected behavior

The tokenizer should not crash.

ArthurZucker commented 1 month ago

Hey! Sorry, could you share a small reproduction snippet?

ArthurZucker commented 1 month ago

cc @itazap as well!

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.