huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Tokenizers v0.20.2 fails on batches as tuples #1672

Closed · OyvindTafjord closed 1 week ago

OyvindTafjord commented 1 week ago

Certain fast tokenizers now fail on batches given as tuples. For example (on a MacBook M2 with transformers 4.46.1):

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
>>> tok.batch_encode_plus(("hello there", "bye bye bye"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/oyvindt/miniconda3/envs/oe-eval/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3311, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/oyvindt/miniconda3/envs/oe-eval/lib/python3.10/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 127, in _batch_encode_plus
    return super()._batch_encode_plus(*args, **kwargs)
  File "/Users/oyvindt/miniconda3/envs/oe-eval/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 529, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: argument 'input': 'tuple' object cannot be converted to 'PyList'

This works in v0.20.1, so the regression is presumably related to this PR: https://github.com/huggingface/tokenizers/pull/1665
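
To isolate it from transformers, here is a minimal sketch at the tokenizers level, assuming the fast tokenizer simply forwards the batch to Tokenizer.encode_batch as the traceback suggests:

>>> from tokenizers import Tokenizer
>>> t = Tokenizer.from_pretrained("EleutherAI/pythia-160m")
>>> t.encode_batch(["hello there", "bye bye bye"])  # fine
>>> t.encode_batch(("hello there", "bye bye bye"))  # same TypeError on v0.20.2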

The code for batch_encode_plus in transformers claims to accept both tuples and lists:

        if not isinstance(batch_text_or_text_pairs, (tuple, list)):
            raise TypeError(
                f"batch_text_or_text_pairs has to be a list or a tuple (got {type(batch_text_or_text_pairs)})"
            )
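
So the tuple passes the transformers-side check and only fails at the Rust boundary. Until a fix lands, a workaround sketch (assuming only the outer container type matters) is to convert the batch to a list first:

>>> batch = ("hello there", "bye bye bye")
>>> tok.batch_encode_plus(list(batch))  # outer tuple converted to list, no TypeError
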
argitrage commented 1 week ago

Facing the same issue; reverting to 0.20.1 fixes it!
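
(For reference, the pin with pip, assuming a pip-managed environment: pip install tokenizers==0.20.1)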

ArthurZucker commented 1 week ago

Ah shit, I can reproduce. Having a look ASAP!