huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Incorrect repr string for tokenizer objects #34437

Open · gpetho opened this issue 6 days ago

gpetho commented 6 days ago

System Info

transformers 4.46.0; any OS and Python version.

Who can help?

@ArthurZucker @itazap

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct') 
print(tokenizer)
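With transformers 4.46.0 this prints something of the following shape (abbreviated; the class name and values here are illustrative and depend on the checkpoint). Note that the closing parenthesis appears before added_tokens_decoder rather than at the end:

Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-0.5B-Instruct', vocab_size=..., model_max_length=..., is_fast=True, padding_side='...', truncation_side='...', special_tokens={...}, clean_up_tokenization_spaces=False), added_tokens_decoder={
	...: AddedToken("...", ...),
	...
}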

Expected behavior

The repr of tokenizer objects is incorrectly formatted due to this part of the code: https://github.com/huggingface/transformers/blob/1d063793318b20654ebb850f48f43e0a247ab7bb/src/transformers/tokenization_utils_base.py#L1684C1-L1692C10

The repr of a tokenizer object currently looks like this: Tokenizer(...), added_tokens_decoder={...}, whereas it should look like this: Tokenizer(..., added_tokens_decoder={...}). The dict that is the value of the added_tokens_decoder attribute should be listed within the parentheses along with the other attributes, not after the closing parenthesis.

The current representation is problematic because placing added_tokens_decoder outside the main parenthesized structure breaks the expected pattern for representing object attributes, and it is confusing: it suggests a relationship between the tokenizer parameters and the added tokens decoder that differs from the actual one. Someone reading the string representation could take added_tokens_decoder for a separate entity rather than an attribute of the tokenizer.

Lines 1690-1691 should be corrected like this:

            f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces}, "
            " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n})"
ArthurZucker commented 4 days ago

Hey, there is a small typo indeed:

(
            f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
            f" vocab_size={self.vocab_size}, model_max_length={self.model_max_length}, is_fast={self.is_fast},"
            f" padding_side='{self.padding_side}', truncation_side='{self.truncation_side}',"
-            f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces})"
+            f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces},"
-            " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}"
+            " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}\n)"
        )

is probably what you want!

Do you want to open a PR? 🤗
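For anyone who wants to see the behavior without loading a checkpoint, the pattern can be reproduced with a toy class that mimics the repr construction in tokenization_utils_base.py. This is a minimal sketch with the fix applied; ToyTokenizer and its attribute values are made up for illustration and are not part of transformers:

class ToyTokenizer:
    def __init__(self):
        # Illustrative attributes standing in for the real tokenizer state.
        self.name_or_path = "toy"
        self.vocab_size = 10
        self.added_tokens_decoder = {0: "<pad>", 1: "<eos>"}

    def __repr__(self):
        # Same joining scheme as transformers: one "key: value," entry per line.
        added_tokens_decoder_rep = "\n\t".join(
            f"{k}: {v!r}," for k, v in self.added_tokens_decoder.items()
        )
        return (
            f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
            f" vocab_size={self.vocab_size},"
            # Before the fix, ")" closed the repr on the line above and the
            # dict dangled after it; with the fix, ")" follows the dict.
            " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}\n)"
        )

print(ToyTokenizer())
# ToyTokenizer(name_or_path='toy', vocab_size=10, added_tokens_decoder={
#	0: '<pad>',
#	1: '<eos>',
# }
# )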

bhargavyagnik commented 4 days ago

Hi there, thanks for catching that typo and providing the fix! I appreciate you taking the time to point it out. If @gpetho is busy at the moment, I’d be happy to quickly submit a small PR with the correction.