Hey, there is a small typo indeed:
```diff
  (
      f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
      f" vocab_size={self.vocab_size}, model_max_length={self.model_max_length}, is_fast={self.is_fast},"
      f" padding_side='{self.padding_side}', truncation_side='{self.truncation_side}',"
-     f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces})"
+     f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces},"
-     " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}"
+     " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}\n)"
  )
```
is probably what you want!
do you want to open a PR? 🤗
Hi there, thanks for catching that typo and providing the fix! I appreciate you taking the time to point it out. If @gpetho is busy at the moment, I’d be happy to quickly submit a small PR with the correction.
System Info
transformers 4.46.0, any OS and Python version
Who can help?
@ArthurZucker @itazap
Expected behavior
The repr of tokenizer objects is incorrectly formatted due to this part of the code: https://github.com/huggingface/transformers/blob/1d063793318b20654ebb850f48f43e0a247ab7bb/src/transformers/tokenization_utils_base.py#L1684C1-L1692C10
The repr of a Tokenizer object looks like this:

`Tokenizer(...), added_tokens_decoder={...}`

whereas it should look like this:

`Tokenizer(..., added_tokens_decoder={...})`
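A minimal way to see this (the checkpoint name below is just an illustrative choice; any pretrained tokenizer shows the same pattern):

```python
# Minimal sketch: print the repr of a pretrained tokenizer.
# "bert-base-uncased" is only an example checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(repr(tokenizer))  # added_tokens_decoder={...} currently prints after the closing ")"
```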
The dict that is the value of the `added_tokens_decoder` attribute should be listed within the parentheses along with the other attributes, not after the closing parenthesis.

The current representation is problematic because having the `added_tokens_decoder` outside the main parenthesized structure breaks the expected flow of representing object attributes, and it's confusing. It suggests that the relationship between the tokenizer parameters and the added tokens decoder is different from what it actually is. Someone reading the string representation could assume it's a separate entity instead of an attribute belonging to the tokenizer.

Lines 1690-1691 should be corrected like this:
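A sketch of the intended correction, mirroring the diff quoted in the reply above (the only change is that the `)` closing the repr string moves after the `added_tokens_decoder` part):

```python
# Sketch of the corrected expression in __repr__ (tokenization_utils_base.py):
# a "," replaces the early ")" on the special_tokens line, and the final ")"
# now comes after the added_tokens_decoder dict.
(
    f"{self.__class__.__name__}(name_or_path='{self.name_or_path}',"
    f" vocab_size={self.vocab_size}, model_max_length={self.model_max_length}, is_fast={self.is_fast},"
    f" padding_side='{self.padding_side}', truncation_side='{self.truncation_side}',"
    f" special_tokens={self.special_tokens_map}, clean_up_tokenization_spaces={self.clean_up_tokenization_spaces},"
    " added_tokens_decoder={\n\t" + added_tokens_decoder_rep + "\n}\n)"
)
```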