Open jonathanasdf opened 1 month ago
The following code caused this bug:
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )
    return out_string
This code is located in the file transformers\transformers\src\transformers\tokenization_utils_base.py at line 4075.
The error can also be reproduced with any word that starts with one of these subwords: m, s, ve, re, whenever that word directly follows a space and an apostrophe.
For example:
tok.decode([1232, 364, 23129])
result:
':'verb
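To see the effect without loading a tokenizer, here is a minimal sketch that copies the replace chain quoted above into a standalone function. The name clean_up_tokenization and the raw input strings are assumptions for illustration: "': 'search" matches the uncleaned output shown later in this thread, and "': 'verb" is assumed to be the uncleaned form of the example above.

```python
def clean_up_tokenization(out_string: str) -> str:
    """Standalone copy of the replace chain quoted in this issue."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )


# " 's" matches the start of " 'search", so the space after the colon is dropped.
print(repr(clean_up_tokenization("': 'search")))  # "':'search"

# " 've" matches the start of " 'verb" (assumed raw string), same effect.
print(repr(clean_up_tokenization("': 'verb")))  # "':'verb"
```

The replacements are plain substring matches, so they cannot tell a contraction like "he 's" from an opening quote followed by a word that happens to start with s.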
You basically just have to set cleanup_tokenization_spaces = False
>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained('baseten/Meta-Llama-3-tokenizer')
>>> tok.decode([1232, 364], cleanup_tokenization_spaces=False)
"': '"
>>> tok.decode([364, 1874], cleanup_tokenization_spaces=False)
"'search"
>>> tok.decode([1232, 364, 1874], cleanup_tokenization_spaces=False)
"':'search"
Sorry, it's clean_up_tokenization_spaces:
>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained('baseten/Meta-Llama-3-tokenizer')
>>> tok.decode([1232, 364, 1874], clean_up_tokenization_spaces=False)
"': 'search"
System Info
transformers version: 4.44.0
Who can help?
@ArthurZucker
Reproduction
Expected behavior
Output should be ': 'search; the space between the colon and the quote character should be kept.
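One possible way to get the expected output while still cleaning up real contractions is to require a word boundary after the contraction suffix. This is only a sketch of the idea, not the library's code or an agreed fix; the function name clean_up_strict and the regex are hypothetical.

```python
import re

# Hypothetical stricter cleanup: only contract " n't", " 'm", " 's", " 've",
# " 're" when the suffix ends at a word boundary (i.e. it is the whole token).
_CONTRACTION = re.compile(r" (n't|'(?:m|s|ve|re))\b")


def clean_up_strict(text: str) -> str:
    # Punctuation handling is unchanged from the original replace chain.
    for punct in (".", "?", "!", ","):
        text = text.replace(" " + punct, punct)
    text = text.replace(" ' ", "'")
    # Drop the leading space only for genuine contraction suffixes.
    return _CONTRACTION.sub(r"\1", text)


print(repr(clean_up_strict("': 'search")))  # "': 'search"
print(repr(clean_up_strict("he 's here .")))  # "he's here."
```

The trade-off is that a quoted one-letter word such as 's' on its own would still be contracted, so this narrows the problem rather than removing it.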