huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Llama3 Tokenizer Decode Removing Space Character #32575

Open jonathanasdf opened 1 month ago

jonathanasdf commented 1 month ago

Who can help?

@ArthurZucker

Reproduction

>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained('baseten/Meta-Llama-3-tokenizer')
>>> tok.decode([1232, 364])
"': '"
>>> tok.decode([364, 1874])
"'search"
>>> tok.decode([1232, 364, 1874])
"':'search"

Expected behavior

Output should be "': 'search"; the space between the colon and the quote character should be kept.

YOUSEFNANIS commented 1 month ago

The following code causes this bug:

out_string = (
    out_string.replace(" .", ".")
    .replace(" ?", "?")
    .replace(" !", "!")
    .replace(" ,", ",")
    .replace(" ' ", "'")
    .replace(" n't", "n't")
    .replace(" 'm", "'m")
    .replace(" 's", "'s")
    .replace(" 've", "'ve")
    .replace(" 're", "'re")
)
return out_string

This code is located in src/transformers/tokenization_utils_base.py at line 4075. The error can also be reproduced whenever the word following the quote starts with one of the subwords m, s, ve, or re. For example:

>>> tok.decode([1232, 364, 23129])
"':'verb"
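To see exactly which rule eats the space, here is a minimal, self-contained sketch of the replacement chain quoted above (copied from the snippet in this comment, not from the live transformers source): for the string "': 'search", the contraction rule for " 's" is the one that matches and strips the space.

```python
def clean_up_tokenization(out_string):
    # Same replacement chain as the snippet above: rules meant to re-attach
    # punctuation and contractions ("it 's" -> "it's") after decoding.
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )
    return out_string

raw = "': 'search"  # what decode() produces before the clean-up pass
# The " 's" -> "'s" rule matches the substring ": 's" and removes the space.
print(clean_up_tokenization(raw))  # prints ':'search
```

Because the rules are plain substring replacements, any word beginning with s, m, ve, or re that happens to follow a space-and-apostrophe sequence is mangled the same way, which is why skipping the clean-up pass entirely is the workaround discussed below.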

ArthurZucker commented 1 month ago

You basically just have to set cleanup_tokenization_spaces = False

jonathanasdf commented 1 month ago

>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained('baseten/Meta-Llama-3-tokenizer')
>>> tok.decode([1232, 364], cleanup_tokenization_spaces=False)
"': '"
>>> tok.decode([364, 1874], cleanup_tokenization_spaces=False)
"'search"
>>> tok.decode([1232, 364, 1874], cleanup_tokenization_spaces=False)
"':'search"
ArthurZucker commented 1 month ago

Sorry, it's clean_up_tokenization_spaces:

>>> import transformers
>>> tok = transformers.AutoTokenizer.from_pretrained('baseten/Meta-Llama-3-tokenizer')
>>> tok.decode([1232, 364, 1874], clean_up_tokenization_spaces=False)
"': 'search"
github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.