huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Incorrect decode from GPTNeox tokenizer. #25840

Closed by tbenthompson 11 months ago

tbenthompson commented 1 year ago

System Info

Reproduction

When running this code:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/pythia-12b-deduped")
text = " there is this someone 'you' who has the ability of 'sensing things"
ids = tokenizer.encode(text)
# Decode the whole sequence of ids at once.
print(repr(tokenizer.decode(ids)))
# Decode each token id separately, then join the pieces.
print(repr("".join(tokenizer.batch_decode(ids))))

I get the output:

" there is this someone 'you' who has the ability of'sensing things"
" there is this someone 'you' who has the ability of 'sensing things"

Expected behavior

The first string, produced by tokenizer.decode, is decoded incorrectly: it is missing the space before 'sensing. The second string, built by joining the output of batch_decode, is correct and matches the input text.

amyeroberts commented 1 year ago

cc @ArthurZucker

ArthurZucker commented 1 year ago

You should try with tokenizer.decode(ids, clean_up_tokenization_spaces=False):

>>> import transformers

>>> tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/pythia-12b-deduped")
>>> text = " there is this someone 'you' who has the ability of 'sensing things"
>>> ids = tokenizer.encode(text)
>>> print(repr(tokenizer.decode(ids, clean_up_tokenization_spaces=False)))
" there is this someone 'you' who has the ability of 'sensing things"

tbenthompson commented 1 year ago

Ok, thanks! For my own understanding, why is the default clean_up_tokenization_spaces=True? Without that setting, decode and encode are much closer to being inverses of each other. Intuitively, that seems like it should be the default goal of decode.
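
To make the "inverse of each other" property concrete, here is a minimal sketch of a round-trip check. The helper name `find_roundtrip_mismatches` and the byte-level stand-in codec are hypothetical and only there so the example runs without downloading a model; nothing here is part of transformers.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def find_roundtrip_mismatches(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    texts: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (original, decoded) pairs where decode(encode(text)) != text."""
    mismatches = []
    for text in texts:
        decoded = decode(encode(text))
        if decoded != text:
            mismatches.append((text, decoded))
    return mismatches


# Hypothetical stand-in codec: UTF-8 bytes as "token ids".
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


# A decoder with one cleanup-style rewrite, to show what a mismatch looks like.
def lossy_decode(ids: Sequence[int]) -> str:
    return byte_decode(ids).replace(" 's", "'s")


samples = [
    " there is this someone 'you' who has the ability of 'sensing things",
    "plain text with no quotes",
]

exact = find_roundtrip_mismatches(byte_encode, byte_decode, samples)
lossy = find_roundtrip_mismatches(byte_encode, lossy_decode, samples)
print(len(exact), len(lossy))
```

With the tokenizer from this issue, the same helper could presumably be called with `tokenizer.encode` and `functools.partial(tokenizer.decode, clean_up_tokenization_spaces=False)`. For other tokenizers, special tokens added during encoding may also break the round trip, so treat this as a rough check rather than a guarantee.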

ArthurZucker commented 1 year ago

Good question. It has been that way for a long time; I think it was meant to reflect some of our original tokenizers, which added spaces. I'll check whether we can safely remove this or set it to False by default!
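
For context, here is a rough sketch of what the cleanup step appears to do and why it only bites when the whole string is decoded at once. The rule list approximates the string replacements in the library's `clean_up_tokenization` helper; the exact list is an assumption and may differ between versions. The token pieces are hypothetical, not the real Pythia split.

```python
# Approximate cleanup rules (an assumption; check your installed version).
CLEANUP_RULES = [
    (" .", "."),
    (" ?", "?"),
    (" !", "!"),
    (" ,", ","),
    (" ' ", "'"),
    (" n't", "n't"),
    (" 'm", "'m"),
    (" 's", "'s"),
    (" 've", "'ve"),
    (" 're", "'re"),
]


def clean_up(text: str) -> str:
    """Apply each replacement in order, as a stand-in for the library cleanup."""
    for old, new in CLEANUP_RULES:
        text = text.replace(old, new)
    return text


text = " there is this someone 'you' who has the ability of 'sensing things"

# Cleaning the full string: the " 's" rule matches inside " 'sensing".
whole = clean_up(text)

# Hypothetical token pieces: no single piece contains " 's",
# so cleaning each piece separately leaves the text untouched.
pieces = [
    " there is this someone", " '", "you", "'",
    " who has the ability of", " '", "sensing", " things",
]
per_piece = "".join(clean_up(piece) for piece in pieces)

print(repr(whole))
print(repr(per_piece))
```

Under these assumptions, the `" 's"` rule, presumably meant to reattach possessives such as `it 's`, also fires on an opening quote followed by a word that starts with "s". That would explain why decode drops the space while joining the per-token batch_decode output does not.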

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.