huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Deberta Tokenization #8872

Closed yaysummeriscoming closed 3 years ago

yaysummeriscoming commented 3 years ago

Environment info

Who can help

@BigBird01 @LysandreJik

Information

I'd like to use the new deberta model, but it seems that the tokens aren't mapped correctly?

from transformers import AutoTokenizer

test_string = 'hello, I am a dog'

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
print('Roberta output is: ', tokenizer.tokenize(test_string))

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
print('Deberta output is: ', tokenizer.tokenize(test_string))

Roberta output is: ['hello', ',', 'ĠI', 'Ġam', 'Ġa', 'Ġdog']
Deberta output is: ['31373', '11', '314', '716', '257', '3290']

I'd expect deberta to give an output similar to roberta, rather than numbers.

yaysummeriscoming commented 3 years ago

@LysandreJik any update on this?

BigBird01 commented 3 years ago

@yaysummeriscoming To get sub words instead of numbers, you can call tokenizer.gpt2_tokenizer.decode(tokens). Please take a look at our code for reference.
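For context, the output above means the old DeBERTa tokenizer returns token IDs as strings rather than subword strings, and the suggested `decode` call maps those IDs back through the underlying GPT-2 BPE vocabulary. A minimal sketch of that ID-to-subword mapping, using a toy vocabulary built from the outputs shown earlier in this thread (the `vocab` dict stands in for the real GPT-2 vocabulary, which is far larger):

```python
# Toy ID-to-subword vocabulary, inferred from the Roberta/Deberta
# outputs quoted above. The real mapping lives inside the GPT-2
# byte-pair-encoding vocab that tokenizer.gpt2_tokenizer wraps.
vocab = {
    31373: 'hello',
    11: ',',
    314: 'ĠI',    # 'Ġ' marks a leading space in GPT-2-style BPE
    716: 'Ġam',
    257: 'Ġa',
    3290: 'Ġdog',
}

def ids_to_subwords(tokens):
    """Convert stringified token IDs back to subword strings."""
    return [vocab[int(t)] for t in tokens]

print(ids_to_subwords(['31373', '11', '314', '716', '257', '3290']))
# → ['hello', ',', 'ĠI', 'Ġam', 'Ġa', 'Ġdog']
```

This is the same lookup that `tokenizer.gpt2_tokenizer.decode(tokens)` performs internally; later transformers releases replaced this tokenizer so that `tokenize()` returns subword strings directly, like the other models.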

yaysummeriscoming commented 3 years ago

That did the trick, thanks!