huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Deberta Tokenization #8872

Closed yaysummeriscoming closed 3 years ago

yaysummeriscoming commented 3 years ago

Environment info

Who can help

@BigBird01 @LysandreJik

Information

I'd like to use the new deberta model, but it seems that the tokens aren't mapped correctly?

from transformers import AutoTokenizer

test_string = 'hello, I am a dog'

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
print('Roberta output is: ', tokenizer.tokenize(test_string))

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
print('Deberta output is: ', tokenizer.tokenize(test_string))

Roberta output is: ['hello', ',', 'ĠI', 'Ġam', 'Ġa', 'Ġdog']
Deberta output is: ['31373', '11', '314', '716', '257', '3290']

I'd expect deberta to give an output similar to roberta, rather than numbers.

yaysummeriscoming commented 3 years ago

@LysandreJik any update on this?

BigBird01 commented 3 years ago

@yaysummeriscoming To get sub words instead of numbers, you can call tokenizer.gpt2_tokenizer.decode(tokens). Please take a look at our code for reference.
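For context, the output above means the old DeBERTa tokenizer returns token IDs as strings rather than subword strings, and the suggested `decode` call maps those IDs back through the underlying GPT-2 BPE vocabulary. A minimal sketch of that ID-to-subword mapping, using a toy vocabulary built from the outputs shown earlier in this thread (the `vocab` dict stands in for the real GPT-2 vocabulary, which is far larger):

```python
# Toy ID-to-subword vocabulary, inferred from the Roberta/Deberta
# outputs quoted above. The real mapping lives inside the GPT-2
# byte-pair-encoding vocab that tokenizer.gpt2_tokenizer wraps.
vocab = {
    31373: 'hello',
    11: ',',
    314: 'ĠI',    # 'Ġ' marks a leading space in GPT-2-style BPE
    716: 'Ġam',
    257: 'Ġa',
    3290: 'Ġdog',
}

def ids_to_subwords(tokens):
    """Convert stringified token IDs back to subword strings."""
    return [vocab[int(t)] for t in tokens]

print(ids_to_subwords(['31373', '11', '314', '716', '257', '3290']))
# → ['hello', ',', 'ĠI', 'Ġam', 'Ġa', 'Ġdog']
```

This is the same lookup that `tokenizer.gpt2_tokenizer.decode(tokens)` performs internally; later transformers releases replaced this tokenizer so that `tokenize()` returns subword strings directly, like the other models.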

yaysummeriscoming commented 3 years ago

That did the trick, thanks!