microsoft / DeBERTa

The implementation of DeBERTa
MIT License

Cannot get the [MASK] token correctly tokenized #86

Open · tqfang opened this issue 2 years ago

tqfang commented 2 years ago

In the current version, when running:

from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
tokens = tokenizer.tokenize('[MASK]')
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokenizer.mask()))
print(tokenizer.convert_tokens_to_ids("[MASK]"))
print(tokenizer.vocab["[MASK]"])

Output:

['▁[', 'MAS', 'K', ']']
[4746, 829, 291, 179, 1015, 552]
[4746, 829, 291, 179, 1015, 552]
128000

Neither tokenize() followed by convert_tokens_to_ids() nor convert_tokens_to_ids() on the raw string yields the correct id of the special token "[MASK]" (128000); only the direct vocab lookup does.

Is this a bug or am I using the tokenizer in the wrong way? Thanks

s-JoL commented 2 years ago

The mask token is generated after tokenization: tokenize() does not recognize the literal string "[MASK]" in the input text, so the special token has to be inserted into the token sequence afterwards.
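In other words, the workaround is to tokenize only the ordinary text and splice the special token into the token list yourself, rather than passing the literal string "[MASK]" through tokenize(). A minimal self-contained sketch of that pattern, using a toy vocabulary dict in place of the real DeBERTa tokenizer (all names and ids here are illustrative, not the library's API):

```python
# Toy vocabulary standing in for the real SentencePiece vocab.
# Ids are made up, except that "[MASK]" mirrors the 128000 from the issue.
vocab = {"▁the": 5, "▁cat": 17, "▁sat": 42, "[MASK]": 128000}

def toy_tokenize(text):
    # Stand-in for tokenizer.tokenize(): whitespace split plus the
    # SentencePiece-style "▁" word-start marker.
    return ["▁" + word for word in text.split()]

def convert_tokens_to_ids(tokens):
    # Stand-in for tokenizer.convert_tokens_to_ids().
    return [vocab[t] for t in tokens]

# Tokenize the surrounding text, then splice "[MASK]" in as a single
# unit instead of letting the tokenizer split it into subword pieces.
tokens = toy_tokenize("the cat") + ["[MASK]"] + toy_tokenize("sat")
ids = convert_tokens_to_ids(tokens)
print(tokens)  # ['▁the', '▁cat', '[MASK]', '▁sat']
print(ids)     # [5, 17, 128000, 42]
```

With the real tokenizer the same idea applies: build the token list around the mask position and convert that list to ids, so the special token stays intact.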