neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

'[UNK]' during tokenization when word starts with 'q' #34

Open shubhanshu786 opened 3 years ago

shubhanshu786 commented 3 years ago

Hi, I hit an issue with the tokenizer when tokenizing text that starts with 'q'. I found that 'q' is missing from the vocab.txt file ('Q' is present).

```python
>>> tokenizer.tokenize('q qnm')
['[UNK]', '[UNK]']
```

A simple fix I tried: add 'q' to the tokenizer with the Hugging Face `add_tokens` method, but it failed to produce the exact/correct tokenization.

```python
>>> tokenizer.add_tokens(['q'])
>>> tokenizer.tokenize('q qnm')
['q', 'q', 'n', '##m']
```

Here 'n' should be '##n'. Because 'q' was added as a separate token, the tokenizer splits it out of 'qnm' and tokenizes the remainder as the start of a new word, which is not a correct solution down the line.

Suggested solution: add 'q' to the vocab.txt file itself, which produces the correct tokenization. (I appended it at the end of vocab.txt and resized the model's embedding matrix; I'm not sure how this affects the model downstream and have yet to test it.)

```python
>>> tokenizer.tokenize('q qnm')
['q', 'q', '##n', '##m']
```
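For context on why the vocab.txt fix works where `add_tokens` does not: WordPiece tokenization is greedy longest-match-first over the vocabulary, and a word whose first character has no matching piece collapses entirely to `[UNK]`. A minimal pure-Python sketch (with a toy vocabulary standing in for the real vocab.txt, not the actual tokenizer code) reproduces both behaviors:

```python
# Toy WordPiece tokenizer: greedy longest-match-first, as in BERT.
# The vocabularies below are illustrative stand-ins, not the real vocab.txt.

def wordpiece(word, vocab, unk="[UNK]"):
    """Split a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            # Non-initial pieces carry the '##' continuation prefix.
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

BROKEN_VOCAB = {"Q", "##n", "##m"}   # 'q' missing, mirroring the released vocab.txt
FIXED_VOCAB = BROKEN_VOCAB | {"q"}   # 'q' appended to the vocabulary

print(wordpiece("qnm", BROKEN_VOCAB))  # ['[UNK]']
print(wordpiece("qnm", FIXED_VOCAB))   # ['q', '##n', '##m']
```

Note that if the fix is applied through `transformers` rather than by editing vocab.txt on disk, the model's embedding matrix must be resized to match the new vocabulary size, e.g. with `model.resize_token_embeddings(len(tokenizer))`.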

I hope you will release an updated vocab.txt file for the tokenizer with the token 'q' added.