iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

Missing Token `Ċ` in vocabulary for NER Model #30

Closed eevers-avisi closed 2 years ago

eevers-avisi commented 2 years ago

Hi @iPieter,

I was trying to use your robbert-v2-dutch-ner model in my code for fine-tuning. I would like to use the tokenizer as a fast tokenizer, so that I can use the word IDs to know which words the tokens originate from.
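For context, this is the usual reason to want a fast tokenizer for NER: its `word_ids()` output lets you map word-level labels onto subword tokens. A minimal sketch of that alignment step, with illustrative names and label values (not taken from the RobBERT repo):

```python
# Sketch: align word-level NER labels with subword tokens, assuming the
# word_ids() output of a fast tokenizer: one entry per token, giving the
# index of the originating word, or None for special tokens.
# The function name and label scheme here are illustrative.

def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    Special tokens (word id None) get ignore_index; every subword of a
    word inherits that word's label (a simple, common scheme).
    """
    return [
        ignore_index if wid is None else word_labels[wid]
        for wid in word_ids
    ]

# Example: a 4-word sentence where the first word is split into two
# subwords, with special tokens at both ends.
word_ids = [None, 0, 0, 1, 2, 3, None]
word_labels = [1, 0, 0, 1]  # e.g. 1 = LOC, 0 = O
print(align_labels(word_ids, word_labels))  # [-100, 1, 1, 0, 0, 1, -100]
```

Slow tokenizers do not expose `word_ids()`, which is why the fast variant matters here.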

Unfortunately, I'm not able to create a `RobertaTokenizerFast` from the pretrained model because of the following error: `Error while initializing BPE: Token Ċ out of vocabulary`

While looking for a solution, I found the following issue (https://github.com/huggingface/transformers/issues/9290), which describes what I think is the same problem for the robbert-v2-dutch-base model.

Could the same fix that was applied to the base model also be applied to the NER model?

iPieter commented 2 years ago

Thanks for your interest in RobBERT. That error was caused by an upgrade of the Tokenizers library, which changed the way the vocabulary is loaded. Fixing this was a bit of a hassle, which is why it took so long...

You can find the updated model here: https://huggingface.co/pdelobelle/robbert-v2-dutch-ner

I hope it's still useful for you!