iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

Missing Token `Ċ` in vocabulary for NER Model #30

Closed eevers-avisi closed 2 years ago

eevers-avisi commented 2 years ago

Hi @iPieter,

I was trying to use your robbert-v2-dutch-ner model in my code for fine-tuning. I would like to use the tokenizer as a fast tokenizer, so that I can use the word IDs to know which words the tokens originate from.
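For context, this is the usual reason to want a fast tokenizer for NER: its `word_ids()` output lets you map word-level labels onto subword tokens. A minimal sketch of that alignment step, with illustrative names and label values (not taken from the RobBERT repo):

```python
# Sketch: align word-level NER labels with subword tokens, assuming the
# word_ids() output of a fast tokenizer: one entry per token, giving the
# index of the originating word, or None for special tokens.
# The function name and label scheme here are illustrative.

def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    Special tokens (word id None) get ignore_index; every subword of a
    word inherits that word's label (a simple, common scheme).
    """
    return [
        ignore_index if wid is None else word_labels[wid]
        for wid in word_ids
    ]

# Example: a 4-word sentence where the first word is split into two
# subwords, with special tokens at both ends.
word_ids = [None, 0, 0, 1, 2, 3, None]
word_labels = [1, 0, 0, 1]  # e.g. 1 = LOC, 0 = O
print(align_labels(word_ids, word_labels))  # [-100, 1, 1, 0, 0, 1, -100]
```

Slow tokenizers do not expose `word_ids()`, which is why the fast variant matters here.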

Unfortunately, I'm not able to create a `RobertaTokenizerFast` from the pretrained model because of the following error: `Error while initializing BPE: Token Ċ out of vocabulary`

While looking for a solution, I found the following issue (https://github.com/huggingface/transformers/issues/9290), which describes what I think is the same problem for the robbert-v2-dutch-base model.

Could the same fix that was applied to the base model also be applied to the NER model?

iPieter commented 2 years ago

Thanks for your interest in RobBERT. That error was caused by an upgrade of the Tokenizers library, which changed the way the vocabulary is loaded. Fixing this was a bit of a hassle, which is why it took so long...

You can find the updated model here: https://huggingface.co/pdelobelle/robbert-v2-dutch-ner

I hope it's still useful for you!