iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

Choice of Vocabulary #12

Closed: tomseinen closed this issue 4 years ago

tomseinen commented 4 years ago

Hello,

Great work! I was just wondering why you used an English vocabulary for a Dutch model. Is there a specific reason for that? I saw, for example, that the Dutch BERT model (BERTje) uses a Dutch vocab, and a Spanish model (RuPERTa) uses a Spanish vocab.

Using a Dutch vocab would probably improve RobBERT's performance. What do you think?
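The intuition here is that a vocabulary learned on English text tends to fragment Dutch words into more, less meaningful pieces, which wastes model capacity and context length. A toy greedy longest-match tokenizer can sketch this (a simplification of the byte-level BPE RoBERTa actually uses; the two vocabularies below are made up for illustration, not RobBERT's real merges):

```python
# Toy greedy longest-match subword tokenizer. This is a simplification of
# how BPE/WordPiece vocabularies split words; real RoBERTa uses byte-level BPE.
def tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest possible piece first, falling back to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: emit it on its own
            i += 1
    return pieces

word = "taalmodel"  # Dutch for "language model"

# Invented vocabularies for illustration only: the "English" one lacks
# whole Dutch subwords, the "Dutch" one contains them.
english_vocab = {"ta", "al", "mod", "el"} | set(word)
dutch_vocab = {"taal", "model"} | set(word)

print(tokenize(word, english_vocab))  # ['ta', 'al', 'mod', 'el']
print(tokenize(word, dutch_vocab))    # ['taal', 'model']
```

The same word costs four tokens under the English-style vocabulary but only two under the Dutch one, so a Dutch vocab effectively doubles the text that fits in the model's context window for such words.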

Thank you

tomseinen commented 4 years ago

My bad, I didn't see issue #6. But I am still curious what would happen if you used a Dutch tokenizer.