iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

Tokenizer English or Dutch #6

Closed erwin314 closed 4 years ago

erwin314 commented 4 years ago

When I download the pretrained RoBERTa tokenizer and inspect merges.txt and vocab.json, they seem to be English. I was expecting Dutch.

Are the files correct, or is the download (or my expectation) wrong?

To download and inspect, use:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-base")
tokenizer.save_vocabulary("/somefolder")
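
For instance, tokenizing a common Dutch word makes the mismatch visible (a quick check using the tokenizer's tokenize method; the exact subword pieces may vary):

# An English-trained BPE vocabulary tends to split everyday Dutch words
# into several subword pieces instead of keeping them as single tokens.
print(tokenizer.tokenize("fietsenstalling"))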
twinters commented 4 years ago

Hi Erwin,

These files are correct.

RobBERT was trained using the default Fairseq RoBERTa tokenizer, which is indeed an English tokenizer. We suspect the model still performs really well because Dutch and English are quite closely related languages. As mentioned in the future work section of our paper, we are currently looking into changing and improving the tokenizer in a future version, using the new tokenizers repository that HuggingFace recently released.
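
For reference, training a Dutch tokenizer with that library could look roughly like this (a minimal sketch, not our actual setup; the corpus path, vocabulary size, and special tokens below are illustrative placeholders):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a Dutch corpus (hypothetical path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["dutch_corpus.txt"],
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# This writes the same vocab.json and merges.txt files you inspected,
# but with merges learned from Dutch text instead of English.
tokenizer.save_model("dutch-tokenizer")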

Hope this helps!

iPieter commented 4 years ago

As @twinters already commented, those files are indeed correct.

It's interesting that we do achieve SOTA results, even on tasks with typically Dutch tokens (die/dat). This is likely due to Dutch and English sharing the same Latin script, being closely related languages, and borrowing a lot of words. GPT-2, which also uses BPE tokenization, shows a similar phenomenon for Dutch, even without finetuning (you can try it in HuggingFace's demo).
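
You can see the same effect locally by running GPT-2's tokenizer on a Dutch sentence (a small sketch using the transformers library; the exact segmentation depends on the sentence):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2's byte-level BPE segments Dutch into English-oriented subwords,
# but never falls back to an unknown token.
print(tokenizer.tokenize("Ik weet niet of die fiets van hem is."))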

Given this, we've released our first version with Fairseq's standard tokenization and are planning to release a second version, as well as an in-depth discussion of the effects of the tokenization.

Since the files are correct, I'm closing this issue.

erwin314 commented 4 years ago

Hi Thomas, thanks for your help! I had the pleasure of reading your paper a while back; it seems I missed the mention of this in the future work section.

Looking forward to the next version. Including Dutch tokens and Unicode glyphs should give some improvement.

erwin314 commented 4 years ago

And thank you, Pieter. As I said, I'm looking forward to reading the next version.