google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0
38.04k stars 9.59k forks

Adding custom domain words and abbreviations to vocab.txt #1083

Open saklanipankaj opened 4 years ago

saklanipankaj commented 4 years ago

Hi, I am working on adding a few terms from my domain to vocab.txt. I am using the multilingual cased pre-trained model 'multi_cased_L-12_H-768_A-12', but I am unsure which words to add. Should the new words placed on the [unused X] lines be limited to terms that are unique to my domain and not found in an English dictionary, or should I also add frequently used words that are common English but missing from vocab.txt? Additionally, should abbreviations such as "LOL" be added, and should the lowercase version "lol" be added as well, since I am using a cased pre-trained model?
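
To make the setup concrete, here is a minimal sketch of the replacement step described above. It assumes the placeholder lines look like `[unused1]`, `[unused2]`, and so on; the helper name `fill_unused_slots` and the toy vocabulary are hypothetical and not part of the BERT repository. It only shows the mechanics of swapping placeholders for tokens that are not already present, and it does not settle which words are worth adding.

```python
import re

# Assumption: placeholder entries in vocab.txt have the form "[unusedN]".
UNUSED = re.compile(r"^\[unused\d+\]$")


def fill_unused_slots(vocab, new_tokens):
    """Return a copy of `vocab` with placeholder entries replaced by new tokens.

    Tokens already in the vocabulary are skipped. The comparison is
    case-sensitive, so "LOL" and "lol" count as different entries,
    which matches how a cased vocabulary file stores them.
    """
    existing = set(vocab)
    # Drop duplicates (keeping order) and tokens that are already present.
    pending = [t for t in dict.fromkeys(new_tokens) if t not in existing]

    result = list(vocab)
    for i, token in enumerate(result):
        if not pending:
            break
        if UNUSED.match(token):
            result[i] = pending.pop(0)

    if pending:
        raise ValueError(f"not enough placeholder slots for: {pending}")
    return result


# Toy vocabulary for illustration only; a real vocab.txt has one token per line.
toy_vocab = ["[PAD]", "[unused1]", "[unused2]", "[UNK]", "the", "lol"]
updated = fill_unused_slots(toy_vocab, ["LOL", "lol", "kubectl"])
print(updated)
```

Replacing placeholders in place, rather than appending lines, keeps the number of vocabulary entries unchanged, so each line still maps to an existing row of the pre-trained embedding table.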

Crescentz commented 4 years ago

Have you found a way?