mosaicml / examples

Fast and flexible reference benchmarks

Confusion regarding conflicting information in model card of "mosaic-bert" on Hugging Face #408

Closed mscherrmann closed 1 year ago

mscherrmann commented 1 year ago

I have been referring to the model card of "mosaic-bert" on Hugging Face, and I noticed some conflicting information. The model card states that the vocabulary size was increased to a multiple of both 8 and 64 (from 30,522 to 30,528 tokens). This suggests that either a new vocabulary was fitted or additional tokens were added to the standard bert-base tokenizer.

However, later in the model card it states that "The tokenizer for this model is simply the Hugging Face bert-base-uncased tokenizer" with the original 30,522 tokens, which seems to contradict the earlier statement about the increased vocabulary size.

Additionally, the accompanying blog post mentions that one can use a custom vocabulary and tokenizer for a specific domain. I am intrigued by this possibility but unsure how it can be accomplished. I would appreciate further clarification on how to use a custom vocabulary and tokenizer with the "mosaic-bert" model.

Thank you for your attention to this matter. I look forward to your response and clarification regarding these issues.

dakinggg commented 1 year ago

The tokenizer is unchanged, but the model's word embedding layer is expanded for hardware efficiency. This means that there are a few unused tokens that are never seen during training.
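
For illustration, here is a minimal sketch of what that padding looks like with the Hugging Face transformers API; this is an assumption-labeled illustration, not the actual MosaicML training code. The tokenizer keeps its 30,522 tokens while the model's embedding matrix is rounded up to the next multiple of 64:

```python
# Minimal sketch (assumes the Hugging Face `transformers` library is
# installed); illustrates vocab padding for hardware efficiency and is
# not the MosaicML training code itself.
import math

from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

print(len(tokenizer))  # 30522 -- the tokenizer itself is unchanged

# Round the embedding size up to the next multiple of 64.
multiple = 64
padded_vocab = math.ceil(len(tokenizer) / multiple) * multiple  # 30528
model.resize_token_embeddings(padded_vocab)

# The last 30528 - 30522 = 6 embedding rows correspond to token ids the
# tokenizer never produces, so they are never seen during training.
```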

To employ a new tokenizer, you would need to (1) create that tokenizer, and (2) train a new model using data processed by that tokenizer. We don't have a guide for (1), but I recommend looking through Hugging Face's materials on the subject (https://huggingface.co/learn/nlp-course/chapter6/2?fw=pt#training-a-new-tokenizer could be a good starting point; see the sketch below). For (2), once you have your new tokenizer, you can basically just plug it into the existing code/yamls.
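
As a rough sketch of step (1), following the linked Hugging Face course chapter, you can retrain the existing bert-base-uncased tokenizer on your own corpus. The corpus path and output directory below are hypothetical placeholders:

```python
# Hedged sketch of step (1): retrain the bert-base-uncased tokenizer on a
# domain corpus via train_new_from_iterator, as covered in the linked
# Hugging Face course chapter. `domain_corpus.txt` and
# `my-domain-tokenizer` are hypothetical placeholders.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def corpus_iterator(path="domain_corpus.txt", batch_size=1000):
    # Yield batches of raw text lines from the domain corpus.
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=30522
)
new_tokenizer.save_pretrained("my-domain-tokenizer")
```

For step (2), the pretraining yamls in this repo take a tokenizer name, so the assumption is you can point that field at the saved tokenizer directory instead of bert-base-uncased; check the yamls for the exact key.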

mscherrmann commented 1 year ago

Perfect, thank you very much!