huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Questions on modifying a vocabulary vs. training a LM from scratch #747

Closed · brijow closed 3 years ago

brijow commented 3 years ago

I have come across many similar issues asking how to add new tokens to a vocabulary; for reference, here are a couple of links to useful comments made for doing roughly that:

However, I am more concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and with whether or not it makes sense to consider removing tokens from a vocabulary.

Some context into my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

  1. train a tokenizer from scratch and then use that tokenizer to train an LM from scratch.
  2. modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with an objective like MLM.

I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.
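For what it's worth, my current understanding of the mechanical part of option 2 is roughly the sketch below (using the transformers API; the checkpoint name and added tokens are placeholders, not my actual data):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder checkpoint and tokens, just to illustrate the mechanics of option 2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Domain tokens assumed to be missing from the base vocabulary.
num_added = tokenizer.add_tokens(["myocarditis", "troponin"])

# Grow the embedding matrix so the new ids get (randomly initialized) rows.
model.resize_token_embeddings(len(tokenizer))
```

The part I am unsure about is which tokens are actually worth adding this way, and how much the randomly initialized embedding rows hurt the pretrained model.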

To summarize:

Thanks for the help. Sorry for a long question but I thought some context may be needed since I might be asking the wrong question in the first place. Cheers.

Narsil commented 3 years ago

Hi @brijow ,

There is no good way to add a token to an existing LM, as you wouldn't know what embedding makes the most sense for it. During fine-tuning the model would probably pick it up, but IMO it will most likely slow down the training, and be pretty bad if such new tokens are rare (which I imagine they could very well be). Worth trying if you're willing to put in the effort.

Removing tokens is just as tricky, as ids currently correspond to row indices within the embedding matrix, so removing a row shifts all following ids. Still doable (by reindexing the whole vocabulary), but I wouldn't even attempt it unless you are removing 75%+ of the vocabulary, as it's unlikely to lead to better performance before that point (most of the compute bottleneck is the actual model, and the embedding matrix doesn't even require that much RAM for a single language). Please note that for BPE, for instance, you need to update both the vocab AND the merges; Unigram might be slightly simpler.
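To make the BPE point concrete, here is a rough sketch of what lives inside a BPE tokenizer.json (the file name is illustrative, e.g. one saved from gpt2); both structures would have to be edited consistently:

```python
import json

# Illustrative file name; assumes a BPE tokenizer.json (e.g. saved from gpt2).
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # token -> id; ids are row indices in the embedding matrix
merges = tok["model"]["merges"]  # merge rules in priority order (exact format depends on the tokenizers version)

# Removing a token from `vocab` without removing the merges that produce it
# (and without reindexing every higher id) leaves the tokenizer inconsistent.
print(len(vocab), len(merges))
```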

Usually the recommended way is to keep the vocabulary as-is and simply fine-tune. Relevant tokens will be updated more often, and this will lead to better overall performance. Rare tokens will still benefit from the pre-training on general language (instead of being random and potentially destroying performance).
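For reference, that recommended path looks roughly like the sketch below, with the vocabulary left untouched (the checkpoint, corpus file and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder checkpoint and corpus file; the vocabulary stays exactly as pretrained.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```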

brijow commented 3 years ago

Thanks @Narsil. I'm just a little unclear about a couple of things you mention:

Thanks!

Narsil commented 3 years ago

Well, if you remove the token "abc" from the vocab but keep around the merge "a", "bc", you're likely to encounter issues (I am not sure exactly what, but it's definitely not intended by the library).

Yes, exactly.

brijow commented 3 years ago

Thank you, makes sense!

ptheru commented 3 years ago
> 2. modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with an objective like MLM.

I am in the same boat: why can't we repurpose the unused tokens in the vocab ([UNK])? Is there a way to replace '[UNK]' instead of extending the vocab file? https://github.com/google-research/bert/issues/9#issuecomment-434796704

kumarme072 commented 9 months ago

@ArthurZucker

ArthurZucker commented 9 months ago

You should be able to manually update the content of the tokenizer.json to map the id corresponding to an [unused] token to your new token. It kind of has to be manual, I fear.
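Something along these lines (a minimal sketch assuming a WordPiece tokenizer.json such as the one saved from bert-base-uncased; the token names are illustrative):

```python
import json

# Assumes a WordPiece tokenizer.json (e.g. saved from bert-base-uncased);
# "[unused0]" and "electrocardiogram" are illustrative names.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]        # token -> id
new_id = vocab.pop("[unused0]")      # free a reserved slot
vocab["electrocardiogram"] = new_id  # reuse its id, so the embedding matrix keeps its size

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)
```

Since the id is reused rather than appended, the model's embedding matrix does not need to be resized.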