huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Questions on modifying a vocabulary vs. training a LM from scratch #747

Closed · brijow closed 3 years ago

brijow commented 3 years ago

I have come across many similar issues asking how to add new tokens to a vocabulary; for reference, here are a couple of links to useful comments made for doing roughly that:

However, I am more concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and with whether or not it makes sense to consider removing tokens from a vocabulary.

Some context into my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

  1. train a tokenizer from scratch and then use that tokenizer to train an LM from scratch.
  2. modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with an objective like MLM.

I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.
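For what it's worth, my current understanding of the mechanical part of option 2 is roughly the sketch below (using the transformers API; the checkpoint name and added tokens are placeholders, not my actual data):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder checkpoint and tokens, just to illustrate the mechanics of option 2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Domain tokens assumed to be missing from the base vocabulary.
num_added = tokenizer.add_tokens(["myocarditis", "troponin"])

# Grow the embedding matrix so the new ids get (randomly initialized) rows.
model.resize_token_embeddings(len(tokenizer))
```

The part I am unsure about is which tokens are actually worth adding this way, and how much the randomly initialized embedding rows hurt the pretrained model.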

To summarize:

Thanks for the help. Sorry for a long question but I thought some context may be needed since I might be asking the wrong question in the first place. Cheers.

Narsil commented 3 years ago

Hi @brijow ,

There is no good way to add a token to an existing LM, as you wouldn't know what embedding makes the most sense for it. During fine-tuning the model would probably pick it up, but IMO it will most likely slow down the training, and be pretty bad if such new tokens are rare (which I imagine they could very well be). Worth trying if you're willing to put in the effort.

Removing tokens is just as tricky, as ids currently correspond to row indices within the embedding matrix, so removing a row shifts all following ids. Still doable (by reindexing the whole vocabulary), but I wouldn't even attempt it unless you are removing 75%+ of the vocabulary, as it's unlikely to lead to better performance before that point (most of the compute bottleneck is the actual model, and the embedding matrix doesn't even require that much RAM for a single language). Please note that for BPE, for instance, you need to update both the vocab AND the merges; Unigram might be slightly simpler.
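To make the BPE point concrete, here is a rough sketch of what lives inside a BPE tokenizer.json (the file name is illustrative, e.g. one saved from gpt2); both structures would have to be edited consistently:

```python
import json

# Illustrative file name; assumes a BPE tokenizer.json (e.g. saved from gpt2).
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # token -> id; ids are row indices in the embedding matrix
merges = tok["model"]["merges"]  # merge rules in priority order (exact format depends on the tokenizers version)

# Removing a token from `vocab` without removing the merges that produce it
# (and without reindexing every higher id) leaves the tokenizer inconsistent.
print(len(vocab), len(merges))
```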

Usually the recommended way is to keep the vocabulary as-is and simply fine-tune. Relevant tokens will be updated more often, and this will lead to better overall performance. Rare tokens will still benefit from the pre-training on general language (instead of being random and potentially destroying performance).
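For reference, that recommended path looks roughly like the sketch below, with the vocabulary left untouched (the checkpoint, corpus file and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder checkpoint and corpus file; the vocabulary stays exactly as pretrained.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```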

brijow commented 3 years ago

Thanks @Narsil. I'm just a little unclear about a couple of things you mention:

Thanks!

Narsil commented 3 years ago

Well, if you remove the token "abc" from the vocab but keep around the merge "a", "bc", you're likely to encounter issues (I am not sure exactly what, but it's definitely not intended by the library).

Yes, exactly.

brijow commented 3 years ago

Thank you, makes sense!

ptheru commented 3 years ago
> 2. modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with an objective like MLM.

I am in the same boat: why can't we repurpose the unused tokens in the vocab ([UNK])? Is there a way to replace '[UNK]' instead of extending the vocab file? https://github.com/google-research/bert/issues/9#issuecomment-434796704

kumarme072 commented 9 months ago

@ArthurZucker

ArthurZucker commented 9 months ago

You should be able to manually update the content of the tokenizer.json to map the id corresponding to an [unused] token to your new token. It kind of has to be manual, I fear.
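Something along these lines (a minimal sketch assuming a WordPiece tokenizer.json such as the one saved from bert-base-uncased; the token names are illustrative):

```python
import json

# Assumes a WordPiece tokenizer.json (e.g. saved from bert-base-uncased);
# "[unused0]" and "electrocardiogram" are illustrative names.
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]        # token -> id
new_id = vocab.pop("[unused0]")      # free a reserved slot
vocab["electrocardiogram"] = new_id  # reuse its id, so the embedding matrix keeps its size

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)
```

Since the id is reused rather than appended, the model's embedding matrix does not need to be resized.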