huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Reduce vocab size for BPE tokenizer #1668

fzyzcjy opened this issue 3 weeks ago

fzyzcjy commented 3 weeks ago

Hi, thanks for the library! I am using e.g. Llama 3.1's tokenizer, but its 128k vocab size is too large for my field. Thus, to make training faster, I would like to reduce the tokenizer vocab size by removing tokens that I will never use (e.g. words outside my field). However, it seems tokenizers does not provide a convenient method for this.

ArthurZucker commented 3 weeks ago

Hey! I'll add the feature request, as indeed we don't provide this out of the box. You also need to re-map the ids of the model embeddings, so it's a bit more involved.

If you directly modify the tokenizer.json, though, this can be achieved easily!

fzyzcjy commented 2 weeks ago

@ArthurZucker Thank you! Could you please provide a bit more detail? I was thinking about modifying tokenizer.json, but I am worried about the following:

For example, suppose I am only interested in the token hello, and suppose the merges are h e, l l, ll o, he llo (or something similar). If I throw away tokens like h, e, ll, ..., or throw away those merges, then I am worried I will never be able to produce the hello token.

My naive thought is to keep all the "parent" tokens (h, e, ll, ...) and their merges, but that keeps the vocab quite large. Is there a better way?
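
For concreteness, a toy sketch of the merge chain described above, built directly with the tokenizers BPE model (the vocab and merges here are made up for illustration, not Llama's actual ones):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# "hello" is only reachable through its intermediate merges, so the "parent"
# tokens (h, e, ll, llo, ...) have to stay in the vocab for it to be produced.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "llo": 6, "hello": 7}
merges = [("h", "e"), ("l", "l"), ("ll", "o"), ("he", "llo")]

tok = Tokenizer(BPE(vocab=vocab, merges=merges))
print(tok.encode("hello").tokens)  # ['hello'] -- only because every parent merge is present
```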

ArthurZucker commented 2 weeks ago

Well, you can switch to a non-BPE tokenizer, for example. One way to achieve that is to use added_tokens: you can remove h, e, ll, etc. and add hello as an added token.
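
A minimal sketch of that with the tokenizers Python API (the tokenizer.json path is a placeholder):

```python
from tokenizers import Tokenizer, AddedToken

tok = Tokenizer.from_file("tokenizer.json")  # placeholder path to your trimmed tokenizer

# Added tokens are matched as whole units before the BPE model runs,
# so "hello" no longer depends on the h / e / ll merge chain.
tok.add_tokens([AddedToken("hello", normalized=True)])
print(tok.encode("hello world").tokens)
```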

fzyzcjy commented 2 weeks ago

@ArthurZucker Thank you! However, I am afraid the tokenization of the same sentence may then be quite different. I am using a pretrained Llama (or something like that) and doing some SFT, so I would rather not make the tokenized results so wildly different that the model gets confused.

ArthurZucker commented 2 weeks ago

Yeah, I completely get it. It's kind of an open problem: how to effectively compress a tokenizer! The main issues are:

  1. Not all the merges are part of the vocab
  2. All tokens should be accessible with merges
  3. You don't necessarily need all merges from the vocab for your language

Here is what I would do:

  1. Train a new tokenizer on your language, setting the vocab-size limit you want.
  2. From the newly created vocab and merges, you know which tokens are needed for your language (see the sketch after this list).
  3. Remap these tokens / embeddings.
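
A minimal sketch of steps 1–2, assuming a fast tokenizer loaded through transformers and an in-domain corpus iterator (the checkpoint name and `corpus_iterator` are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint: the tokenizer the issue mentions (Llama 3.1).
old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def corpus_iterator():
    # Yield batches of in-domain text here.
    yield ["an example sentence from my field", "another one"]

# Trains a smaller tokenizer that reuses the original pipeline
# (normalizer, pre-tokenizer, post-processor, ...).
new_tok = old_tok.train_new_from_iterator(corpus_iterator(), vocab_size=10_000)
new_tok.save_pretrained("reduced-tokenizer")
```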

fzyzcjy commented 2 weeks ago

Thanks!

> Remap these tokens / embeddings.

I would appreciate a bit more detail. I am currently thinking about shrinking the tokenizer, e.g. picking 10,000 tokens from the original 128,000-token vocab. Then we can pick the corresponding rows of the embedding/lm_head weight matrices. It seems you are suggesting something even more involved: choosing tokens that may not even appear in the original vocab. In that case, I wonder how we should reuse the original embedding/lm_head.
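
For the row-selection idea, a minimal PyTorch sketch (assuming a transformers causal LM in `model`, and a placeholder list `keep_list` of old-vocab token ids to keep):

```python
import torch

keep_ids = torch.tensor(sorted(keep_list))                    # e.g. 10,000 ids out of 128,000
old_emb = model.get_input_embeddings().weight.data.clone()    # (128000, hidden)
old_head = model.get_output_embeddings().weight.data.clone()  # (128000, hidden)

# Shrink both matrices, then copy over the rows of the kept tokens
# (new id i corresponds to old id keep_ids[i]).
model.resize_token_embeddings(len(keep_ids))
model.get_input_embeddings().weight.data.copy_(old_emb[keep_ids])
model.get_output_embeddings().weight.data.copy_(old_head[keep_ids])
```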

ArthurZucker commented 2 weeks ago

What I am suggesting is the same as what you describe, but with a way of selecting those 10,000 tokens (training a new tokenizer on a relevant corpus), which should yield roughly the vocab you need, or at least be a good starting point.
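
One way to turn that into concrete ids, following on from the earlier sketches (the `old_tok` / `new_tok` / `keep_list` names are assumptions, not an existing API):

```python
old_vocab = old_tok.get_vocab()   # token string -> original id
new_vocab = new_tok.get_vocab()   # token string -> new id

# Tokens the newly trained tokenizer learned that already exist in the original
# vocab can reuse the original embedding rows; anything else would need a freshly
# initialized (or e.g. averaged) embedding.
shared = [t for t in new_vocab if t in old_vocab]
keep_list = sorted(old_vocab[t] for t in shared)
print(f"{len(shared)} of {len(new_vocab)} new tokens already exist in the original vocab")
```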

fzyzcjy commented 2 weeks ago

Thank you!