huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Adding many AddedTokens makes loading a tokenizer extremely slow. #1635

Open stephantul opened 2 weeks ago

stephantul commented 2 weeks ago

Hi!

I'm not sure if this is a problem that can be solved, or needs to be solved. Basically, we want to make a kind of hybrid tokenizer, in which we add a whole bunch of whole words to a tokenizer, and select these words instead of the subwords if they appear.

For example: if we pass the pretokenized string ["dog", "walks", "around", "Paris"], and "Paris" is a whole token, we want to select it instead of decomposing it into subtokens. I think that adding Paris as an AddedToken is the right approach for this (but please correct me if I'm wrong.)

So, we added many of these tokens (about 400k), but this makes loading the tokenizer extremely slow: it takes 15-30 minutes to load. We now add them as regular tokens instead, which works fine, but has the downside that these whole-word tokens are also matched inside other words. For example, Parisians will now be turned into ["Paris", "##ians"], which might have a different meaning.
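The two behaviours described above can be contrasted with a small pure-Python sketch. It is not the tokenizers implementation: `wordpiece`, `hybrid_tokenize`, and the toy vocabulary are hypothetical names and data made up for illustration, and the greedy longest-match loop only approximates what a WordPiece model does.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word (simplified WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def hybrid_tokenize(words, subword_vocab, whole_words):
    """Emit a word as-is if it is a known whole word, else fall back to subwords."""
    out = []
    for word in words:
        if word in whole_words:
            out.append(word)
        else:
            out.extend(wordpiece(word, subword_vocab))
    return out


# Toy data, assumed for illustration only.
subword_vocab = {"dog", "walks", "around", "Par", "##is", "##ians"}
whole_words = {"Paris"}

# Desired hybrid behaviour: "Paris" is selected only as a complete word.
print(hybrid_tokenize(["dog", "walks", "around", "Paris"], subword_vocab, whole_words))
# ['dog', 'walks', 'around', 'Paris']
print(hybrid_tokenize(["Parisians"], subword_vocab, whole_words))
# ['Par', '##is', '##ians']

# Adding "Paris" as a regular vocabulary entry also matches it inside other words.
print(wordpiece("Parisians", subword_vocab | whole_words))
# ['Paris', '##ians']
```

The whole-word lookup is a set membership test per pretokenized word, so in this sketch its cost does not grow with the number of whole words.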

So my main question is: is there a reason why adding many AddedTokens is slow? Or is this just a path that hasn't been fully optimized yet?

Is using AddedTokens in this way simply wrong? Should we be trying something else?

Thanks! Stéphan

ArthurZucker commented 1 week ago

Hey! It depends on which API you are using! If you are using transformers, this is somewhat expected, since handling both special and non-special added tokens was hard. If you are using pure tokenizers, one factor is that we have to add a new regex match case for each new token.
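To make the "one match case per token" idea concrete, here is a rough sketch using Python's `re` module. It is only an illustration under assumptions: `build_matcher` and `split_on_added` are hypothetical helpers, and nothing here is claimed to mirror how tokenizers builds its matcher internally. The point is that the matcher depends on the full token list, so it has to be rebuilt whenever that list changes.

```python
import re


def build_matcher(tokens):
    """Compile one pattern with an alternative per token (assumes a non-empty list)."""
    # Longest tokens first, so a longer token wins over a shorter prefix of it.
    ordered = sorted(tokens, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # \b on both sides restricts matches to whole words.
    return re.compile(r"\b(?:" + alternatives + r")\b")


def split_on_added(text, matcher):
    """Split text into (segment, is_added_token) pairs."""
    out, pos = [], 0
    for m in matcher.finditer(text):
        if m.start() > pos:
            out.append((text[pos:m.start()], False))
        out.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        out.append((text[pos:], False))
    return out


matcher = build_matcher(["Paris"])
print(split_on_added("Paris and Parisians", matcher))
# [('Paris', True), (' and Parisians', False)]
```

Segments flagged `False` would then go through the regular model, while flagged segments are emitted as single tokens.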

ArthurZucker commented 1 week ago

If you want a better way, I would recommend adding them as regular tokens and making sure you add the merge rules! This means adding the merge paths that fuse the existing pieces into these tokens! This can be done automatically. If that is of interest to you, provide me with a reproducer with a model on the hub and I can help!
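One way to read "adding paths to fusing these tokens" for a BPE model is sketched below. This is an assumption about what is meant, not a confirmed recipe: `apply_bpe` and `merges_to_fuse` are hypothetical helpers, the merge table is toy data, and the BPE loop is simplified. The idea is to split the new word with the existing merges first, then append low-priority merges that fuse the resulting pieces left to right.

```python
def apply_bpe(word, merges):
    """Apply ranked merges (earlier = higher priority) to a non-empty word."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while len(pieces) > 1:
        # Find the adjacent pair with the best (lowest) rank.
        candidates = [
            (ranks.get(pair, float("inf")), i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no known merge applies any more
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces


def merges_to_fuse(word, merges):
    """Extra merges that fuse the current pieces of `word` into one token."""
    pieces = apply_bpe(word, merges)
    new_merges, left = [], pieces[0]
    for right in pieces[1:]:
        new_merges.append((left, right))
        left += right  # the intermediate token also needs a vocabulary entry
    return new_merges


# Toy merge table, assumed for illustration only.
merges = [("a", "r"), ("i", "s")]
print(apply_bpe("Paris", merges))           # ['P', 'ar', 'is']
extra = merges_to_fuse("Paris", merges)
print(extra)                                # [('P', 'ar'), ('Par', 'is')]
print(apply_bpe("Paris", merges + extra))   # ['Paris']
```

Note that the appended merges also fire inside longer words that contain the same pieces, so this route shares the "Paris" inside "Parisians" behaviour of regular tokens. It applies to BPE models only.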

stephantul commented 1 week ago

Hey @ArthurZucker, thanks for your response!

I'm using the pure tokenizers API. However, I am using a WordPiece tokenizer (actually just the baai/bge-base-en-v1.5 tokenizer, which AFAIK is just the original BERT tokenizer), not a BPE tokenizer. I see how adding merges to a BPE tokenizer could lead to a good solution though, so that's a cool idea.

My vocabulary is a list of 400k tokens (just the vocabulary of the GloVe vectors). Assuming vocab is a list of 400k strings, this already takes a lot of time:

```python
from tokenizers import Tokenizer

# `vocab` is assumed to be a list of ~400k strings (the GloVe vocabulary).
tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")
tok.add_tokens(vocab)
```

A one-off cost wouldn't really matter to me, but this cost is incurred every time the tokenizer is loaded from disk, which makes using it prohibitively expensive. I could maybe convert it to BPE, but I'm not sure if that makes sense.

I'll upload the resulting tokenizer once it's done, and post another comment.

Thanks!

stephantul commented 1 week ago

Here you go: https://huggingface.co/stephantulkens/large_tokenizer/tree/main