huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Why is the tokenizer slower than tiktoken? #1519

Open BigBinnie opened 5 months ago

BigBinnie commented 5 months ago

Hi, I tried the GPT-2 tokenizer from both HF and tiktoken, and I found that tiktoken is faster than HF. Could you explain why this might happen?

[Screenshot, 2024-04-29: timing comparison of the HF GPT-2 tokenizer vs tiktoken]
ArthurZucker commented 5 months ago

Hey, could you share a reproducer? Part of the difference is that we keep track of offsets and a lot of other information, which tiktoken does not. But we could do that only when asked, and potentially improve speed.
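
A minimal sketch of the bookkeeping difference described above, assuming the "gpt2" checkpoint in transformers and tiktoken's "gpt2" encoding: the HF fast tokenizer can return per-token character offsets on top of the ids, while tiktoken returns only the ids.

import tiktoken
from transformers import GPT2TokenizerFast

hf_tok = GPT2TokenizerFast.from_pretrained("gpt2")
tt_enc = tiktoken.get_encoding("gpt2")

text = "Tokenizers keep track of offsets."

hf_out = hf_tok(text, return_offsets_mapping=True)
print(hf_out["input_ids"])       # token ids
print(hf_out["offset_mapping"])  # (start, end) character span for each token

print(tt_enc.encode(text))       # token ids only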

ArthurZucker commented 3 months ago

It's high on my priority list to do benchmarks and improve our code if needed!

BigBinnie commented 3 months ago

For HF, we use:

import time
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)  # elapsed seconds

For tiktoken, we just initialize the tokenizer with tiktoken; everything else is the same:

tokenizer = tiktoken.encoding_for_model("gpt-2")

Please let me know if you need any other information.
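
For completeness, a self-contained sketch of the tiktoken side of this comparison, assumed to mirror the HF snippet above (tiktoken.get_encoding("gpt2") is used here for the GPT-2 BPE encoding; the input text and timing are placeholders):

import time
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 BPE encoding
text = "xxx"  # placeholder, same input as the HF snippet
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)  # elapsed seconds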

ArthurZucker commented 3 months ago

You are using GPT2Tokenizer, which is the slow (pure-Python) one. Use GPT2TokenizerFast 😅
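
For reference, the suggested change amounts to swapping the class; a minimal sketch, assuming the same "gpt2" checkpoint and timing setup as above:

import time
from transformers import GPT2TokenizerFast

# Rust-backed ("fast") tokenizer; the rest of the benchmark stays the same.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "xxx"  # placeholder input
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)  # elapsed seconds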

ArthurZucker commented 2 months ago

We actually dived into it a bit:

  1. Rayon parallelism is kinda broken (a parallelism toggle sketch follows below)
  2. We have concurrency issues on the cache for GPT2
  3. We have memory allocations that are also slowing things down

With #1560 I was able to get performance similar to tiktoken, stay posted 😉
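
Not part of the fix referenced above, but a useful knob when benchmarking the Rayon point yourself: the tokenizers library honors the TOKENIZERS_PARALLELISM environment variable, so parallelism can be switched off or on for a measurement. A minimal sketch:

import os

# Set before the tokenizer is created/used so the setting takes effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # "true" to allow Rayon threads

from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
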
ArthurZucker commented 2 months ago

One thing though: tiktoken forces the splitting of very long sequences. If you split them into a batch yourself, you will already get quite a lot better performance.
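
A rough sketch of that batching idea (illustrative only; the chunk size and character-based split are assumptions, not what tiktoken does internally):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

long_text = "a very long document " * 100_000  # stand-in for one huge input

# Split the single long string into chunks and encode them as a batch, so the
# fast tokenizer can work on the chunks in parallel. Note that naive character
# splits can change token boundaries at the seams.
chunk_size = 10_000  # characters; arbitrary illustrative value
chunks = [long_text[i:i + chunk_size] for i in range(0, len(long_text), chunk_size)]
batch = tokenizer(chunks)  # one list of token ids per chunk in batch["input_ids"]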