huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast #1405

Closed · jonas-klesen closed this issue 5 months ago

jonas-klesen commented 7 months ago

Hi all.

The XGLM tokenizers are much too slow.

I did some simple measurements using the following script:

import time
import matplotlib.pyplot as plt
import numpy as np

from transformers import AutoTokenizer
tokenizer_xglm = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=True)
tokenizer_pythia = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=True)

tokenizer_xglm_slow = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=False)
tokenizer_pythia_slow = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=False)

text = "I went to the store and bought some milk. "

times_xglm = []
times_pythia = []

times_xglm_slow = []
times_pythia_slow = []

max=10000

for repeat in range(1000,max,1000):
    start = time.time()
    tokenizer_xglm(text * repeat)
    end = time.time()
    times_xglm.append(end - start)

    start2 = time.time()
    tokenizer_pythia(text * repeat)
    end2 = time.time()
    times_pythia.append(end2 - start2)

    # Add timing for slow tokenizers
    start3 = time.time()
    tokenizer_xglm_slow(text * repeat)
    end3 = time.time()
    times_xglm_slow.append(end3 - start3)

    start4 = time.time()
    tokenizer_pythia_slow(text * repeat)
    end4 = time.time()
    times_pythia_slow.append(end4 - start4)

# Create a numpy array for the lengths
lengths = np.arange(1000, max, 1000)

# Plot the time measurements
plt.figure(figsize=(10, 6))
plt.plot(lengths, times_xglm, label='xglm', color="blue")
plt.plot(lengths, times_pythia, label='pythia',color="red")
plt.plot(lengths, times_xglm_slow, label='xglm slow',color="darkblue")
plt.plot(lengths, times_pythia_slow, label='pythia slow', color="darkred") 
plt.xlabel('Repeats of text')
plt.ylabel('Time to tokenize (s)')
plt.legend()
plt.show()

The resulting plot looks like this: [plot: time to tokenize vs. number of repeats for the fast and slow XGLM and Pythia tokenizers]

Here is the environment this ran on: Python 3.9.17, transformers 4.28.0, tokenizers 0.13.3, sentencepiece 0.1.99, protobuf 3.20.0.

Since this environment is not up to date, I also ran the same script in a more recent one:

[plot: the same measurements in the updated environment]

In this case, the environment is: Python 3.11.5, transformers 4.31.0, tokenizers 0.13.3, sentencepiece 0.1.99, protobuf 4.25.1.

This was running on Windows, but I also saw the same behavior on a Linux system.

I am a bit baffled, but I really need to use the XGLM tokenizers. Any help is greatly appreciated!

Thanks, Jonas

ArthurZucker commented 7 months ago

The fast tokenizers can be slower if you don't run them in parallel, or if the input is a single very long sequence. See the following issue: https://github.com/huggingface/transformers/issues/25873
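
A minimal sketch of that point (the checkpoint is the one from this issue; the repeat count is arbitrary), comparing one concatenated string against the same data passed as a batch of short strings:

import time
from transformers import AutoTokenizer

# Any fast tokenizer works here; "facebook/xglm-564M" is the one from this issue.
tok = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=True)
text = "I went to the store and bought some milk. "

# One huge concatenated string: the backend sees a single very long sequence.
start = time.time()
tok(text * 5000)
print("single string:", time.time() - start)

# The same data as a batch of short strings: the Rust backend can
# process the entries in parallel.
start = time.time()
tok([text] * 5000)
print("batched list: ", time.time() - start)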

jonas-klesen commented 7 months ago

@ArthurZucker Thank you dearly for the answer!

But that is not the issue I am seeing. If you look at the graph I posted, both blue curves are the XGLM tokenizer (fast and non-fast versions) and they are equally slow.

This is especially striking compared to other tokenizers (I just picked one at random), i.e. the red lines (the fast and non-fast Pythia tokenizers).

Also, regarding the issue you posted, I actually tried the exact same thing, i.e. tokenizing wikitext as one large string, and it ran for over an hour before I stopped it, on a beefy machine at that.

Thus, I still believe that something is wrong with the XGLM tokenizers, both fast and non-fast.

Thank you again! Any further help on how to debug/solve this is appreciated.

jonas-klesen commented 7 months ago

Sorry for double-posting, but I would really appreciate an answer to this. Any idea, or at least a pointer on how to go about debugging and fixing this? @ArthurZucker

ArthurZucker commented 7 months ago

Hey! Inherently, the two tokenizers use different models: XGLM uses Unigram while Pythia uses BPE. The difference you are seeing comes from the fact that BPE is optimised to process very long sequences, while Unigram does not seem to be as efficient there.
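
A quick way to confirm which model backs each fast tokenizer (a minimal sketch; the expected type names are noted in the comments):

from transformers import AutoTokenizer

tok_xglm = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=True)
tok_pythia = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=True)

# backend_tokenizer exposes the underlying `tokenizers.Tokenizer`,
# and its .model attribute is the tokenization model.
print(type(tok_xglm.backend_tokenizer.model).__name__)    # expected: Unigram
print(type(tok_pythia.backend_tokenizer.model).__name__)  # expected: BPE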

The following modified script (the XGLM inputs are passed as a batch of short strings instead of one long concatenated string) will give:

[plot: timings for the modified script]
import time
import matplotlib.pyplot as plt
import numpy as np

from transformers import AutoTokenizer
tokenizer_xglm = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=True)
tokenizer_pythia = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=True)

tokenizer_xglm_slow = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=False)
tokenizer_pythia_slow = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=False)

text = "I went to the store and bought some milk. "

times_xglm = []
times_pythia = []

times_xglm_slow = []
times_pythia_slow = []

max=10000

for repeat in range(1000,max,1000):
    start = time.time()
-   tokenizer_xglm(text * repeat)
+   tokenizer_xglm([text] * repeat)
    end = time.time()
    times_xglm.append(end - start)

    start2 = time.time()
    tokenizer_pythia(text * repeat)
    end2 = time.time()
    times_pythia.append(end2 - start2)

    # Add timing for slow tokenizers
    start3 = time.time()
-   tokenizer_xglm_slow(text * repeat)
+   tokenizer_xglm_slow([text] * repeat)
    end3 = time.time()
    times_xglm_slow.append(end3 - start3)

    start4 = time.time()
    tokenizer_pythia_slow(text * repeat)
    end4 = time.time()
    times_pythia_slow.append(end4 - start4)

# Create a numpy array for the lengths
lengths = np.arange(1000, max, 1000)

# Plot the time measurements
plt.figure(figsize=(10, 6))
plt.plot(lengths, times_xglm, label='xglm', color="blue")
plt.plot(lengths, times_pythia, label='pythia',color="red")
plt.plot(lengths, times_xglm_slow, label='xglm slow',color="darkblue")
plt.plot(lengths, times_pythia_slow, label='pythia slow', color="darkred") 
plt.xlabel('Repeats of text')
plt.ylabel('Time to tokenize (s)')
plt.legend()
plt.show()

jonas-klesen commented 6 months ago

Thank you dearly for your repeated help. That is indeed what is going on.

I want PyTorch tensors out. If I do return_tensors='pt', I get an error due to padding, which is expected. In the end, I want the same result as passing the whole thing at once (or something very close to it). I think I have to set add_special_tokens=False for every chunk after the first, and add a space (as a token) between the parts when concatenating? I guess this is a common problem. Is there a canonical way to do this?

ArthurZucker commented 6 months ago

Yes, I think I would split the data and add padding if needed, but only to the last sentence. And yes, a space or adding a new token like <sep> can also help.
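
A rough sketch of that chunk-and-concatenate approach (the chunk size, the character-based splitting, and the helper name are all illustrative assumptions, not a canonical recipe):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=True)

def encode_long_text(text, chunk_chars=10_000):
    # Split the long document into character-based chunks. This is only a
    # sketch: splitting on whitespace or sentence boundaries would avoid
    # cutting words at chunk borders.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

    ids = []
    for i, chunk in enumerate(chunks):
        # Only the first chunk keeps the special tokens, so the concatenated
        # result approximates tokenizing the whole string in one call.
        ids.extend(tokenizer(chunk, add_special_tokens=(i == 0))["input_ids"])
    return torch.tensor([ids])

input_ids = encode_long_text("I went to the store and bought some milk. " * 10000)
print(input_ids.shape)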

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.