Closed jonas-klesen closed 5 months ago
The fast tokenizers can be slower if you don't run them in parallel, or if the input sequence is very, very long but unique; see the following issue: https://github.com/huggingface/transformers/issues/25873
@ArthurZucker Thank you dearly for the answer!
But that is not the issue I am seeing. If you look at the graph I posted, both blue curves are the XGLMTokenizer (fast and non-fast versions), and they are equally slow.
Especially compared to other tokenizers (I just picked one at random), i.e. the red lines (fast and non-fast Pythia tokenizers).
Also, regarding the issue you posted: I actually tried the exact same thing, i.e. tokenizing wikitext as one large string, and it ran for over an hour before I stopped it. On a beefy machine, at that.
Thus, I still believe that something is wrong with the XGLM tokenizers, both fast and non-fast.
Thank you again! Any further help on how to debug/solve this is appreciated.
Sorry for double-posting. I would really appreciate an answer to this. Any idea, or at least a pointer on how to go about debugging and fixing this? @ArthurZucker
Hey! Inherently, the two tokenizers use different models: XGLM uses a Unigram model, while Pythia uses a BPE model. The difference you are seeing is because BPE is optimised to process very long sequences, while Unigram does not seem to be as efficient.
The following script illustrates this:
import time
import matplotlib.pyplot as plt
import numpy as np
from transformers import AutoTokenizer

tokenizer_xglm = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=True)
tokenizer_pythia = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=True)
tokenizer_xglm_slow = AutoTokenizer.from_pretrained("facebook/xglm-564M", use_fast=False)
tokenizer_pythia_slow = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m", use_fast=False)

text = "I went to the store and bought some milk. "
times_xglm = []
times_pythia = []
times_xglm_slow = []
times_pythia_slow = []
max_repeats = 10000
for repeat in range(1000, max_repeats, 1000):
    start = time.time()
    tokenizer_xglm([text] * repeat)  # batch of short strings
    end = time.time()
    times_xglm.append(end - start)
    start2 = time.time()
    tokenizer_pythia(text * repeat)  # one long concatenated string
    end2 = time.time()
    times_pythia.append(end2 - start2)
    # Add timing for slow tokenizers
    start3 = time.time()
    tokenizer_xglm_slow([text] * repeat)
    end3 = time.time()
    times_xglm_slow.append(end3 - start3)
    start4 = time.time()
    tokenizer_pythia_slow(text * repeat)
    end4 = time.time()
    times_pythia_slow.append(end4 - start4)

# Create a numpy array for the lengths
lengths = np.arange(1000, max_repeats, 1000)
# Plot the time measurements
plt.figure(figsize=(10, 6))
plt.plot(lengths, times_xglm, label='xglm', color="blue")
plt.plot(lengths, times_pythia, label='pythia', color="red")
plt.plot(lengths, times_xglm_slow, label='xglm slow', color="darkblue")
plt.plot(lengths, times_pythia_slow, label='pythia slow', color="darkred")
plt.xlabel('Repeats of text')
plt.ylabel('Time to tokenize (s)')
plt.legend()
plt.show()
Thank you dearly for your repeated help. That is indeed what is going on.
I want PyTorch tensors out. If I do return_tensors='pt', I get an error due to padding, which is expected. In the end, I want the same result as passing the whole thing at once (or something very close). I think I have to set add_special_tokens=False for all chunks after the first and add a space (as a token) in between the parts when concatenating? I guess this is a common problem. Is there a canonical way to do this?
Yes, I think I would split the data and add padding if needed, but only to the last sentence. Yes, a space or adding a new token like <sep> can also help.
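For reference, the chunk-and-concatenate approach discussed above could be sketched like this. This is a minimal sketch, not transformers API: a toy whitespace tokenizer stands in for the real XGLM tokenizer, and the names toy_tokenize, chunk_words, and tokenize_long are illustrative, not part of any library.

```python
# Sketch: tokenize a long text in word-boundary chunks and concatenate the
# ids, adding special tokens only for the first chunk. A toy whitespace
# tokenizer stands in for the real (slow) XGLM Unigram tokenizer.
BOS = 0   # stand-in for the tokenizer's special start-of-sequence token
VOCAB = {}  # toy vocabulary, filled on first sight of each word

def toy_tokenize(text, add_special_tokens=True):
    """Map each whitespace-separated word to a stable integer id."""
    ids = [VOCAB.setdefault(w, len(VOCAB) + 1) for w in text.split()]
    return ([BOS] + ids) if add_special_tokens else ids

def chunk_words(text, words_per_chunk):
    """Split on word boundaries so no token is cut in half."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def tokenize_long(text, words_per_chunk=4):
    """Tokenize chunk by chunk; only the first chunk gets special tokens."""
    ids = []
    for i, chunk in enumerate(chunk_words(text, words_per_chunk)):
        ids.extend(toy_tokenize(chunk, add_special_tokens=(i == 0)))
    return ids

text = "I went to the store and bought some milk"
# With the toy tokenizer, chunked and whole-text results match exactly.
assert tokenize_long(text) == toy_tokenize(text)
```

With a real subword tokenizer the equality is only approximate: splitting on whitespace keeps most pieces intact, but handling of spaces at chunk boundaries can still differ slightly, which matches the "very close" result mentioned above.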
Hi all.
The XGLM tokenizers are much too slow.
I did some simple measurements using the following script:
The resulting plot looks like this: ![image](https://github.com/huggingface/tokenizers/assets/19559859/1a95e314-ce61-4419-9933-83bf669366d3)
Here is the environment this ran on:
Python 3.9.17, transformers 4.28.0, tokenizers 0.13.3, sentencepiece 0.1.99, protobuf 3.20.0.
Since this is not up-to-date, I also ran the same script in a more up-to-date environment:
In this case, the environment is:
Python 3.11.5, transformers 4.31.0, tokenizers 0.13.3, sentencepiece 0.1.99, protobuf 4.25.1.
This was running on Windows, but I also saw the same behavior on a Linux system.
I am a bit baffled, but I really need to use the XGLM tokenizers. Any help is greatly appreciated!
Thanks, Jonas