Well, that's really weird. An error originating in enable_truncation seems very unlikely, so I'm confused. Having a way to reproduce this would be ideal, but otherwise, if you can provide us with a stack trace, that would already be very helpful.
Here's the stack trace. The input for this is rather short (about 70 characters) and always the same (basically a health check), but I still haven't been able to reproduce it locally.
{
"error.culprit": "transformers.tokenization_utils_fast.set_truncation_and_padding",
"error.exception": {
"stacktrace": [
{
"filename": "transformers/tokenization_utils_base.py",
"line": {
"number": 2217,
"context": " return self.encode_plus("
},
"function": "__call__",
"module": "transformers.tokenization_utils_base",
"context": {
"pre": [" )", " else:"],
"post": [
" text=text,",
" text_pair=text_pair,"
]
},
"vars": {
"padding": false,
"is_split_into_words": true,
"is_batched": false,
"return_attention_mask": true,
"return_length": false,
"stride": 0,
"return_offsets_mapping": false,
"return_special_tokens_mask": "********",
"verbose": true,
"self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
"return_overflowing_tokens": "********",
"truncation": true,
"add_special_tokens": "********",
"max_length": 512
}
},
{
"filename": "transformers/tokenization_utils_base.py",
"line": {
"number": 2287,
"context": " return self._encode_plus("
},
"module": "transformers.tokenization_utils_base",
"function": "encode_plus",
"context": {
"pre": [" )", ""],
"post": [" text=text,", " text_pair=text_pair,"]
},
"vars": {
"padding": false,
"is_split_into_words": true,
"return_attention_mask": true,
"padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
"stride": 0,
"return_length": false,
"return_offsets_mapping": false,
"return_special_tokens_mask": "********",
"verbose": true,
"truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
"self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
"return_overflowing_tokens": "********",
"truncation": true,
"add_special_tokens": "********",
"max_length": 512
}
},
{
"filename": "transformers/tokenization_utils_fast.py",
"line": {
"number": 455,
"context": " batched_output = self._batch_encode_plus("
},
"module": "transformers.tokenization_utils_fast",
"function": "_encode_plus",
"context": {
"pre": [
"",
" batched_input = [(text, text_pair)] if text_pair else [text]"
],
"post": [
" batched_input,",
" is_split_into_words=is_split_into_words,"
]
},
"vars": {
"is_split_into_words": true,
"return_attention_mask": true,
"padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
"stride": 0,
"return_length": false,
"return_offsets_mapping": false,
"return_special_tokens_mask": "********",
"verbose": true,
"truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
"self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
"return_overflowing_tokens": "********",
"add_special_tokens": "********",
"max_length": 512
}
},
{
"filename": "transformers/tokenization_utils_fast.py",
"line": {
"number": 378,
"context": " self.set_truncation_and_padding("
},
"function": "_batch_encode_plus",
"module": "transformers.tokenization_utils_fast",
"context": {
"pre": [
"",
" # Set the truncation and padding strategy and restore the initial configuration"
],
"post": [
" padding_strategy=padding_strategy,",
" truncation_strategy=truncation_strategy,"
]
},
"vars": {
"is_split_into_words": true,
"return_attention_mask": true,
"padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
"return_length": false,
"stride": 0,
"return_offsets_mapping": false,
"return_special_tokens_mask": "********",
"verbose": true,
"truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
"self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
"return_overflowing_tokens": "********",
"max_length": 512,
"add_special_tokens": "********"
}
},
{
"exclude_from_grouping": false,
"library_frame": false,
"filename": "transformers/tokenization_utils_fast.py",
"abs_path": "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py",
"line": {
"number": 323,
"context": " self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)"
},
"module": "transformers.tokenization_utils_fast",
"function": "set_truncation_and_padding",
"context": {
"pre": [
" # Set truncation and padding on the backend tokenizer",
" if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE:"
],
"post": [
" else:",
" self._tokenizer.no_truncation()"
]
},
"vars": {
"self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
"padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
"stride": 0,
"truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
"max_length": 512
}
}
],
"handled": false,
"module": "builtins",
"message": "RuntimeError: Already borrowed",
"type": "RuntimeError"
}
}
I've just realized that this happens in transformers and not in tokenizers. Should I move the issue to the other repository? :grin:
Thank you very much @severinsimmler, this is very helpful. We can keep the issue open here since it is mostly related to this project, no worries!
I was not able to reproduce it, but I have an idea of how this could happen. Are you using this tokenizer from multiple Python threads? Can you share a bit more about the kind of production setup you have (multiple threads or processes, async, or anything like that)?
The application runs in a Docker container with gunicorn, launched like:
$ gunicorn --workers 1 --threads 2 --worker-class gthread
Alright, that's what I feared. This is happening because you have a single tokenizer that is used by 2 different threads. While the tokenizer is encoding on one thread, if the other thread tries to modify it, this error is raised because the tokenizer cannot be modified while it is being used.
I think the easiest way to fix it, for now, will be to ensure you have an instance of the tokenizer for each thread.
We should be able to fix this in transformers by making sure we update the truncation/padding info only if necessary (cc @LysandreJik @thomwolf).
And we should also be able to improve this error to make it clearer on the tokenizers side.
Good discussion. But I don't quite understand why this truncation/padding info has to be global. It could be passed as a parameter so that each tokenize call would be thread-safe.
The error still exists in transformers==4.3.2, tokenizers==0.10.1. I am using gunicorn (with threads) with flask, and the error shows up when parallel requests are made.
The problem does not exist in transformers==3.0.2, tokenizers==0.8.1.
Still there
This happens with TokenizerFast for me. My workaround is to not use it.
Did you try not sharing the tokenizer among multiple threads? (The easiest way is to load the tokenizer on each thread instead.)
There are some protections implemented, but there is only so much the lib can do against that.
How could I avoid that sharing?
Instead of loading the tokenizer before the thread fork, load it afterwards. If you use a torch Dataset for instance, that means loading the tokenizer in Dataset.__init__ instead of passing it in, e.g. something like the sketch below.
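For example, a minimal sketch of that suggestion (the model name and fields here are placeholders, not from this thread):

from torch.utils.data import Dataset
from transformers import AutoTokenizer

class TextDataset(Dataset):
    def __init__(self, texts, max_length=512):
        self.texts = texts
        self.max_length = max_length
        # Loaded here rather than passed in, so each worker builds its own instance.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )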
I am integrating it inside a tf.data.Dataset. I think it's a TF threading vs. TokenizerFast threading issue.
You can also disable threading in tokenizers altogether by setting the env variable TOKENIZERS_PARALLELISM=0 before launching your program; that might help.
Tried that buddy. Same issue :(
Any simple script to reproduce maybe ?
Sure Narsil.
import tensorflow as tf  # needed for tf.squeeze, tf.py_function, tf.data below
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

#### Dataset Pipeline
def create_tokenize(text):
    text = text.numpy().decode()
    inputs = tokenizer(text, add_special_tokens=True, padding=True, return_tensors='tf')
    return [tf.squeeze(inputs['input_ids']), tf.squeeze(inputs['attention_mask'])]

def create_data_map_fn_train(item):
    input_ids, input_mask = tf.py_function(create_tokenize, [item['text']], [tf.int32, tf.int32])
    result = {}
    result['input_ids'] = input_ids
    result['input_type_ids'] = tf.zeros_like(input_ids)
    result['input_mask'] = input_mask
    return result

texts = {'text': ['This is sentence 1',
                  'This is sentence 2',
                  'This is sentence 3',
                  'This is sentence 4']}

train_ds = tf.data.Dataset.from_tensor_slices(texts)
train_dataset = train_ds.map(create_data_map_fn_train, num_parallel_calls=tf.data.experimental.AUTOTUNE)

for item in train_dataset:
    print(item)
You're sharing the tokenizer across thread boundaries... Move the tokenizer declaration inside create_tokenize and everything will work fine.
I'm not familiar enough with tensorflow, but there's probably another way to instantiate the tokenizer only once (per thread).
Thanks. It works for small data. The moment we increase the size of the data it fails.
I guess it's because you keep instantiating the tokenizer that way; there really should be a way to have it once per thread. Another option would be to batch encode the tokens of your dataset first, THEN use them in a dataset (again, I'm not using TF enough to know the solution off the top of my head); see the sketch below.
It is the right way to go about it nonetheless, and the error you are seeing is desirable in a way, because you don't want contention around a single tokenizer. There should be very little overhead in having a tokenizer on every thread.
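A rough sketch of that batch-encode-first option (untested against a particular TF version; the max_length of 32 is arbitrary): tokenize everything up front in the main thread, then build the dataset from plain tensors so no tokenizer object is shared across tf.data threads.

import tensorflow as tf
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
texts = ["This is sentence 1", "This is sentence 2"]

# Batch-encode once, single-threaded, with fixed-length padding so the tensors are rectangular.
enc = tokenizer(texts, padding="max_length", max_length=32, truncation=True, return_tensors="tf")

train_ds = tf.data.Dataset.from_tensor_slices({
    "input_ids": enc["input_ids"],
    "input_type_ids": enc["token_type_ids"],
    "input_mask": enc["attention_mask"],
})

for item in train_ds:
    print(item)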
Could you try this:
from transformers import BertTokenizerFast
import tensorflow as tf

#### Dataset Pipeline
TOKENIZER = None

def get_tokenizer():
    global TOKENIZER
    if TOKENIZER is None:
        TOKENIZER = BertTokenizerFast.from_pretrained("bert-base-uncased")
    return TOKENIZER

def create_tokenize(text):
    tokenizer = get_tokenizer()
    text = text.numpy().decode()
    inputs = tokenizer(text, add_special_tokens=True, padding=True, return_tensors='tf')
    return [tf.squeeze(inputs['input_ids']), tf.squeeze(inputs['attention_mask'])]

def create_data_map_fn_train(item):
    input_ids, input_mask = tf.py_function(create_tokenize, [item['text']], [tf.int32, tf.int32])
    result = {}
    result['input_ids'] = input_ids
    result['input_type_ids'] = tf.zeros_like(input_ids)
    result['input_mask'] = input_mask
    return result

texts = {'text': ['This is sentence 1',
                  'This is sentence 2',
                  'This is sentence 3',
                  'This is sentence 4']}

train_ds = tf.data.Dataset.from_tensor_slices(texts)
train_dataset = train_ds.map(create_data_map_fn_train, num_parallel_calls=tf.data.experimental.AUTOTUNE)

for item in train_dataset:
    print(item)
It's a dirty hack, but it should work: TOKENIZER will be global but only set after the fork, so it'll end up being a thread-specific variable.
I can understand your effort, but it's failing. I think TF has some crazy stuff going on inside.
It fails when we have larger data. But I kind of solved it using tf.text, and it's so fast.
Do you mind sharing it for other users, maybe?
I will share it in a few days. It's messy and only useful for TF users, who I find are very few these days.
Hi, I have the same problem with gunicorn. For some models it works, but for others it fails. I notice a difference between the 2 models:
This fails:
self.token_indexer.encode(x, max_length=350, truncation=True)
This seems to work:
self.token_indexer.encode(x, truncation=True)
The tokenizer is loaded at startup in gunicorn. When I receive a request, I try to tokenize the batch of text (probably in another thread).
Is it because the set_truncation_and_padding function tries to modify the backend tokenizer (self._tokenizer), which is already owned by the first thread? In the second case (which works), the _tokenizer is not modified because max_length is at its default.
Could we pass this as an argument of the backend encoding function instead of modifying the backend tokenizer object?
Is using _tokenizer directly possible on your side (i.e. not calling tokenizer.encode anymore)?
transformers needs to maintain backward compatibility and is unlikely to change any of its API. tokenizers is a standalone project, so it probably won't make decisions just to accommodate transformers (except in very specific cases).
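For illustration, a rough sketch of what using the backend _tokenizer directly could look like (the model name is a placeholder; this relies on the private _tokenizer attribute, so treat it as a workaround rather than a supported API): configure truncation once at startup, then only call encode per request, which does not mutate the object.

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backend = hf_tokenizer._tokenizer  # the underlying tokenizers.Tokenizer

# Configure once, before any request-handling threads start.
backend.enable_truncation(max_length=512)

def encode(text):
    # Plain encode call; returns a tokenizers.Encoding with .ids, .attention_mask, etc.
    return backend.encode(text).ids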
It seems like a threading issue.
Side note: tf-text is much faster than the normal tokenizer, and somewhat faster than the fast tokenizer version.
tf-text: 6 seconds on 37000 texts of length 512.
tokenizer normal: 6 minutes on 37000 texts of length 512.
tokenizer fast: 1 minute on 37000 texts of length 512.
Does it do the same thing?
From the docs (https://www.tensorflow.org/tutorials/tensorflow_text/intro), it seems to be a simple whitespace split, not really a BPE or Unigram tokenizer. If this is the case, then it's perfectly normal; raw Python code might even be faster than tf.text. Anything I'm missing?
Yeah, tf.text has a BertTokenizer; it's whitespace + wordpiece. In general tf.text is faster, but the problem is that GPT2 and RoBERTa need a custom tokenizer.
And tf.text is only required if we want to make use of tf.data.Dataset to prepare data on the fly.
To be frank, preprocessing on the fly is something everyone is ignoring.
This is happening for me in the summarization pipelines as well. It's the same tokenizer error. I assume they're likely implemented in the same fashion as discussed in this thread.
@tyler-ground do you have an example to reproduce maybe ?
I am having the same problem. Simple reproduction would be:
@oborchers that's actually quite normal.
I would need to dive in to see exactly what's causing the underlying issue, but sharing the tokenizer across threads is not recommended; there are tentative safeguards in place, but they cannot always succeed. Usually we recommend giving each thread its own tokenizer (usually lightweight compared to models).
If you can provide a script (or docker image) that gives consistent errors, that would be helpful too, as it seems not trivial to reproduce consistently on our end.
Note that sharing the model across threads is also most likely going to lead to issues (as mentioned here: https://github.com/deepset-ai/haystack/issues/1228). This is not a trivial problem.
@Narsil - I can confirm the observation of @oborchers
I can reproduce with these two:
# server.py
from allennlp.predictors.predictor import Predictor
from fastapi import FastAPI

app = FastAPI()
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/transformer-qa.2021-02-11.tar.gz")

@app.get("/predict")
def predict_answer(passage: str, question: str):
    result = predictor.predict(
        passage=passage,
        question=question
    )
    return result["best_span_str"]

# client.py
import asyncio
import aiohttp

async def main():
    url = "http://localhost:8000/predict"
    params = dict(
        passage="The Matrix is a 1999 science fiction action film written and directed by The Wachowskis, starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano.",
        question="Who stars in The Matrix?",
    )
    coros = (fetch(url, params) for _ in range(2))
    await asyncio.gather(*coros)

async def fetch(url, params=None):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params) as response:
            print(await response.json())

if __name__ == "__main__":
    asyncio.run(main())
If you change the client to fetch only 1 coroutine, you do not hit the error. But if you have 2, you get RuntimeError: Already borrowed.
Thanks for providing a solid testing script @jackhodkinson
I have created a PR in transformers to reduce the number of such errors: https://github.com/huggingface/transformers/pull/12550
Unfortunately, there's no way to completely eliminate those errors without a major revamp of the encode function, as truncation and padding are part of the core struct of a tokenizer. I think it should cover 99% of the cases though, because padding and truncation options shouldn't be changed that often in reality.
Please read the PR for more details about what the problem is and how it attempts to solve it.
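Conceptually, the idea is roughly the following (a simplified sketch, not the actual PR code; it assumes a tokenizers version that exposes the truncation getter on the backend tokenizer): only mutate the backend when the requested settings differ from its current state.

def set_truncation_if_needed(backend, max_length, stride=0, strategy="longest_first"):
    # `backend` stands for the underlying tokenizers.Tokenizer instance.
    target = {"max_length": max_length, "stride": stride, "strategy": strategy}
    current = backend.truncation  # None when truncation is disabled
    if current is None or any(current.get(k) != v for k, v in target.items()):
        # Only this branch needs to borrow the tokenizer mutably.
        backend.enable_truncation(max_length, stride=stride, strategy=strategy)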
@jackhodkinson: Thank you very much for a reproducible example! @Narsil: Thanks for tackling the issue so super fast. Will check when back from holiday 💯
For those who may not be able to use the latest branch of this repository due to experimental work or other custom modifications: Wrapping the request into a mutex acquire/release statement does the job as well, as done here.
from threading import Lock

MUTEX = Lock()

MUTEX.acquire()
try:
    input_ids = self.tokenizer(...)
    output = self.model(...)
finally:
    MUTEX.release()
I want to add a comment to illustrate a specific example for which we found a workaround.
We also faced this error when running preprocessing on an aiohttp API with concurrent requests. Neither #12550 nor setting TOKENIZERS_PARALLELISM=0 helped with it.
Our preprocessing logic is made of 2 steps: first tokenize each sentence (without padding/truncation) to count tokens, then tokenize the selected sentences with padding to max_length.
Here is a full example to reproduce the error:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "0"
from concurrent.futures import ThreadPoolExecutor
from transformers import RobertaTokenizerFast
PARALLELISM = 2
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer/")
raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""
def preprocess_text(text, tokenizer, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(" ".join(sentences_to_keep),
                     padding='max_length',
                     max_length=max_length)

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text, tokenizer) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]
Even a parallelism of 2 is enough to trigger the RuntimeError: Already borrowed.
The workaround we found for this situation is to create 2 separate tokenizer instances, one for each truncation/padding configuration:
- tokenizer_a for tokenizing without padding/truncation
- tokenizer_b for tokenizing with padding to max_length
By changing the code as follows, we no longer get this error, even with more concurrency:
tokenizer_a = RobertaTokenizerFast.from_pretrained("./tokenizer/")
tokenizer_b = RobertaTokenizerFast.from_pretrained("./tokenizer/")
def preprocess_text(text, tokenizer_a, tokenizer_b, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer_a(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer_b(" ".join(sentences_to_keep),
                       padding='max_length',
                       max_length=max_length)

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text, tokenizer_a, tokenizer_b) for i in range(100)]
    return_value = [future.result() for future in futures]
I hope this may be helpful for some of you.
Yes, you cannot do this.
The tokenizer is thread-safe, but not meant to be used concurrently (hence the error, which says 2 threads are trying to access the same thing at the same time, which is not allowed).
import os
os.environ["TOKENIZERS_PARALLELISM"] = "0"
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer
PARALLELISM = 2
raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""
def preprocess_text(text, max_length=512):
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]

print(return_value)
This works for instance (each thread gets its own copy of the tokenizer).
In the case where you are reusing threads for more tasks:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "0"
import threading
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer
PARALLELISM = 2
raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""
TOKENIZER = {}
def get_tokenizer():
    _id = threading.get_ident()
    tokenizer = TOKENIZER.get(_id, None)
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
        TOKENIZER[_id] = tokenizer
    return tokenizer

def preprocess_text(text, max_length=512):
    tokenizer = get_tokenizer()
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]

print(return_value)

should work, and each thread will get its own tokenizer.
Sharing a tokenizer across threads is fixable but not desirable; it would just slow everything down, since we would most likely just mutex it, causing the threads to wait for each other. Given that tokenizers are relatively small objects, having each thread own its own seems better.
Lock-free sharing is just too complex for what it would bring (and it would prevent ANY modification of the underlying tokenizer, which is what you are doing without realizing it): tokenizer(...) and tokenizer(..., padding="max_length") need to modify the underlying object, since the padding strategy is part of it.
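A quick way to see that mutation (assuming a recent tokenizers version that exposes the padding getter on the backend tokenizer; the model name is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.backend_tokenizer.padding)  # None: no padding strategy configured yet
tok("hello", padding="max_length", max_length=8)
print(tok.backend_tokenizer.padding)  # now a dict describing the max_length padding strategy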
As a side note, another way to fix it (which I don't recommend) is:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "0"
from concurrent.futures import ThreadPoolExecutor
from transformers import RobertaTokenizerFast
PARALLELISM = 2
tokenizer = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer2 = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
# This mutates tokenizer2 to include the strategy before sharing
tokenizer2("test", padding="max_length", max_length=512)
raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""
def preprocess_text(text, tokenizer, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer2(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [
        executor.submit(preprocess_text, raw_text, tokenizer)
        for i in range(PARALLELISM)
    ]
    return_value = [future.result() for future in futures]

print(return_value)
Thanks @Narsil for further explanations and ideas.
I used threads to simplify the example, but in fact our use case uses asyncio with a thread pool. So it's even nastier to handle a pool of tokenizers, but it should be feasible if really needed. We don't need heavy parallel preprocessing, just a good response time with some concurrency from time to time.
In general, I would naturally expect a tokenizer(text) call to be stateless and thus independent between concurrent calls. Although I understand it's not possible given the current architecture of the fast tokenizer with its Rust backend.
It's about the choice that was made for padding_strategy.
Making it stateless means that every single call from Python to Rust needs to pass it along, meaning a string crosses the Python->Rust boundary on every single call. It turns out that Python -> Rust is not a free boundary; some work has to happen. We didn't make actual measurements, but it could be quite hurtful to make the Rust side purely stateless.
Since in most cases users use either padding or no padding (usually training vs inference), being stateful is correct in most cases. The last version showcases how to actually have only 2 effectively stateless tokenizers.
Hope that helps.
asyncio doesn't change anything about how your example should fail. It's the threading that's causing issues, not async (since tokenizers will block the thread anyway).
This may come incredibly late, but if you are working with micro-services and are willing to replace the direct call with a POST request, I would much rather suggest to:
All my scaling and threading headaches when working with this in pure fastapi/flask fashion have been resolved since then.
The easiest and least intrusive way IMHO is using a Python queue, which is thread-safe per se.
Let's assume you have N threads. Instead of creating one tokenizer instance per thread, you create M tokenizer instances, where M could be as low as 1 as a default value, which is equivalent to using a simple lock. During initialization you put the M instances of your tokenizer into the queue, and afterwards only use queue.get() and queue.put() when you need to access any of these instances.
The latter should be done inside a try: ... finally: block, so that a tokenizer is always guaranteed to be returned to the queue, e.g. in case of exceptions.
queue.get() will block the calling thread as long as there are no free tokenizer instances available, and will immediately unblock the thread when another thread puts a tokenizer back into the queue. As Python queues are FIFOs, it's also guaranteed that all elements inside the queue are used round-robin.
The necessary code is minimal and always thread-safe (see the sketch below), and you can decouple the number of your threads from the number of your tokenizer instances. This makes the resource usage very controllable as well.
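A minimal sketch of such a pool (the model name and the M/N values are placeholders):

import queue
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

M = 2  # number of tokenizer instances in the pool
N = 4  # number of worker threads

pool = queue.Queue()
for _ in range(M):
    pool.put(AutoTokenizer.from_pretrained("bert-base-uncased"))

def encode(text):
    tokenizer = pool.get()  # blocks until an instance is free
    try:
        return tokenizer(text, truncation=True, max_length=512)["input_ids"]
    finally:
        pool.put(tokenizer)  # always return the instance, even on exceptions

with ThreadPoolExecutor(max_workers=N) as executor:
    print(list(executor.map(encode, ["hello world"] * 10)))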
The problem with the above approach is that as long as M is less than N, there will be thread contention in heavy-load situations. Most normal operating systems don't guarantee that waiting threads are scheduled in FIFO order. This means there is no latency guarantee for, e.g., your gRPC or webserver thread to get hold of a tokenizer instance before another thread that queued up later. In most cases this is not an issue, but if your server is under heavy load, that's why you often see high latency spikes. There is a reason realtime operating systems exist that make those guarantees.
I.e. if you require strict latency deadlines, you need M == N.
The issue here is not a tokenizer bug; it's a misunderstanding by the user about the guarantees the tokenizer package makes in terms of multi-threading. If a package is not thread-safe, the user needs to take care of the consequences and shouldn't assume it's a bug in the package. Thread-safe operation has overheads, especially in interpreted languages like Python with a single GIL, and if you only want to use one thread, this overhead shouldn't be the default.
Did I understand correctly that it is not a bug, but rather a misunderstanding of the non-thread-safe nature of the Python->Rust boundary? Should it be closed then? Or maybe, as part of a fix, one should measure what it would cost to make the calls stateless?
I solved this problem using a thread lock in Python.

from threading import Lock

lock = Lock()

lock.acquire()
try:
    model.encode()
finally:
    lock.release()  # always release, even if encode() raises
We're using transformers (3.5.0) with a fast tokenizer (0.9.3) in production, but sometimes a RuntimeError with Already borrowed is raised (this might come from Rust's borrowing mechanisms?). This actually happens quite often, but I'm not yet sure why or how to reproduce it.
However, this is where the error is raised:
https://github.com/huggingface/tokenizers/blob/598ce61229c789465966682687fa12a90ec58074/bindings/python/py_src/tokenizers/implementations/base_tokenizer.py#L107-L123