huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.87k stars · 764 forks

RuntimeError: Already borrowed #537

Closed · severinsimmler closed this issue 2 months ago

severinsimmler commented 3 years ago

We're using transformers (3.5.0) with a fast tokenizer (0.9.3) in production, but sometimes a RuntimeError with the message Already borrowed is raised (this might come from Rust's borrowing mechanism?). This actually happens quite often, but I'm not yet sure why or how to reproduce it.

However, this is where the error is raised:

https://github.com/huggingface/tokenizers/blob/598ce61229c789465966682687fa12a90ec58074/bindings/python/py_src/tokenizers/implementations/base_tokenizer.py#L107-L123

n1t0 commented 3 years ago

Well, that's really weird. Such an error originating in enable_truncation seems very unlikely; I'm confused. Having a way to reproduce this would be ideal, but otherwise, if you can provide us with a stack trace, that would already be very helpful.

severinsimmler commented 3 years ago

Here's the stack trace. The input for this is rather short (about 70 characters) and always the same (basically a health check), but I still haven't been able to reproduce it locally.

{
  "error.culprit": "transformers.tokenization_utils_fast.set_truncation_and_padding",
  "error.exception": {
    "stacktrace": [
      {
        "filename": "transformers/tokenization_utils_base.py",
        "line": {
          "number": 2217,
          "context": "            return self.encode_plus("
        },
        "function": "__call__",
        "module": "transformers.tokenization_utils_base",
        "context": {
          "pre": ["            )", "        else:"],
          "post": [
            "                text=text,",
            "                text_pair=text_pair,"
          ]
        },
        "vars": {
          "padding": false,
          "is_split_into_words": true,
          "is_batched": false,
          "return_attention_mask": true,
          "return_length": false,
          "stride": 0,
          "return_offsets_mapping": false,
          "return_special_tokens_mask": "********",
          "verbose": true,
          "self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
          "return_overflowing_tokens": "********",
          "truncation": true,
          "add_special_tokens": "********",
          "max_length": 512
        }
      },
      {
        "filename": "transformers/tokenization_utils_base.py",
        "line": {
          "number": 2287,
          "context": "        return self._encode_plus("
        },
        "module": "transformers.tokenization_utils_base",
        "function": "encode_plus",
        "context": {
          "pre": ["        )", ""],
          "post": ["            text=text,", "            text_pair=text_pair,"]
        },
        "vars": {
          "padding": false,
          "is_split_into_words": true,
          "return_attention_mask": true,
          "padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
          "stride": 0,
          "return_length": false,
          "return_offsets_mapping": false,
          "return_special_tokens_mask": "********",
          "verbose": true,
          "truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
          "self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
          "return_overflowing_tokens": "********",
          "truncation": true,
          "add_special_tokens": "********",
          "max_length": 512
        }
      },
      {
        "filename": "transformers/tokenization_utils_fast.py",
        "line": {
          "number": 455,
          "context": "        batched_output = self._batch_encode_plus("
        },
        "module": "transformers.tokenization_utils_fast",
        "function": "_encode_plus",
        "context": {
          "pre": [
            "",
            "        batched_input = [(text, text_pair)] if text_pair else [text]"
          ],
          "post": [
            "            batched_input,",
            "            is_split_into_words=is_split_into_words,"
          ]
        },
        "vars": {
          "is_split_into_words": true,
          "return_attention_mask": true,
          "padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
          "stride": 0,
          "return_length": false,
          "return_offsets_mapping": false,
          "return_special_tokens_mask": "********",
          "verbose": true,
          "truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
          "self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
          "return_overflowing_tokens": "********",
          "add_special_tokens": "********",
          "max_length": 512
        }
      },
      {
        "filename": "transformers/tokenization_utils_fast.py",
        "line": {
          "number": 378,
          "context": "        self.set_truncation_and_padding("
        },
        "function": "_batch_encode_plus",
        "module": "transformers.tokenization_utils_fast",
        "context": {
          "pre": [
            "",
            "        # Set the truncation and padding strategy and restore the initial configuration"
          ],
          "post": [
            "            padding_strategy=padding_strategy,",
            "            truncation_strategy=truncation_strategy,"
          ]
        },
        "vars": {
          "is_split_into_words": true,
          "return_attention_mask": true,
          "padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
          "return_length": false,
          "stride": 0,
          "return_offsets_mapping": false,
          "return_special_tokens_mask": "********",
          "verbose": true,
          "truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
          "self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
          "return_overflowing_tokens": "********",
          "max_length": 512,
          "add_special_tokens": "********"
        }
      },
      {
        "exclude_from_grouping": false,
        "library_frame": false,
        "filename": "transformers/tokenization_utils_fast.py",
        "abs_path": "/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py",
        "line": {
          "number": 323,
          "context": "            self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)"
        },
        "module": "transformers.tokenization_utils_fast",
        "function": "set_truncation_and_padding",
        "context": {
          "pre": [
            "        # Set truncation and padding on the backend tokenizer",
            "        if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE:"
          ],
          "post": [
            "        else:",
            "            self._tokenizer.no_truncation()"
          ]
        },
        "vars": {
          "self": "PreTrainedTokenizerFast(name_or_path='/opt/model', vocab_size=250002, model_max_len=512, is_fast=True, ...",
          "padding_strategy": "<PaddingStrategy.DO_NOT_PAD: 'do_not_pad'>",
          "stride": 0,
          "truncation_strategy": "<TruncationStrategy.LONGEST_FIRST: 'longest_first'>",
          "max_length": 512
        }
      }
    ],
    "handled": false,
    "module": "builtins",
    "message": "RuntimeError: Already borrowed",
    "type": "RuntimeError"
  }
}
severinsimmler commented 3 years ago

I've just realized that this happens in transformers and not in tokenizers. Should I move the issue to the other repository? :grin:

n1t0 commented 3 years ago

Thank you very much @severinsimmler, this is very helpful. We can keep the issue open here since it is mostly related to this project, no worries!

I was not able to reproduce it, but I have an idea of how this could happen. Are you using this tokenizer from multiple Python threads? Can you share a bit more about the kind of production setup you have (multiple threads or processes, async, or anything like that)?

severinsimmler commented 3 years ago

The application runs in a Docker container with gunicorn like:

$ gunicorn --workers 1 --threads 2 --worker-class gthread
n1t0 commented 3 years ago

Alright, that's what I feared. This is happening because you have a single tokenizer that is used by 2 different threads. While the tokenizer is encoding on one thread, if the other thread tries to modify it, this error is raised because the tokenizer cannot be modified while it is being used.

I think the easiest way to fix it, for now, will be to ensure you have an instance of the tokenizer for each thread.

We should be able to fix this in transformers by making sure we update the truncation/padding info only if necessary (cc @LysandreJik @thomwolf). And we should also be able to improve this error on the tokenizers side to make it clearer.
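
For illustration, here is a minimal sketch of the "one tokenizer per thread" idea using thread-local storage (the model name is just a placeholder, and this is a user-side workaround, not the transformers fix itself):

import threading

from transformers import AutoTokenizer

_local = threading.local()

def get_tokenizer():
    # Each thread lazily loads and caches its own tokenizer, so no two
    # threads ever mutate the same underlying Rust object.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    return _local.tokenizer

def encode(text):
    return get_tokenizer()(text, truncation=True, max_length=512)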

hankcs commented 3 years ago

Good discussion. But I don't quite understand why this truncation/padding info has to be global. It could be passed as a parameter so that each tokenize call would be thread-safe.

djstrong commented 3 years ago

The error still exists in transformers==4.3.2, tokenizers==0.10.1. I am using gunicorn (with threads) with Flask, and the error shows up when parallel requests are made.

The problem does not exist in transformers==3.0.2, tokenizers==0.8.1.

s4sarath commented 3 years ago

Still there

s4sarath commented 3 years ago

This happens with TokenizerFast for me. My workaround is to not use it.

Narsil commented 3 years ago

Did you try not sharing the tokenizer among multiple threads? (The easiest way is to load the tokenizer on each thread instead.)

There are some protections implemented, but there is only so much the lib can do against that.

s4sarath commented 3 years ago

How could I avoid that sharing?

Narsil commented 3 years ago

Instead of loading the tokenizer before the thread fork, load it afterwards.

If you use a torch Dataset, for instance, this means loading the tokenizer in Dataset.__init__ instead of passing it in.

s4sarath commented 3 years ago

I am integrating it inside a tf.data.Dataset. I think it's a TF threading vs. TokenizerFast threading issue.

Narsil commented 3 years ago

You can also disable threading in tokenizers altogether by setting the env variable TOKENIZERS_PARALLELISM=0 before launching your program; that might help.
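
If it's easier to set from inside Python than from the shell, a minimal sketch (the variable just needs to be set before tokenization starts; everything else is unchanged):

import os

# Disable the Rust-side parallelism in tokenizers; equivalent to running
# `TOKENIZERS_PARALLELISM=0 python app.py` from the command line.
os.environ["TOKENIZERS_PARALLELISM"] = "0"

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")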

s4sarath commented 3 years ago

Tried that buddy. Same issue :(

Narsil commented 3 years ago

Any simple script to reproduce it, maybe?

s4sarath commented 3 years ago

Sure Narsil.

import tensorflow as tf
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

#### Dataset Pipeline
def create_tokenize(text):
    text = text.numpy().decode()
    inputs = tokenizer(text, add_special_tokens=True, padding=True, return_tensors='tf')
    return [tf.squeeze(inputs['input_ids']), tf.squeeze(inputs['attention_mask'])]

def create_data_map_fn_train(item):
    input_ids, input_mask = tf.py_function(create_tokenize,[ item['text']], [tf.int32,tf.int32])
    result = {}
    result['input_ids']  = input_ids
    result['input_type_ids'] = tf.zeros_like(input_ids)
    result['input_mask']  = input_mask

    return result

texts = {'text': ['This is sentence 1', 
        'This is entence 2', 
        'This is sentence 3', 
        'This is sentence 4']}

train_ds  = tf.data.Dataset.from_tensor_slices(texts)
train_dataset = train_ds.map(create_data_map_fn_train, num_parallel_calls =tf.data.experimental.AUTOTUNE)

for item in train_dataset:
    print(item)
Narsil commented 3 years ago

You're sharing the tokenizer across thread boundaries....

Move the tokenizer declaration inside create_tokenize and everything will work fine.

I'm not familiar enough with tensorflow, but there's probably another way to instantiate the tokenizer only once (per thread).
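
Concretely, that change is a drop-in replacement for create_tokenize in the script above (same imports assumed; re-creating the tokenizer on every call is wasteful, and a cached variant is sketched in the next comment):

def create_tokenize(text):
    # Instantiated inside the function, so each tf.data worker thread ends up
    # with its own tokenizer instead of sharing one across thread boundaries.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    text = text.numpy().decode()
    inputs = tokenizer(text, add_special_tokens=True, padding=True, return_tensors='tf')
    return [tf.squeeze(inputs['input_ids']), tf.squeeze(inputs['attention_mask'])]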

s4sarath commented 3 years ago

Thanks. It works for small data. The moment we increase the size of the data it fails.

Narsil commented 3 years ago

I guess it's because you keep re-instantiating the tokenizer that way; there really should be a way to have it once per thread. Another option would be to batch-encode your dataset first, THEN use it in a dataset (again, I don't use TF enough to know the solution off the top of my head).

It is the right way to go about it nonetheless, and the error you are seeing is desirable in a way, because you don't want contention around a single tokenizer. There should be very little overhead to having one on every thread.

Could you try this:

from transformers import BertTokenizerFast
import tensorflow as tf

#### Dataset Pipeline

TOKENIZER = None
def get_tokenizer():
    global TOKENIZER
    if TOKENIZER is None:
        TOKENIZER = BertTokenizerFast.from_pretrained("bert-base-uncased")
    return TOKENIZER

def create_tokenize(text):
    tokenizer = get_tokenizer()
    text = text.numpy().decode()
    inputs = tokenizer(text, add_special_tokens=True, padding=True, return_tensors='tf')
    return [tf.squeeze(inputs['input_ids']), tf.squeeze(inputs['attention_mask'])]

def create_data_map_fn_train(item):
    input_ids, input_mask = tf.py_function(create_tokenize,[ item['text']], [tf.int32,tf.int32])
    result = {}
    result['input_ids']  = input_ids
    result['input_type_ids'] = tf.zeros_like(input_ids)
    result['input_mask']  = input_mask

    return result

texts = {'text': ['This is sentence 1', 
        'This is entence 2', 
        'This is sentence 3', 
        'This is sentence 4']}

train_ds  = tf.data.Dataset.from_tensor_slices(texts)
train_dataset = train_ds.map(create_data_map_fn_train, num_parallel_calls =tf.data.experimental.AUTOTUNE)

for item in train_dataset:
    print(item)

It's a dirty hack, but it should work: TOKENIZER will be global but only set after the fork, so it'll end up being a thread-specific variable.

s4sarath commented 3 years ago

I can understand your effort, but it's failing.

I think TF has some crazy stuff going on inside.

s4sarath commented 3 years ago

It fails when we have larger data. But I kind of solved it using tf.text, and it's so fast.

Narsil commented 3 years ago

Do you mind sharing it for other users, maybe?

s4sarath commented 3 years ago

I will share it in a few days. It's messy and only useful for TF users, who I find are very few these days.

gbmarc1 commented 3 years ago

Hi, I have the same problem with gunicorn. For some models it works, but for others it fails. I noticed a difference between the 2 models:

This fails:

self.token_indexer.encode(x, max_length=350, truncation=True)

This seems to work:

self.token_indexer.encode(x, truncation=True)

The tokenizer is loaded at startup in gunicorn. When I receive a request, I try to tokenize the batch of text (probably in another thread). Is it because the set_truncation_and_padding function tries to modify the backend tokenizer (self._tokenizer), which is already owned by the first thread? In the second case (which works), the _tokenizer is not modified because max_length is left at its default.

Could we pass this as an argument of the backend encoding function instead of modifying the backend tokenizer object?

Narsil commented 3 years ago

Is using _tokenizer directly possible on your end? (i.e. don't call tokenizer.encode anymore)

transformers needs to maintain backward compatibility and is unlikely to change any of its API. tokenizers is a standalone project, so it probably won't make decisions just to accommodate transformers (except in very specific cases).
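
For illustration, a rough sketch of what using the backend tokenizer directly could look like for a fast tokenizer (backend_tokenizer exposes the underlying tokenizers.Tokenizer; truncation is configured once up front instead of on every call):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backend = tokenizer.backend_tokenizer  # the underlying tokenizers.Tokenizer

# Configure truncation once, before any threads start encoding.
backend.enable_truncation(max_length=350)

# In the request path, only encode; the shared state is never touched again.
encoding = backend.encode("some input text")
print(encoding.ids)
print(encoding.attention_mask)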

s4sarath commented 3 years ago

It seems like a threading issue.

Side note: tf-text is much faster than the normal tokenizer, and somewhat faster than the fast tokenizer as well.

tf-text: 6 seconds on 37,000 texts of length 512.
tokenizer (normal): 6 minutes on 37,000 texts of length 512.
tokenizer (fast): 1 minute on 37,000 texts of length 512.

Narsil commented 3 years ago

Does it do the same thing?

From the docs, it seems to be a simple whitespace split, not really a BPE or Unigram tokenizer: https://www.tensorflow.org/tutorials/tensorflow_text/intro If this is the case, then it's perfectly normal; raw Python code might even be faster than tf.text. Am I missing anything?

s4sarath commented 3 years ago

Yeah.

tf.text has a BertTokenizer. It's whitespace + WordPiece. In general tf.text is faster, but the problem is that GPT-2 and RoBERTa need a custom tokenizer.

And tf.text is only required if we want to make use of tf.data.Dataset to prepare data on the fly.

To be frank, preprocessing on the fly is something everyone is ignoring.

tyler-ground commented 3 years ago

This is happening for me in the summarization pipelines as well. It's the same tokenizer error. I assume they're likely implemented in the same fashion as discussed in this thread.

Narsil commented 3 years ago

@tyler-ground do you have an example to reproduce it, maybe?

oborchers commented 3 years ago

I am having the same problem. Simple reproduction would be:

Narsil commented 3 years ago

@oborchers that's actually quite normal.

I would need to dig in to see exactly what's causing the underlying issue, but sharing the tokenizer across threads is not recommended; there are tentative safeguards in place, but they cannot always succeed. We usually recommend giving each thread its own tokenizer (which is lightweight compared to a model).

If you can provide a script (or Docker image) that gives consistent errors, that would be helpful too, as it seems non-trivial to reproduce consistently on our end.

Note that sharing the model across threads is most likely also going to lead to issues (as mentioned here: https://github.com/deepset-ai/haystack/issues/1228). This is not a trivial problem.

jackhodkinson commented 3 years ago

@Narsil - I can confirm the observation of @oborchers

I can reproduce with these two:

# server.py
from allennlp.predictors.predictor import Predictor
from fastapi import FastAPI

app = FastAPI()
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/transformer-qa.2021-02-11.tar.gz")

@app.get("/predict")
def predict_answer(passage: str, question: str):
    result = predictor.predict(
        passage=passage,
        question=question
    )
    return result["best_span_str"]

# client.py
import asyncio

import aiohttp

async def main():
    url = "http://localhost:8000/predict"
    params = dict(
        passage="The Matrix is a 1999 science fiction action film written and directed by The Wachowskis, starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano.",
        question="Who stars in The Matrix?",
    )
    coros = (fetch(url, params) for _ in range(2))
    await asyncio.gather(*coros)

async def fetch(url, params=None):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params) as response:
            print(await response.json())

if __name__ == "__main__":
    asyncio.run(main())

If you change the client to fetch only 1 coroutine, you do not hit the error. But if you run 2, you get RuntimeError: Already borrowed.

Narsil commented 3 years ago

Thanks for providing a solid testing script, @jackhodkinson. I have created a PR in transformers to reduce the number of such errors: https://github.com/huggingface/transformers/pull/12550

Unfortunately, there's no way to completely eliminate those errors without a major revamp of the encode function, as truncation and padding are part of the core struct of a tokenizer. I think it should cover 99% of the cases though, because padding and truncation options shouldn't change that often in reality.

Please read the PR for more details about what the problem is and how it attempts to solve it.
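
The gist of the approach, as a paraphrased sketch rather than the actual diff (it assumes the truncation getter on the backend tokenizer returns the current settings as a dict, or None):

def set_truncation_if_needed(backend_tokenizer, max_length, stride, strategy):
    # Only mutate (i.e. mutably borrow) the Rust tokenizer when the requested
    # configuration differs from the current one, so concurrent encode calls
    # rarely collide with a configuration change.
    current = backend_tokenizer.truncation  # dict of current settings, or None
    target = {"max_length": max_length, "stride": stride, "strategy": strategy}
    if current is None or any(current.get(k) != v for k, v in target.items()):
        backend_tokenizer.enable_truncation(**target)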

oborchers commented 3 years ago

@jackhodkinson: Thank you very much for the reproducible example! @Narsil: Thanks for tackling the issue so quickly. Will check when back from holiday 💯

oborchers commented 3 years ago

For those who may not be able to use the latest branch of this repository due to experimental work or other custom modifications: Wrapping the request into a mutex acquire/release statement does the job as well, as done here.

from threading import Lock
MUTEX = Lock()

MUTEX.acquire()
try:
    input_ids = self.tokenizer(...)
    output = self.model(...)
finally:
    MUTEX.release()
radcheb commented 1 year ago

I want to add a comment to illustrate a specific example for which we found a workaround. We also faced this error when running preprocessing on an aiohttp API with concurrent requests. Neither #12550 nor setting TOKENIZERS_PARALLELISM=0 helped with it. Our preprocessing logic is made of 2 steps: first, tokenize the text sentence by sentence to count tokens and keep only the sentences that fit within max_length; then, tokenize the kept sentences with padding to max_length.

Here is a full example to reproduce the error:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

from concurrent.futures import ThreadPoolExecutor
from transformers import RobertaTokenizerFast

PARALLELISM = 2
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer/")

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, 
and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance. The 
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""

def preprocess_text(text, tokenizer, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(" ".join(sentences_to_keep),
                     padding='max_length',
                     max_length=max_length)

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text, tokenizer) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]

Even a parallelism of 2 is enough to trigger the RuntimeError: Already borrowed.

The workaround we found for this situation is to create 2 separate tokenizer instances, one for each truncation/padding configuration.

By changing the code as follows, we no longer see this error, even with more concurrency:

tokenizer_a = RobertaTokenizerFast.from_pretrained("./tokenizer/")
tokenizer_b = RobertaTokenizerFast.from_pretrained("./tokenizer/")

def preprocess_text(text, tokenizer_a, tokenizer_b, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer_a(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer_b(" ".join(sentences_to_keep),
                       padding='max_length',
                       max_length=max_length)

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text, tokenizer_a, tokenizer_b) for i in range(100)]
    return_value = [future.result() for future in futures]

I hope this may be helpful for some of you.

Narsil commented 1 year ago

Yes, you cannot do this.

tokenizer is thread-safe, but not meant to be used concurrently (hence the error, which says that 2 threads are trying to access the same thing at the same time, which is not allowed).

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

PARALLELISM = 2

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""

def preprocess_text(text, max_length=512):
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]
    print(return_value)

This works for instance (each thread gets its own copy of the tokenizer).

In the case where you are reusing threads for more tasks:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

import threading
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

PARALLELISM = 2

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, 
and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance. The 
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""

TOKENIZER = {}

def get_tokenizer():
    _id = threading.get_ident()
    tokenizer = TOKENIZER.get(_id, None)
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
        TOKENIZER[_id] = tokenizer
    return tokenizer

def preprocess_text(text, max_length=512):
    tokenizer = get_tokenizer()
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [executor.submit(preprocess_text, raw_text) for i in range(PARALLELISM)]
    return_value = [future.result() for future in futures]
    print(return_value)

should work, and each thread will get its own tokenizer.

Sharing a tokenizer across threads is fixable but not desirable; it would just slow everything down, since we would likely just put a mutex around it, causing each thread to wait its turn. Given that tokenizers are relatively small objects, giving each thread its own seems better.

Lock-free sharing is just too complex for what it would bring (and it would prevent ANY modification of the underlying tokenizer, which is what you are doing without realizing it).

tokenizer(...) and tokenizer(..., padding="max_length") need to modify the underlying object since the padding strategy is part of it.
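
You can observe that state directly (a small sketch; the exact contents of the returned dict may vary between versions, and in recent versions the configuration persists on the backend after the call):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.backend_tokenizer.padding)  # None: no padding configured yet

tok("hello", padding="max_length", max_length=8)
print(tok.backend_tokenizer.padding)  # now a dict (e.g. with 'length': 8):
                                      # the padding strategy lives on the Rust object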

As a side note, another way to fix it (which I don't recommend) is:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

from concurrent.futures import ThreadPoolExecutor
from transformers import RobertaTokenizerFast

PARALLELISM = 2
tokenizer = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer2 = RobertaTokenizerFast.from_pretrained("xlm-roberta-base")
# This mutates tokenizer2 to include the strategy before sharing
tokenizer2("test", padding="max_length", max_length=512)

raw_text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Contrary to popular belief, Lorem Ipsum is not simply random text. 
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, 
and going through the cites of the word in classical literature, discovered the undoubtable source. 
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. 
This book is a treatise on the theory of ethics, very popular during the Renaissance. The 
first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""

def preprocess_text(text, tokenizer, max_length=512):
    sentences = text.split("\n")
    sentences_to_keep = []
    sentences_to_keep_nbr_token = 0
    for sentence in sentences:
        tokens_nbr = len(tokenizer(text, return_special_tokens_mask=True)["input_ids"])
        if sentences_to_keep_nbr_token + tokens_nbr <= max_length:
            sentences_to_keep.append(sentence)
            sentences_to_keep_nbr_token += tokens_nbr
        else:
            break
    return tokenizer2(
        " ".join(sentences_to_keep), padding="max_length", max_length=max_length
    )

with ThreadPoolExecutor(max_workers=PARALLELISM) as executor:
    futures = [
        executor.submit(preprocess_text, raw_text, tokenizer)
        for i in range(PARALLELISM)
    ]
    return_value = [future.result() for future in futures]
    print(return_value)
radcheb commented 1 year ago

Thanks @Narsil for further explanations and ideas.

I used threads to simplify the example, but in fact our use case uses asyncio with a thread pool. So it's even nastier to handle a pool of tokenizers, but it should be feasible if really needed. We don't need heavy parallel preprocessing, just good response times with some concurrency from time to time.

In general, I would naturally expect a `tokenizer(text)` call to be stateless and therefore independent between concurrent calls, although I understand that's not possible given the current architecture of the fast tokenizer with its Rust backend.

Narsil commented 1 year ago

It's about the design choice that was made for padding_strategy.

Making it stateless means that every single call from Python to Rust needs to pass the strategy along, meaning there is a string crossing the Python->Rust boundary on every single call.

It turns out that Python -> Rust is not a free boundary; some conversions have to happen. We didn't make actual measurements, but it could hurt quite a bit to make the Rust side purely stateless.

Since in most cases users use either padding or no padding (usually training vs. inference), being stateful is fine in most cases. The last version showcases how to effectively get 2 "stateless" tokenizers by fixing their configuration up front.

Hope that helps.

asyncio doesn't change anything about how your example fails. It's the threading that's causing issues, not async (since tokenizers will block the thread anyway).

oborchers commented 1 year ago

This may come incredibly late, but if you are working with micro-services and are willing to replace the direct call with a POST request, I would much rather suggest the following:

All my scaling and threading headaches from working with this in pure FastAPI/Flask fashion have been resolved since then.

lumpidu commented 8 months ago

The easiest and least intrusive way, IMHO, is using a Python queue, which is thread-safe by design.

Let's assume you have N threads. Instead of creating one tokenizer instance per thread, you create M tokenizer instances, where M could be as small as 1 (which is equivalent to using a simple lock). During initialization you put the M tokenizer instances into the queue, and afterwards you only use queue.get() and queue.put() when you need to access any of them.

The get/put should be done inside a try: ... finally: block, so that a tokenizer is always guaranteed to be returned to the queue, e.g. in case of exceptions.

queue.get() will block the calling thread as long as no free tokenizer instance is available, and will immediately unblock it when another thread puts a tokenizer back into the queue. As Python queues are FIFOs, it's also guaranteed that all elements in the queue are used round-robin.

The necessary code is minimal and always thread-safe, and you can decouple the number of threads from the number of tokenizer instances. This makes resource usage very controllable as well.
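
A minimal sketch of that pattern (the model name and M are placeholders):

import queue

from transformers import AutoTokenizer

M = 2  # number of tokenizer instances, independent of the number of threads

# Fill the pool once at startup.
tokenizer_pool = queue.Queue()
for _ in range(M):
    tokenizer_pool.put(AutoTokenizer.from_pretrained("bert-base-uncased"))

def encode(text, max_length=512):
    # Blocks until one of the M instances is free; FIFO order means the
    # instances are used round-robin.
    tokenizer = tokenizer_pool.get()
    try:
        return tokenizer(text, truncation=True, padding="max_length", max_length=max_length)
    finally:
        # Always hand the instance back, even if tokenization raised.
        tokenizer_pool.put(tokenizer)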

Pros & Cons:

The problem with the above approach is that as long as M is less than N, there will be thread contention under heavy load. Most general-purpose operating systems don't guarantee that waiting threads are scheduled in FIFO order. This means there is no latency guarantee that, e.g., your gRPC or webserver thread gets hold of a tokenizer instance before another thread that hit the queue later. In most cases this is not an issue, but if your server is under heavy load, it's why you often see high latency spikes. There is a reason realtime operating systems exist that make those guarantees.

I.e., if you require strict latency guarantees, you need M == N.

Side note

The issue here is not a tokenizers bug; it's a misunderstanding on the user's side about the multi-threading guarantees the tokenizers package makes. If a package is not thread-safe, the user needs to deal with the consequences rather than assume it's a bug in the package, because thread-safe operation has overhead, especially in an interpreted language like Python with a GIL, and if you only want to use one thread, that overhead shouldn't be the default.

strategy155 commented 7 months ago

Did I understand correctly that this is not a bug, but rather a misunderstanding of the non-thread-safe nature of the Python->Rust boundary? Should it be closed then? Or maybe, as part of a fix, someone should measure what it would cost to make the calls stateless?

KelleyYin commented 4 months ago

I solved this problem using a thread lock in Python.

from threading import Lock

lock = Lock()

# "with" acquires the lock and always releases it, even if encode() raises.
with lock:
    model.encode()
github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.