neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[text-gen] lock tokenizer call in process inputs #1547

Closed bfineran closed 6 months ago

bfineran commented 6 months ago

The tokenizer call in process_inputs of the text generation pipeline has been hitting a race condition in the tokenizer source code when receiving multiple concurrent requests. The issue is that a tokenizer call mutates the tokenizer's internal state, which leads to a conflict when multiple threads try to update the tokenizer at the same time.
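For illustration, a minimal sketch of how the race can be triggered; this repro is not from the PR, and the model name and kwargs are assumptions (the issue does not specify them). Hugging Face "fast" tokenizers wrap a shared Rust object, and the padding/truncation kwargs force a state-mutating set_truncation_and_padding() call, the same call visible in the traceback below:

```python
# Hypothetical repro sketch (model name and kwargs are assumptions):
# concurrent calls that mutate the fast tokenizer's shared truncation/padding
# state can intermittently raise "RuntimeError: Already borrowed".
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default


def encode(text: str):
    # padding/truncation force set_truncation_and_padding(), which mutably
    # borrows the underlying Rust tokenizer
    return tokenizer(text, padding="max_length", truncation=True, max_length=32)


with ThreadPoolExecutor(max_workers=8) as pool:
    # may fail intermittently, and per the issue, at machine-dependent points
    list(pool.map(encode, ["some prompt"] * 1000))
```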

Per @SageMoore, this happens consistently on certain machines, but at different points in the run.

Given that the tokenization step is relatively fast, we will first try locking the tokenizer call (see the sketch below). If this becomes a bottleneck, we can look into keeping multiple tokenizers (e.g. one per thread) and into potentially avoiding the state update altogether.
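A minimal sketch of the locking approach described above; the class shape, attribute name, and tokenizer kwargs are illustrative, not the exact code from this PR. The only part confirmed by the traceback is the self.tokenizer(...) call site in process_inputs:

```python
import threading


class ProcessInputs:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # One lock per pipeline serializes all tokenizer calls, so no two
        # threads can mutate the tokenizer's truncation/padding state at once.
        self._tokenizer_lock = threading.Lock()

    def run(self, text, sequence_length):
        # Tokenization is cheap relative to inference, so holding the lock
        # across the call is expected to cost little; if it becomes a
        # bottleneck, per-thread tokenizers are the fallback.
        with self._tokenizer_lock:
            input_tokens = self.tokenizer(
                text,
                return_tensors="np",
                max_length=sequence_length,
                padding="max_length",
                truncation=True,
            )
        return input_tokens
```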

Error snippet:

File "/home/sage/git/wand/deepsparse/src/deepsparse/pipeline.py", line 242, in run_async
    await outputs
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sage/git/wand/deepsparse/src/deepsparse/operators/operator.py", line 98, in __call__
    run_output = self.run(
  File "/home/sage/git/wand/deepsparse/src/deepsparse/transformers/pipelines/text_generation/process_inputs.py", line 84, in run
    input_tokens = self.tokenizer(
  File "/home/sage/fifth-local-deepsparse/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2802, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/sage/fifth-local-deepsparse/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2908, in _call_one
    return self.encode_plus(
  File "/home/sage/fifth-local-deepsparse/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2981, in encode_plus
    return self._encode_plus(
  File "/home/sage/fifth-local-deepsparse/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/home/sage/fifth-local-deepsparse/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 496, in _batch_encode_plus
    self.set_truncation_and_padding(
  File "/home/sage/fifth-local-deepsparse/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 467, in set_truncation_and_padding
    self._tokenizer.enable_padding(**target)
RuntimeError: Already borrowed