triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Python InferenceServerClient issue when call close() from __del__ #63

Closed lionsheep0724 closed 4 months ago

lionsheep0724 commented 5 months ago

Description

I got an error related to gevent when serving PyTriton with faster-whisper 0.10.0. I found a similar issue in Triton, but the solutions I found there were not clear: https://github.com/triton-inference-server/pytriton/issues/56

To reproduce

A minimal example that reproduces the error:

# server (imports, constants, and argument defaults added so the snippet runs standalone)
import argparse
import io

import numpy as np
import torch
from faster_whisper import WhisperModel

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

BEAM_SIZE = 5       # value assumed for this reproduction
MAX_BATCH_SIZE = 8  # matches the "batch size : 8" log line below

# The original script read these ports from the command line; defaults are Triton's standard ports.
parser = argparse.ArgumentParser()
parser.add_argument("--http-port", type=int, default=8000)
parser.add_argument("--grpc-port", type=int, default=8001)
parser.add_argument("--metrics-port", type=int, default=8002)
args = parser.parse_args()

faster_whisper_model = WhisperModel("/workspace/faster-whisper")

@batch
def faster_whisper_infer_fn(**inputs: np.ndarray):
    (audio_packet,) = inputs.values()
    audio_binary = io.BytesIO(audio_packet)
    segments, info = faster_whisper_model.transcribe(audio_binary, language="ko", beam_size=BEAM_SIZE)

    tokens = []
    for segment in segments:
        tokens += segment.tokens
    tokens_tensor: np.ndarray = torch.tensor([tokens], dtype=torch.int64).cpu().numpy()
    return {"tokens": tokens_tensor}

with Triton(
    config=TritonConfig(http_port=args.http_port, grpc_port=args.grpc_port, metrics_port=args.metrics_port)
) as triton:
    triton.bind(
        model_name="FasterWhisper",
        infer_func=faster_whisper_infer_fn,
        inputs=[Tensor(name="audio_input", dtype=np.uint8, shape=(-1,))],
        outputs=[
            Tensor(name="tokens", dtype=np.int64, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=MAX_BATCH_SIZE),
        strict=True,
    )
    triton.serve()
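
For completeness, here is a minimal client sketch that exercises the model bound above (assuming the server is reachable on the HTTP port passed to TritonConfig; the URL and sample file name are assumptions):

# client (minimal sketch; the URL and sample file are illustrative)
import numpy as np
from pytriton.client import ModelClient

with ModelClient("localhost:8000", "FasterWhisper") as client:
    audio = np.frombuffer(open("sample.wav", "rb").read(), dtype=np.uint8)
    result = client.infer_sample(audio_input=audio)
    print(result["tokens"])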

Observed results and expected behavior


Observed results when the server starts up:

==========
== CUDA ==
==========

CUDA Version 11.7.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2024-02-06 06:01:54,788 - INFO - pytriton.triton: Read more about configuring and serving models in documentation: https://triton-inference-server.github.io/pytriton.
2024-02-06 06:01:54,789 - INFO - pytriton.triton: (Press CTRL+C or use the command `kill -SIGINT 1` to send a SIGINT signal and quit)
2024-02-06 06:01:54,789 - INFO - pytriton_single_pipeline: Loading STT model with batch size : 8
Exception ignored in: <function InferenceServerClient.__del__ at 0x7f751d7e85e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/http/_client.py", line 199, in __del__
2024-02-06 06:01:56,438 - INFO - pytriton_single_pipeline: Serving inference
    self.close()
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/http/_client.py", line 206, in close
    self._pool.join()
  File "/usr/local/lib/python3.10/dist-packages/gevent/pool.py", line 430, in join
    result = self._empty_event.wait(timeout=timeout)
  File "src/gevent/event.py", line 163, in gevent._gevent_cevent.Event.wait
  File "src/gevent/_abstract_linkable.py", line 509, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 206, in gevent._gevent_c_abstract_linkable.AbstractLinkable._capture_hub
gevent.exceptions.InvalidThreadUseError: (<Hub '' at 0x7f749f418720 epoll default pending=0 ref=0 fileno=10 resolver=<gevent.resolver.thread.Resolver at 0x7f748c5fdf30 pool=<ThreadPool at 0x7f748c5f8c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x7f749f418720 thread_ident=0x7f75c0637000>>> threadpool=<ThreadPool at 0x7f748c5f8c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x7f749f418720 thread_ident=0x7f75c0637000>> thread_ident=0x7f75c0637000>, None, <greenlet.greenlet object at 0x7f7493dd7b80 (otid=0x7f7493dd0450) current active started main>)
Exception ignored in: <function InferenceServerClient.__del__ at 0x7f751d7e85e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/http/_client.py", line 199, in __del__
    self.close()
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/http/_client.py", line 206, in close
    self._pool.join()
  File "/usr/local/lib/python3.10/dist-packages/gevent/pool.py", line 430, in join
    result = self._empty_event.wait(timeout=timeout)
  File "src/gevent/event.py", line 163, in gevent._gevent_cevent.Event.wait
  File "src/gevent/_abstract_linkable.py", line 509, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 206, in gevent._gevent_c_abstract_linkable.AbstractLinkable._capture_hub
gevent.exceptions.InvalidThreadUseError: (<Hub '' at 0x7f749f418720 epoll default pending=0 ref=0 fileno=10 resolver=<gevent.resolver.thread.Resolver at 0x7f748c5fdf30 pool=<ThreadPool at 0x7f748c5f8c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x7f749f418720 thread_ident=0x7f75c0637000>>> threadpool=<ThreadPool at 0x7f748c5f8c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x7f749f418720 thread_ident=0x7f75c0637000>> thread_ident=0x7f75c0637000>, None, <greenlet.greenlet object at 0x7f7493dd7b80 (otid=0x7f7493dd0450) current active started main>)

Environment

Additional context

Please refer to my Dockerfile:

# Use Ubuntu 22.04 as base image
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04

# Install necessary packages
RUN apt update -y && apt install -y --fix-missing software-properties-common

# Add repository with various Python versions
RUN add-apt-repository ppa:deadsnakes/ppa -y

# Install Python 3.10 and required libraries
RUN apt install -y --fix-missing python3.10 libpython3.10 python3.10-distutils python3-pip \
     build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev \
     libffi-dev curl libbz2-dev pkg-config make

# install dependencies
COPY ./requirements.txt /workspace/requirements.txt
RUN pip install -r  /workspace/requirements.txt

# Install openai whisper
RUN python3.10 -m pip install -U openai-whisper

# Install ffmpeg
RUN apt-get -y update && \
    apt-get -y install locales libsndfile-dev curl ffmpeg

# Install nvidia-pytriton using pip
RUN python3.10 -m pip install nvidia-pytriton==0.5.0
RUN python3.10 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

WORKDIR /workspace

# Install transformers (4.34.0)
RUN python3.10 -m pip install /workspace/audio_process/transformers

lionsheep0724 commented 5 months ago

@piotrm-nvidia Thank you for looking into this issue. However, I don't understand why it happens in this particular case. The other setup, with the Hugging Face Whisper model, works fine; the only difference is whether faster-whisper is bound or not.

lionsheep0724 commented 5 months ago

@piotrm-nvidia Updates here. Simply commenting out `from faster_whisper import WhisperModel` makes everything work fine. I guess some dependency or thread-related object in ctranslate2 (used by faster-whisper) causes the issue.

piotrm-nvidia commented 5 months ago

Thank you for providing further details on the issue you're experiencing.

The core of the problem lies in how the Triton client interacts with gevent, particularly in scenarios involving multi-threaded operations. The Triton client incorporates a __del__ method that is responsible for cleaning up and closing connections. When the ModelClient, which utilizes the Triton client, attempts to close the connection, it does so in the appropriate thread. However, the Python garbage collector may invoke the __del__ method in a different thread at a later time. Gevent does not support operations across multiple threads by default. This leads to the InvalidThreadUseError you're encountering, as gevent detects and disallows the cross-thread operation.

Although the exception is being ignored and might seem benign, it understandably causes confusion and concern. It's important to note that this issue is specific to the interaction between gevent and the Triton client's cleanup process. The problem you've observed with the faster-whisper model, as opposed to the huggingface whisper model, suggests that certain dependencies or thread-related objects might exacerbate this issue by affecting the threading context in which the Triton client operates.

We are actively working to address this issue to prevent such confusing behavior in the future and to ensure a smoother operation with libraries that utilize gevent or similar concurrency mechanisms.
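
In the meantime, one pattern that avoids relying on the finalizer is to close the gevent-backed client explicitly in the thread that created it, for example (a minimal sketch; the URL is an assumption):

# minimal sketch: create and close the tritonclient HTTP client in the same thread,
# so close() runs against the gevent hub owned by that thread
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
try:
    print(client.is_server_live())
finally:
    client.close()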

lionsheep0724 commented 5 months ago

@piotrm-nvidia Many thanks for your reply! As far as I understand, the root cause is that the __del__ method is invoked from a different thread, because the threading context of faster-whisper affects Triton's. After the ModelClient communicates with the Triton client to obtain the model and other metadata (such as batch size) and the connection is closed, __del__ ends up being called from an arbitrary thread in faster-whisper or elsewhere. Am I right in understanding that the current PyTriton version cannot handle multi-threaded libraries/frameworks?

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been stalled for 7 days with no activity.

piotrm-nvidia commented 4 months ago

The issue with PyTriton's multi-threading support was addressed in release 0.5.1 with a temporary workaround: these messages are now logged at INFO level only, so they no longer clutter the log as warnings. The underlying issue in tritonclient has also been fixed in its own repository, which will provide a permanent fix in a future PyTriton release. Multi-threading remains supported in older PyTriton versions, but users may see these warning messages.