triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Enabling Redis cache throws: Unable to find shared library libtritonserver.so #46

Closed zbloss closed 6 months ago

zbloss commented 7 months ago

Description

Running the Docker image (below) with Redis cache settings enabled via cache_config causes the Triton server to fail on launch.

When I comment out both the cache_config and cache_directory options, the server starts and runs successfully, but it does not use the Redis cache.

When I uncomment the cache_config and cache_directory options, the server fails to find the libtritonserver.so shared library.

To reproduce


FROM nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3

WORKDIR /app

COPY . .

RUN pip install poetry

# Build https://github.com/triton-inference-server/redis_cache

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y python3 python3-distutils python-is-python3 git \
    build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev curl \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
    openssh-client cmake rapidjson-dev

RUN git clone https://github.com/triton-inference-server/redis_cache.git && \
    cd redis_cache && \
    ./build.sh && \
    mkdir -p /opt/tritonserver/caches/redis && \
    cp /app/redis_cache/build/libtritoncache_redis.so /opt/tritonserver/caches/redis/libtritoncache_redis.so

RUN cd triton-server && \
    poetry config virtualenvs.create false && \
    poetry install 

# HTTP, gRPC, & metrics traffic respectively
EXPOSE 8000 \
       8001 \
       8002

Fails with "Unable to find shared library libtritonserver.so":

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

import os
import logging
import torch
import numpy as np
from dotenv import load_dotenv
load_dotenv()

logger = logging.getLogger("triton-server.triton_server.main")

REDIS_HOST: str = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT: str = os.getenv('REDIS_PORT', '6379')
MAX_BATCH_SIZE: int = int(os.getenv('MAX_BATCH_SIZE', '8'))
VERBOSE: bool = bool(os.getenv('VERBOSE', 'True'))
TRITON_SERVER_HOST: str = os.getenv('TRITON_SERVER_HOST', '0.0.0.0')
TRITON_SERVER_PORT: str = os.getenv('TRITON_SERVER_PORT', '8000')

triton_config = TritonConfig(
    cache_config=[f"redis,host={REDIS_HOST}", f"redis,port={REDIS_PORT}"],
    http_address=TRITON_SERVER_HOST,
    http_port=TRITON_SERVER_PORT,
    cache_directory="/opt/tritonserver/caches",
)

def load_model() -> tuple:
    ...
    return model, tokenizer

@batch
def inference(sequence_of_text: np.ndarray):
    ...
    return [last_hidden_states]

def main(model_hidden_dim: int = 768):

    log_level = logging.DEBUG if VERBOSE else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    with Triton(config=triton_config) as triton:
        logger.info("Loading model...")
        triton.bind(
            model_name='model',
            infer_func=inference,
            inputs=[
                Tensor(name="sequence_of_text", dtype=bytes, shape=(1, )),
            ],
            outputs=[
                Tensor(
                    name="last_hidden_state", dtype=np.float32, shape=(-1, model_hidden_dim)
                ),
            ],
            config=ModelConfig(max_batch_size=MAX_BATCH_SIZE, response_cache=True),
            strict=True,
        )
        logger.info("Serving inference")
        triton.serve()

if __name__ == "__main__":
    model, tokenizer = load_model()
    main(model.config.dim)

Does not fail, but does not use the Redis cache:

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

import os
import logging
import torch
import numpy as np
from dotenv import load_dotenv
load_dotenv()

logger = logging.getLogger("triton-server.triton_server.main")

REDIS_HOST: str = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT: str = os.getenv('REDIS_PORT', '6379')
MAX_BATCH_SIZE: int = int(os.getenv('MAX_BATCH_SIZE', '8'))
VERBOSE: bool = bool(os.getenv('VERBOSE', 'True'))
TRITON_SERVER_HOST: str = os.getenv('TRITON_SERVER_HOST', '0.0.0.0')
TRITON_SERVER_PORT: str = os.getenv('TRITON_SERVER_PORT', '8000')

triton_config = TritonConfig(
    # cache_config=[f"redis,host={REDIS_HOST}", f"redis,port={REDIS_PORT}"],
    http_address=TRITON_SERVER_HOST,
    http_port=TRITON_SERVER_PORT,
    # cache_directory="/opt/tritonserver/caches",
)

def load_model() -> tuple:
    ...
    return model, tokenizer

@batch
def inference(sequence_of_text: np.ndarray):
    ...
    return [last_hidden_states]

def main(model_hidden_dim: int = 768):

    log_level = logging.DEBUG if VERBOSE else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    with Triton(config=triton_config) as triton:
        logger.info("Loading model...")
        triton.bind(
            model_name='model',
            infer_func=inference,
            inputs=[
                Tensor(name="sequence_of_text", dtype=bytes, shape=(1, )),
            ],
            outputs=[
                Tensor(
                    name="last_hidden_state", dtype=np.float32, shape=(-1, model_hidden_dim)
                ),
            ],
            config=ModelConfig(max_batch_size=MAX_BATCH_SIZE, response_cache=True),
            strict=True,
        )
        logger.info("Serving inference")
        triton.serve()

if __name__ == "__main__":
    model, tokenizer = load_model()
    main(model.config.dim)


piotrm-nvidia commented 7 months ago

I appreciate your feedback and I’m sorry to hear that you are having trouble with PyTriton and Redis cache. I have some suggestions that might help you resolve the issue.

First, please make sure that you are using the latest version of PyTriton, which is 0.4.1 as of now.
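
If you want to double-check which version is installed inside your container, here is a minimal sketch using only the standard library (it assumes the package was installed under the distribution name nvidia-pytriton):

# Print the installed PyTriton version, or a notice if it is missing.
# Assumes the package was installed as "nvidia-pytriton".
from importlib.metadata import version, PackageNotFoundError

try:
    print(version("nvidia-pytriton"))
except PackageNotFoundError:
    print("nvidia-pytriton is not installed in this environment")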

I tried to reproduce your issue with PyTriton 0.4.1 on Ubuntu 22.04 (AMD64), and the Redis cache works exactly as expected: the first inference is processed by the model, and all further identical requests are served from the cache.

I used interactive Docker mode and ipython to run all the steps, so it is easier to inspect what is going on. Please use my reproduction path as a reference for your own testing.

Reproduction path

Docker was started in interactive mode:

docker run -ti --network=host --platform linux/amd64 --ulimit core=-1 --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --cap-add=SYS_PTRACE --shm-size 2G nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3  bash

Install poetry

root:/opt/tritonserver# pip install poetry

Install dependencies

root:/opt/tritonserver# DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y python3 python3-distutils python-is-python3 git \
    build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev curl \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
    openssh-client cmake rapidjson-dev

Make folder app:

mkdir /app
cd /app/

Clone redis cache:

git clone https://github.com/triton-inference-server/redis_cache.git

Compile redis cache:

root:/app/redis_cache# ./build.sh 

Copy the cache library from the build folder to the Triton caches directory (creating the subdirectory first, as in the Dockerfile above):

root:/app/redis_cache# mkdir -p /opt/tritonserver/caches/redis
root:/app/redis_cache# cp /app/redis_cache/build/libtritoncache_redis.so /opt/tritonserver/caches/redis/libtritoncache_redis.so
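
A quick Python sanity check that the library landed where Triton expects it (the path simply mirrors the cp command above):

# The Redis cache library must sit under <cache_directory>/redis/ for Triton to load it.
from pathlib import Path

cache_lib = Path("/opt/tritonserver/caches/redis/libtritoncache_redis.so")
print(cache_lib, "exists:", cache_lib.exists())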

Add repository tools for the Redis installation:

apt install lsb-release curl gpg

Add redis repository:

curl -fsSL https://packages.redis.io/gpg | gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/redis.list

Install Redis:

apt-get update
apt-get install redis

Start redis:

redis-server
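
To confirm that Redis is accepting connections before starting Triton, here is a minimal check from Python (it assumes the redis client package is installed, e.g. pip install redis; redis-cli ping works just as well):

# Ping the local Redis server started above.
# Assumes the "redis" Python package is installed (pip install redis).
import redis

client = redis.Redis(host="localhost", port=6379)
print("Redis PING ->", client.ping())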

Install the current version of PyTriton and ipython:

pip install nvidia-pytriton
pip install ipython

Start ipython:

Enable logging:

import logging

logging.basicConfig(
    level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
)

Create inference callable:

import time
import numpy as np
def _infer_fn(requests):
    # Demo callable: decode the input text and return its first
    # whitespace-separated token as the response.
    #print(requests)
    if len(requests) > 1:
        raise Exception("Only one request is supported")
    request = requests[0]
    text = np.char.decode(request["text"].astype("bytes"), "utf-8").item()
    for response in text.split():
        return_value = {
            "text": np.char.encode(response, "utf-8"),
        }
        return [return_value]
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

Prepare config with redis:

triton_config = TritonConfig(
    cache_config=["redis,host=localhost", "redis,port=6379"],
    cache_directory="/opt/tritonserver/caches",
)

Prepare server configuration:

triton = Triton(config=triton_config)
triton.bind(
    model_name="Test",
    infer_func=_infer_fn,
    inputs=[
        Tensor(name="text", dtype=bytes, shape=(-1,)),
    ],
    outputs=[
        Tensor(name="text", dtype=bytes, shape=(-1,)),
    ],
    config=ModelConfig(max_batch_size=1, response_cache=True),
)

Run Triton:

triton.run()

Log from starting Triton:

2023-11-30 10:29:38,466 - DEBUG - pytriton.triton: Preparing Triton Inference Server binaries and libs for execution.
2023-11-30 10:29:38,492 - DEBUG - pytriton.triton: Triton Inference Server binaries copied to /root/.cache/pytriton/workspace__lhrs25_/tritonserver without stubs.
2023-11-30 10:29:38,492 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
2023-11-30 10:29:38,492 - DEBUG - pytriton.utils.distribution: Obtained pytriton stubs path for 3.10: /usr/local/lib/python3.10/dist-packages/pytriton/tritonserver/python_backend_stubs/3.10/triton_python_backend_stub
2023-11-30 10:29:38,492 - DEBUG - pytriton.triton: Copying stub for version 3.10 from /usr/local/lib/python3.10/dist-packages/pytriton/tritonserver/python_backend_stubs/3.10/triton_python_backend_stub to /root/.cache/pytriton/workspace__lhrs25_/tritonserver/backends/python/triton_python_backend_stub
2023-11-30 10:29:38,494 - DEBUG - pytriton.triton: Triton Inference Server binaries ready in /root/.cache/pytriton/workspace__lhrs25_/tritonserver
2023-11-30 10:29:38,494 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
2023-11-30 10:29:38,494 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
2023-11-30 10:29:38,494 - DEBUG - pytriton.utils.distribution: pytriton is installed in editable mode: False
2023-11-30 10:29:38,494 - DEBUG - pytriton.utils.distribution: Obtained nvidia_pytriton.libs path: /usr/local/lib/python3.10/dist-packages/nvidia_pytriton.libs
2023-11-30 10:29:38,495 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:38,495 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:38,495 - DEBUG - pytriton.triton: Starting Triton Inference
2023-11-30 10:29:38,495 - DEBUG - pytriton.server.triton_server: Triton Server binary /root/.cache/pytriton/workspace__lhrs25_/tritonserver/bin/tritonserver. Environment:
{
    "NPP_VERSION": "12.2.1.4",
    "SHELL": "/bin/bash",
    "NVIDIA_VISIBLE_DEVICES": "all",
    "DALI_BUILD": "9783408",
    "CUSOLVER_VERSION": "11.5.2.141",
    "CUBLAS_VERSION": "12.2.5.6",
    "HOSTNAME": "piotrmubuntu2204",
    "DCGM_VERSION": "2.4.7",
    "NVIDIA_REQUIRE_CUDA": "cuda>=9.0",
    "CUFFT_VERSION": "11.0.8.103",
    "CUDA_CACHE_DISABLE": "1",
    "NCCL_VERSION": "2.19.3",
    "CUSPARSE_VERSION": "12.1.2.141",
    "ENV": "/etc/shinit_v2",
    "PWD": "/opt/tritonserver",
    "OPENUCX_VERSION": "1.15.0",
    "NSIGHT_SYSTEMS_VERSION": "2023.3.1.92",
    "NVIDIA_DRIVER_CAPABILITIES": "compute,utility,video",
    "POLYGRAPHY_VERSION": "0.49.0",
    "TF_ENABLE_WINOGRAD_NONFUSED": "1",
    "TRT_VERSION": "8.6.1.6+cuda12.0.1.011",
    "NVIDIA_PRODUCT_NAME": "Triton Server",
    "RDMACORE_VERSION": "39.0",
    "HOME": "/root",
    "LS_COLORS": "rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:",
    "CUDA_VERSION": "12.2.2.009",
    "CURAND_VERSION": "10.3.3.141",
    "TCMALLOC_RELEASE_RATE": "200",
    "CUTENSOR_VERSION": "1.7.0.1",
    "TRITON_SERVER_GPU_ENABLED": "1",
    "HPCX_VERSION": "2.16rc4",
    "LESSCLOSE": "/usr/bin/lesspipe %s %s",
    "TERM": "xterm",
    "TRITON_SERVER_VERSION": "2.39.0",
    "GDRCOPY_VERSION": "2.3",
    "LESSOPEN": "| /usr/bin/lesspipe %s",
    "OPENMPI_VERSION": "4.1.5rc2",
    "NVJPEG_VERSION": "12.2.2.4",
    "LIBRARY_PATH": "/usr/local/cuda/lib64/stubs:",
    "SHLVL": "1",
    "BASH_ENV": "/etc/bash.bashrc",
    "TF_AUTOTUNE_THRESHOLD": "2",
    "CUDNN_VERSION": "8.9.5.29",
    "NVIDIA_TRITON_SERVER_BASE_VERSION": "23.10",
    "NSIGHT_COMPUTE_VERSION": "2023.2.2.3",
    "DALI_VERSION": "1.30.0",
    "NVIDIA_TRITON_SERVER_VERSION": "23.10",
    "LD_LIBRARY_PATH": "/opt/hpcx/ucc/lib/:/opt/hpcx/ucx/lib/:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib/python3.10/dist-packages/nvidia_pytriton.libs",
    "NVIDIA_BUILD_ID": "72127154",
    "OMPI_MCA_coll_hcoll_enable": "0",
    "OPAL_PREFIX": "/opt/hpcx/ompi",
    "CUDA_DRIVER_VERSION": "535.104.05",
    "TRANSFORMER_ENGINE_VERSION": "0.12",
    "_CUDA_COMPAT_PATH": "/usr/local/cuda/compat",
    "NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS": "",
    "PATH": "/usr/bin:/opt/tritonserver/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin",
    "TRITON_SERVER_USER": "triton-server",
    "MOFED_VERSION": "5.4-rdmacore39.0",
    "TRTOSS_VERSION": "23.10",
    "DEBIAN_FRONTEND": "noninteractive",
    "TF_ADJUST_HUE_FUSED": "1",
    "TF_ADJUST_SATURATION_FUSED": "1",
    "UCX_MEM_EVENTS": "no",
    "_": "/usr/local/bin/ipython",
    "LC_CTYPE": "C.UTF-8"
}
2023-11-30 10:29:38,526 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.9999487400055)
I1130 10:29:38.546780 696 cache_manager.cc:174] Creating TritonCache with name: 'redis', libpath: '/opt/tritonserver/caches/redis/libtritoncache_redis.so', cache_config: '{"host":"localhost","port":"6379"}'
I1130 10:29:38.718499 696 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f5438000000' with size 268435456
I1130 10:29:38.718715 696 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1130 10:29:38.719366 696 server.cc:592] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1130 10:29:38.719381 696 server.cc:619] 
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I1130 10:29:38.719391 696 server.cc:662] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I1130 10:29:38.777869 696 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA RTX A6000
I1130 10:29:38.778072 696 metrics.cc:710] Collecting CPU metrics
I1130 10:29:38.778203 696 tritonserver.cc:2458] 
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.39.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  |  model_repository(unload_dependents) sch |
|                                  | edule_policy model_configuration system_ |
|                                  | shared_memory cuda_shared_memory binary_ |
|                                  | tensor_data parameters statistics trace  |
|                                  | logging                                  |
| model_repository_path[0]         | /root/.cache/pytriton/workspace__lhrs25_ |
| model_control_mode               | MODE_EXPLICIT                            |
| strict_model_config              | 0                                        |
| rate_limit                       | OFF                                      |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
| cache_enabled                    | 1                                        |
+----------------------------------+------------------------------------------+

I1130 10:29:38.779899 696 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1130 10:29:38.780082 696 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1130 10:29:38.825539 696 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
2023-11-30 10:29:39,532 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:39,533 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:39,533 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.99999141693115)
2023-11-30 10:29:39,535 - DEBUG - pytriton.client.client: Closing ModelClient
[INFO/BlocksStoreManager-2] child process calling self.run()
[INFO/BlocksStoreManager-2] manager serving at '/root/.cache/pytriton/workspace__lhrs25_/data_store.sock'
2023-11-30 10:29:39,880 - DEBUG - pytriton.proxy.communication: Started remote block store at /root/.cache/pytriton/workspace__lhrs25_/data_store.sock (pid=733)
2023-11-30 10:29:39,880 - DEBUG - pytriton.models.manager: Crating model Test with version 1.
2023-11-30 10:29:39,882 - DEBUG - pytriton.proxy.inference_handler: Binding IPC socket at ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test_0.
2023-11-30 10:29:39,884 - DEBUG - pytriton.proxy.communication: Already connectd to remote block store at /root/.cache/pytriton/workspace__lhrs25_/data_store.sock
2023-11-30 10:29:39,885 - DEBUG - pytriton.proxy.inference_handler: Waiting for requests from proxy model for Test.
2023-11-30 10:29:39,885 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:39,886 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:39,886 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.99999284744263)
I1130 10:29:39.894993 696 model_lifecycle.cc:461] loading: Test:1
I1130 10:29:41.234660 696 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: Test_0_0 (CPU device 0)
2023-11-30 10:29:41,529 - DEBUG - pytriton.models.model: Closing handshake socket
2023-11-30 10:29:41,536 - DEBUG - pytriton.client.client: Closing ModelClient
2023-11-30 10:29:41,537 - DEBUG - pytriton.models.manager: Done.
2023-11-30 10:29:41,537 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:41,538 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:41,538 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=59.99999165534973)
2023-11-30 10:29:41,541 - DEBUG - pytriton.client.utils: Waiting for model Test/1 to be ready (timeout=59.99741291999817)
2023-11-30 10:29:41,542 - DEBUG - pytriton.client.client: Closing ModelClient
2023-11-30 10:29:41,542 - INFO - pytriton.triton: Infer function available as model: `/v2/models/Test`
2023-11-30 10:29:41,542 - INFO - pytriton.triton:   Status:         `GET  /v2/models/Test/ready/`
2023-11-30 10:29:41,542 - INFO - pytriton.triton:   Model config:   `GET  /v2/models/Test/config/`
2023-11-30 10:29:41,543 - INFO - pytriton.triton:   Inference:      `POST /v2/models/Test/infer/`
2023-11-30 10:29:41,543 - INFO - pytriton.triton: Read more about configuring and serving models in documentation: https://triton-inference-server.github.io/pytriton.
2023-11-30 10:29:41,543 - INFO - pytriton.triton: (Press CTRL+C or use the command `kill -SIGINT 503` to send a SIGINT signal and quit)
I1130 10:29:41.536328 696 model_lifecycle.cc:818] successfully loaded
 'Test'

The log indicates that the cache is active and that the Redis cache library was loaded (see the cache_manager.cc line above: Creating TritonCache with name: 'redis').

First client run with empty cache:

from pytriton.client import ModelClient
import numpy as np
cl = ModelClient("localhost", "Test")
cl.infer_batch(np.array([["Test text ".encode('utf-8')]]))

Log output:

2023-11-30 10:29:54,252 - DEBUG - pytriton.client.utils: Adding http scheme to localhost
2023-11-30 10:29:54,253 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://localhost:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:54,253 - DEBUG - pytriton.client.utils: Adding http scheme to localhost
2023-11-30 10:29:54,253 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://localhost:8000 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-30 10:29:54,254 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=299.9999930858612)
2023-11-30 10:29:54,256 - DEBUG - pytriton.client.utils: Waiting for model Test/<latest> to be ready (timeout=299.99828147888184)
2023-11-30 10:29:54,256 - DEBUG - pytriton.client.utils: Obtaining model Test config
2023-11-30 10:29:54,258 - DEBUG - pytriton.model_config.parser: Parsing Triton config model from dict: 
{
    "name": "Test",
    "platform": "",
    "backend": "python",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1,
    "input": [
        {
            "name": "text",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "text",
            "data_type": "TYPE_STRING",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "dynamic_batching": {
        "preferred_batch_size": [
            1
        ],
        "max_queue_delay_microseconds": 0,
        "preserve_ordering": false,
        "priority_levels": 0,
        "default_priority_level": 0,
        "priority_queue_policy": {}
    },
    "instance_group": [
        {
            "name": "Test_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "shared-memory-socket": {
            "string_value": "ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test"
        }
    },
    "model_warmup": [],
    "response_cache": {
        "enable": true
    }
}
2023-11-30 10:29:54,258 - DEBUG - pytriton.model_config.parser: backend_parameters_config is a dictionary: {'shared-memory-socket': {'string_value': 'ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test'}}
2023-11-30 10:29:54,259 - DEBUG - pytriton.client.utils: Model config: TritonModelConfig(model_name='Test', model_version=1, max_batch_size=1, batching=True, batcher=DynamicBatcher(max_queue_delay_microseconds=0, preferred_batch_size=[1], preserve_ordering=False, priority_levels=0, default_priority_level=0, default_queue_policy=None, priority_queue_policy=None), instance_group={<DeviceKind.KIND_CPU: 'KIND_CPU'>: 1}, decoupled=False, backend_parameters={'shared-memory-socket': 'ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test'}, inputs=[TensorSpec(name='text', shape=(-1,), dtype=<class 'numpy.bytes_'>, optional=False)], outputs=[TensorSpec(name='text', shape=(-1,), dtype=<class 'numpy.bytes_'>, optional=False)], response_cache=ResponseCache(enable=True))
2023-11-30 10:29:54,259 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=299.9999933242798)
2023-11-30 10:29:54,259 - DEBUG - pytriton.client.utils: Waiting for model Test/<latest> to be ready (timeout=299.9993267059326)
2023-11-30 10:29:54,260 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=299.9989459514618)
2023-11-30 10:29:54,260 - DEBUG - pytriton.client.utils: Waiting for model Test/<latest> to be ready (timeout=299.9983859062195)
2023-11-30 10:29:54,261 - DEBUG - pytriton.client.utils: Obtaining model Test config
2023-11-30 10:29:54,263 - DEBUG - pytriton.model_config.parser: Parsing Triton config model from dict: 
{
    "name": "Test",
    "platform": "",
    "backend": "python",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1,
    "input": [
        {
            "name": "text",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "text",
            "data_type": "TYPE_STRING",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "dynamic_batching": {
        "preferred_batch_size": [
            1
        ],
        "max_queue_delay_microseconds": 0,
        "preserve_ordering": false,
        "priority_levels": 0,
        "default_priority_level": 0,
        "priority_queue_policy": {}
    },
    "instance_group": [
        {
            "name": "Test_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "shared-memory-socket": {
            "string_value": "ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test"
        }
    },
    "model_warmup": [],
    "response_cache": {
        "enable": true
    }
}
2023-11-30 10:29:54,263 - DEBUG - pytriton.model_config.parser: backend_parameters_config is a dictionary: {'shared-memory-socket': {'string_value': 'ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test'}}
2023-11-30 10:29:54,263 - DEBUG - pytriton.client.utils: Model config: TritonModelConfig(model_name='Test', model_version=1, max_batch_size=1, batching=True, batcher=DynamicBatcher(max_queue_delay_microseconds=0, preferred_batch_size=[1], preserve_ordering=False, priority_levels=0, default_priority_level=0, default_queue_policy=None, priority_queue_policy=None), instance_group={<DeviceKind.KIND_CPU: 'KIND_CPU'>: 1}, decoupled=False, backend_parameters={'shared-memory-socket': 'ipc:///root/.cache/pytriton/workspace__lhrs25_/ipc_proxy_backend_Test'}, inputs=[TensorSpec(name='text', shape=(-1,), dtype=<class 'numpy.bytes_'>, optional=False)], outputs=[TensorSpec(name='text', shape=(-1,), dtype=<class 'numpy.bytes_'>, optional=False)], response_cache=ResponseCache(enable=True))
2023-11-30 10:29:54,264 - DEBUG - pytriton.client.client: Sending inference request to Triton Inference Server
2023-11-30 10:29:54,300 - DEBUG - pytriton.proxy.inference_handler: Preparing inputs for Test.
2023-11-30 10:29:54,300 - DEBUG - pytriton.proxy.inference_handler: Processing inference callback for Test.
2023-11-30 10:29:54,301 - DEBUG - pytriton.proxy.inference_handler: Validating outputs for Test.
2023-11-30 10:29:54,301 - DEBUG - pytriton.proxy.validators: Outputs: [{'text': array(b'Test', dtype='|S4')}]
2023-11-30 10:29:54,301 - DEBUG - pytriton.proxy.validators: Response: {'text': array(b'Test', dtype='|S4')}
2023-11-30 10:29:54,301 - DEBUG - pytriton.proxy.validators: text: b'Test'
2023-11-30 10:29:54,301 - DEBUG - pytriton.proxy.inference_handler: Copying outputs to shared memory for Test.
2023-11-30 10:29:54,303 - DEBUG - pytriton.proxy.inference_handler: Sending response: InferenceHandlerResponses(responses=[MetaRequestResponse(idx=0, data={'text': 'psm_75d63589:65'}, parameters=None, eos=False)], error=None)
2023-11-30 10:29:54,304 - DEBUG - pytriton.proxy.inference_handler: Send eos response to proxy model for Test.
2023-11-30 10:29:54,304 - DEBUG - pytriton.proxy.communication: Releasing shared memory block for tensor psm_75d63589:0
2023-11-30 10:29:54,304 - DEBUG - pytriton.proxy.inference_handler: Waiting for requests from proxy model for Test.
Out[12]: {'text': array(b'Test', dtype=object)}

The inference callable was called here, so the response came from the model rather than the cache (a cache miss).

Second run, with the cache now containing the response:

cl.infer_batch(np.array([["Test text ".encode('utf-8')]]))

Log output:

In [7]: cl.infer_batch(np.array([["Test text ".encode('utf-8')]]))
2023-11-30 10:43:56,994 - DEBUG - pytriton.client.client: Sending inference request to Triton Inference Server
Out[7]: {'text': array(b'Test', dtype=object)}

Only the client-side lines appear in the log because the cache served the answer; the inference callable was not invoked.
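
If you want harder evidence than the absence of log lines, you can also inspect Triton's Prometheus metrics endpoint (port 8002 in this setup). Here is a minimal sketch that prints any cache-related counters; metric names can differ between Triton versions, so it filters on the substring "cache" rather than hard-coding them:

# Print cache-related counters from Triton's metrics endpoint (default port 8002).
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if "cache" in line.lower() and not line.startswith("#"):
            print(line)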

zbloss commented 7 months ago

Thanks for helping out @piotrm-nvidia, I am using nvidia-pytriton==0.4.1.

  1. Are you suggesting the issue could be that I'm on a Mac and not building this from a Linux machine?
  2. Can you specify which image you are using? Your code snippet cut off at nvcr.io/nvidia/tritonserver:23.
  3. Are you running Redis on the same machine you're running PyTriton on?
    • I am trying to connect to Redis running in a different Docker container defined in a docker-compose.yml

Also, would you mind sharing how large your built Docker image is (docker image ls)? With Redis disabled, I'm building images that are roughly 21 GB and I'm not sure why.

piotrm-nvidia commented 7 months ago
  1. Running a Linux (Ubuntu-derived) container in Docker on macOS should work. See: https://github.com/triton-inference-server/pytriton/issues/44
  2. I used this docker image for AMD64 Linux:

    nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3
  3. I ran the Redis server in the same Docker container as the Triton server; I run Docker in interactive mode so I can have multiple processes in a single container. You can check the connection to Redis between containers using redis-cli, or with the short Python check below.
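
A minimal sketch of such a check from inside the PyTriton container, using only the standard library ("redis" here is an assumed docker-compose service name; adjust host and port to your setup):

# Check that the Redis host used in cache_config is reachable from this container.
# "redis" is a hypothetical docker-compose service name; adjust as needed.
import socket

host, port = "redis", 6379
try:
    with socket.create_connection((host, port), timeout=5):
        print(f"{host}:{port} is reachable")
except OSError as exc:
    print(f"Cannot reach {host}:{port}: {exc}")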

Here is the docker ps output for my container instance:

$ docker ps --all --size
CONTAINER ID   IMAGE                                              COMMAND                  CREATED        STATUS                      PORTS     NAMES                 SIZE
6d7a1947efcb   nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3   "/opt/nvidia/nvidia_…"   27 hours ago   Exited (0) 27 hours ago               reverent_joliot       51.3kB (virtual 9.72GB)
github-actions[bot] commented 6 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 7 days with no activity.