triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Encounter `Stub process is not healthy` only with kserve pod #7547

Closed · thechaos16 closed this issue 3 months ago

thechaos16 commented 3 months ago

**Description**
I tried to launch Triton Server with the vLLM backend (Llama 3 8B on an H100). When I deploy the pod myself (with Argo CD), it works well, but it fails with `Stub process is not healthy` when the same pod is deployed through KServe (with exactly the same setup).

Triton Information

TRITON_SERVER_VERSION=2.48.0
NVIDIA_TRITON_SERVER_VERSION=24.07

Are you using the Triton container or did you build it yourself?

The image is built with the following Dockerfile step:

```dockerfile
RUN pip3 install --upgrade vllm transformers && \
    wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl#sha256=5303ea4ca718521e167e5a4c5379f39fd3bc3cf7be16bed52e302476c1d12fa7 && \
    pip3 install flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl && \
    rm flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl
```
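For context, a minimal sketch of how that step could sit in a complete Dockerfile. The issue only states Triton 24.07 (2.48.0), not the base image, so the `FROM` line (the 24.07 vLLM-enabled Triton image) is an assumption for illustration:

```dockerfile
# Assumption: building on the 24.07 Triton image that ships the vLLM backend;
# the issue does not say which image variant was actually used.
FROM nvcr.io/nvidia/tritonserver:24.07-vllm-python-py3

# The RUN step from the issue: upgrade vLLM/transformers and install the FlashInfer wheel.
RUN pip3 install --upgrade vllm transformers && \
    wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl#sha256=5303ea4ca718521e167e5a4c5379f39fd3bc3cf7be16bed52e302476c1d12fa7 && \
    pip3 install flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl && \
    rm flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl
```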


**To Reproduce**
Steps to reproduce the behavior.
1. Build a Docker image with the Dockerfile step above.
2. Run the following script inside the container; it raises an error when the container is deployed within KServe.
- script
```bash
#!/bin/bash
if [ -z "$1" ]; then
    MODEL_NAME='meta-llama/Meta-Llama-3-8B'
else
    MODEL_NAME="$1"
fi

# build model repository and model.json
mkdir -p ~/model_repository/vllm/1
cat > ~/model_repository/vllm/1/model.json <<EOF
{
    "model":"${MODEL_NAME}",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,
    "enforce_eager": "true"
}
EOF
cat > ~/model_repository/vllm/config.pbtxt <<EOF
backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
EOF

# launch tritonserver
tritonserver --model-repository ~/model_repository --http-port 8080
```

- log (the model fails to load with `Stub process is not healthy`)
```
INFO 08-20 06:19:34 model_runner.py:692] Loading model weights took 14.9595 GB
INFO 08-20 06:19:34 gpu_executor.py:102] # GPU blocks: 11432, # CPU blocks: 2048
I0820 06:19:43.371814 2568 python_be.cc:2050] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E0820 06:19:43.372212 2568 backend_model.cc:692] "ERROR: Failed to create instance: Stub process 'vllm_0_0' is not healthy."
I0820 06:19:43.372254 2568 python_be.cc:1891] "TRITONBACKEND_ModelFinalize: delete model state"
E0820 06:19:43.372303 2568 model_lifecycle.cc:641] "failed to load 'vllm' version 1: Internal: Stub process 'vllm_0_0' is not healthy."
I0820 06:19:43.372311 2568 model_lifecycle.cc:695] "OnLoadComplete() 'vllm' version 1"
I0820 06:19:43.372321 2568 model_lifecycle.cc:733] "OnLoadFinal() 'vllm' for all version(s)"
I0820 06:19:43.372327 2568 model_lifecycle.cc:776] "failed to load 'vllm'"
I0820 06:19:43.372469 2568 model_lifecycle.cc:297] "VersionStates() 'vllm'"
I0820 06:19:43.372504 2568 model_lifecycle.cc:297] "VersionStates() 'vllm'"
I0820 06:19:43.372565 2568 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0820 06:19:43.372595 2568 server.cc:631]
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                           |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| vllm    | /opt/tritonserver/backends/vllm/model.py              | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

I0820 06:19:43.372623 2568 model_lifecycle.cc:276] "ModelStates()"
I0820 06:19:43.372635 2568 server.cc:674]
+-------+---------+----------------------------------------------------------------+
| Model | Version | Status                                                         |
+-------+---------+----------------------------------------------------------------+
| vllm  | 1       | UNAVAILABLE: Internal: Stub process 'vllm_0_0' is not healthy. |
+-------+---------+----------------------------------------------------------------+
```



**Expected behavior**
- The Triton server starts and the `vllm` model loads successfully.

thechaos16 commented 3 months ago

It was because of the resources. The KServe YAML had an empty `resources: {}` block at first, and after I put in some specifications like

```yaml
resources:
  limits:
    cpu: '6'
    memory: 48Gi
    nvidia.com/gpu: '1'
  requests:
    cpu: '3'
    memory: 48Gi
    nvidia.com/gpu: '1'
```

it works.
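For anyone hitting the same thing, here is a minimal sketch of where that block could go in a KServe `InferenceService` manifest using a custom predictor container. The metadata name, image tag, args, and overall predictor layout are illustrative assumptions; only the `resources` values come from the comment above:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: triton-vllm                               # hypothetical name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: my-registry/triton-vllm:24.07      # hypothetical image built from the Dockerfile above
        args:
          - tritonserver
          - --model-repository=/root/model_repository
          - --http-port=8080
        # Without explicit requests/limits (i.e. resources: {}), the vLLM stub
        # process failed to start and the model stayed UNAVAILABLE.
        resources:
          limits:
            cpu: '6'
            memory: 48Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '3'
            memory: 48Gi
            nvidia.com/gpu: '1'
```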