triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Encounter `Stub process is not healthy` only with kserve pod #7547

Closed · thechaos16 closed this issue 3 months ago

thechaos16 commented 3 months ago

**Description**
I tried to launch Triton Server with the vLLM backend (Llama 3 8B on an H100). When I deploy the pod myself (with Argo CD), it works well, but it fails with `Stub process is not healthy` when the same pod is deployed through KServe (with exactly the same setup).

Triton Information

TRITON_SERVER_VERSION=2.48.0
NVIDIA_TRITON_SERVER_VERSION=24.07

Are you using the Triton container or did you build it yourself?

The image is built with the following Dockerfile step:

```dockerfile
RUN pip3 install --upgrade vllm transformers && \
    wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl#sha256=5303ea4ca718521e167e5a4c5379f39fd3bc3cf7be16bed52e302476c1d12fa7 && \
    pip3 install flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl && \
    rm flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl
```
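For context, a minimal sketch of how that step could sit in a complete Dockerfile. The issue only states Triton 24.07 (2.48.0), not the base image, so the `FROM` line (the 24.07 vLLM-enabled Triton image) is an assumption for illustration:

```dockerfile
# Assumption: building on the 24.07 Triton image that ships the vLLM backend;
# the issue does not say which image variant was actually used.
FROM nvcr.io/nvidia/tritonserver:24.07-vllm-python-py3

# The RUN step from the issue: upgrade vLLM/transformers and install the FlashInfer wheel.
RUN pip3 install --upgrade vllm transformers && \
    wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl#sha256=5303ea4ca718521e167e5a4c5379f39fd3bc3cf7be16bed52e302476c1d12fa7 && \
    pip3 install flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl && \
    rm flashinfer-0.1.2+cu121torch2.3-cp310-cp310-linux_x86_64.whl
```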


**To Reproduce**
Steps to reproduce the behavior.
1. Build a Docker image with the Dockerfile step above.
2. Run the following script inside the container; it raises an error when the container is deployed within KServe.
- script
```bash
#!/bin/bash
if [ -z "$1" ]; then
    MODEL_NAME='meta-llama/Meta-Llama-3-8B'
else
    MODEL_NAME="$1"
fi

# build model repository and model.json
mkdir -p ~/model_repository/vllm/1
cat > ~/model_repository/vllm/1/model.json <<EOF
{
    "model":"${MODEL_NAME}",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,
    "enforce_eager": "true"
}
EOF
cat > ~/model_repository/vllm/config.pbtxt <<EOF
backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
EOF

# launch tritonserver
tritonserver --model-repository ~/model_repository --http-port 8080
```

- log (the model fails to load with `Stub process is not healthy`)
```
INFO 08-20 06:19:34 model_runner.py:692] Loading model weights took 14.9595 GB
INFO 08-20 06:19:34 gpu_executor.py:102] # GPU blocks: 11432, # CPU blocks: 2048
I0820 06:19:43.371814 2568 python_be.cc:2050] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
E0820 06:19:43.372212 2568 backend_model.cc:692] "ERROR: Failed to create instance: Stub process 'vllm_0_0' is not healthy."
I0820 06:19:43.372254 2568 python_be.cc:1891] "TRITONBACKEND_ModelFinalize: delete model state"
E0820 06:19:43.372303 2568 model_lifecycle.cc:641] "failed to load 'vllm' version 1: Internal: Stub process 'vllm_0_0' is not healthy."
I0820 06:19:43.372311 2568 model_lifecycle.cc:695] "OnLoadComplete() 'vllm' version 1"
I0820 06:19:43.372321 2568 model_lifecycle.cc:733] "OnLoadFinal() 'vllm' for all version(s)"
I0820 06:19:43.372327 2568 model_lifecycle.cc:776] "failed to load 'vllm'"
I0820 06:19:43.372469 2568 model_lifecycle.cc:297] "VersionStates() 'vllm'"
I0820 06:19:43.372504 2568 model_lifecycle.cc:297] "VersionStates() 'vllm'"
I0820 06:19:43.372565 2568 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0820 06:19:43.372595 2568 server.cc:631]
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                           |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| vllm    | /opt/tritonserver/backends/vllm/model.py              | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

I0820 06:19:43.372623 2568 model_lifecycle.cc:276] "ModelStates()"
I0820 06:19:43.372635 2568 server.cc:674]
+-------+---------+----------------------------------------------------------------+
| Model | Version | Status                                                         |
+-------+---------+----------------------------------------------------------------+
| vllm  | 1       | UNAVAILABLE: Internal: Stub process 'vllm_0_0' is not healthy. |
+-------+---------+----------------------------------------------------------------+
```



**Expected behavior**
- The Triton server starts and the `vllm` model loads successfully.

thechaos16 commented 3 months ago

It was because of the resources. The KServe YAML had an empty `resources: {}` block at first, and after I put in some specifications like

```yaml
resources:
  limits:
    cpu: '6'
    memory: 48Gi
    nvidia.com/gpu: '1'
  requests:
    cpu: '3'
    memory: 48Gi
    nvidia.com/gpu: '1'
```

it works.
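For anyone hitting the same thing, here is a minimal sketch of where that block could go in a KServe `InferenceService` manifest using a custom predictor container. The metadata name, image tag, args, and overall predictor layout are illustrative assumptions; only the `resources` values come from the comment above:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: triton-vllm                               # hypothetical name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: my-registry/triton-vllm:24.07      # hypothetical image built from the Dockerfile above
        args:
          - tritonserver
          - --model-repository=/root/model_repository
          - --http-port=8080
        # Without explicit requests/limits (i.e. resources: {}), the vLLM stub
        # process failed to start and the model stayed UNAVAILABLE.
        resources:
          limits:
            cpu: '6'
            memory: 48Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '3'
            memory: 48Gi
            nvidia.com/gpu: '1'
```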