Hi,
I'm trying to deploy Triton Inference Server with the tensorrtllm_backend to Kubernetes via a Helm chart, following the docs HERE and HERE.
I noticed the following changes in recent releases and want to try out the latest one (24.04-trtllm-python-py3):
Starting with the Triton 23.10 release, Triton includes a container with the TensorRT-LLM backend and the Python backend. This container should have everything needed to run a TensorRT-LLM model.
Starting with the 24.04 release, the Triton Server TensorRT-LLM container comes with the TensorRT-LLM package pre-installed, which allows users to build engines inside the Triton container.
My intention is to run the container (deploy the pod to the K8s cluster) first without launching Triton Server directly, and then do the following inside the container (a rough sketch of these steps is shown after the list):
1. Create engines for Llama 3
2. Copy over the inflight batcher model repository
3. Modify config.pbtxt
4. Launch Triton Server
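For reference, this is roughly what I plan to run inside the container once it is up. The paths, the checkpoint location, and the engine options are assumptions based on the tensorrtllm_backend Llama examples, not the exact commands from my environment:

```bash
# Assumed layout: the TensorRT-LLM and tensorrtllm_backend repos are cloned (or
# mounted) inside the container, and the NFS share is mounted at /models.

# 1. Build TensorRT-LLM engines for Llama 3 (TensorRT-LLM is pre-installed in 24.04)
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /models/Meta-Llama-3-8B-Instruct \
    --output_dir /models/llama3_ckpt \
    --dtype float16
trtllm-build --checkpoint_dir /models/llama3_ckpt \
    --output_dir /models/llama3_engine \
    --gemm_plugin float16

# 2. Copy over the inflight batcher model repository
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm /models/triton_repo

# 3. Modify config.pbtxt (by hand, or with tensorrtllm_backend/tools/fill_template.py)
#    to point at /models/llama3_engine, the tokenizer, batch size, etc.

# 4. Launch Triton Server against that repository
tritonserver --model-repository=/models/triton_repo
```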
The default deployment.yaml HERE passes `tritonserver` (plus its flags) as the container args, so I changed the file from the original version to one without them (see the sketch below).
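Roughly, the intent of the edit is the following. This is a sketch, not my exact template; apart from the standard Kubernetes fields, the names and values are assumptions about the k8s-onprem chart:

```yaml
# Before: the chart template launches tritonserver via the container args,
# e.g. something like
#   args: ["tritonserver", "--model-repository={{ .Values.image.modelRepositoryPath }}"]
#
# After: the args block is removed so nothing starts tritonserver automatically
containers:
  - name: triton-inference-server
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    # no args here on purpose; the plan is to exec into the pod and run the
    # engine-build and launch steps manually
```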
I also changed the values.yaml file from the original version to the following (again, sketched below).
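Approximately like this (field names other than serverArgs are my guesses at the chart's values layout, not the verbatim file):

```yaml
image:
  imageName: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  pullPolicy: IfNotPresent
  numGpus: 1
  # left empty so the template does not pass tritonserver or any flags to the container
  serverArgs: []
```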
So, no "serverargs" will be passed to container and triton server will not be launched directly.
After the pod is scheduled, its status is CrashLoopBackOff.
kubectl describe pod shows "Back-off restarting failed container":
ubuntu@master:~/server/deploy/k8s-onprem$ kubectl describe pod example-triton-inference-server-769cd78c5c-5cmfd
Name: example-triton-inference-server-769cd78c5c-5cmfd
Namespace: default
Priority: 0
Service Account: default
Node: worker2/33.66.0.10
Start Time: Sat, 25 May 2024 15:41:02 +0000
Labels: app=triton-inference-server
pod-template-hash=769cd78c5c
release=example
Annotations: cni.projectcalico.org/containerID: 79b63e39a67b3eca50a3432df7d28e14a6e5f30c130b841033c0abbb9b148a7e
cni.projectcalico.org/podIP: 192.168.189.99/32
cni.projectcalico.org/podIPs: 192.168.189.99/32
Status: Running
IP: 192.168.189.99
IPs:
IP: 192.168.189.99
Controlled By: ReplicaSet/example-triton-inference-server-769cd78c5c
Containers:
triton-inference-server:
Container ID: containerd://f892b8aa997ebd47c2d33cadb2a44c2f0267569a07e6912c40136554e35fb6c0
Image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
Image ID: nvcr.io/nvidia/tritonserver@sha256:5450495c5274c106ceb167026008072fb675c2ff131375bc02a5d5358d6ba7ff
Ports: 8000/TCP, 8001/TCP, 8002/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 25 May 2024 15:42:32 +0000
Finished: Sat, 25 May 2024 15:42:32 +0000
Ready: False
Restart Count: 4
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Liveness: http-get http://:http/v2/health/live delay=15s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http/v2/health/ready delay=5s timeout=1s period=5s #success=1 #failure=3
Startup: http-get http://:http/v2/health/ready delay=0s timeout=1s period=10s #success=1 #failure=30
Environment: <none>
Mounts:
/models from models (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4z2j (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
models:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 33.66.0.18
Path: /srv
ReadOnly: false
kube-api-access-j4z2j:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 98s default-scheduler Successfully assigned default/example-triton-inference-server-769cd78c5c-5cmfd to worker2
Normal Pulled 54s (x4 over 98s) kubelet Container image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3" already present on machine
Normal Created 54s (x4 over 98s) kubelet Created container triton-inference-server
Normal Started 54s (x4 over 98s) kubelet Started container triton-inference-server
Warning BackOff 20s (x13 over 95s) kubelet Back-off restarting failed container triton-inference-server in pod example-triton-inference-server-769cd78c5c-5cmfd_default(70a18973-02d1-48a0-8a2c-96c0773f4af4)
ubuntu@master:~/server/deploy/k8s-onprem$
kubectl logs shows no error message...
ubuntu@master:~/server/deploy/k8s-onprem$ kubectl logs example-triton-inference-server-769cd78c5c-5cmfd -n default
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 24.04 (build 90085495)
Triton Server Version 2.45.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.161.07.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
ubuntu@master:~/server/deploy/k8s-onprem$
Has anyone encountered a similar issue?