Hi,
I'm trying to deploy Triton Inference Server with the tensorrtllm_backend to Kubernetes via a Helm chart, following the docs HERE and HERE.
I noticed the following changes in recent releases and want to try out the latest one (24.04-trtllm-python-py3):
Starting with the Triton 23.10 release, Triton includes a container with the TensorRT-LLM backend and the Python backend. This container should have everything needed to run a TensorRT-LLM model.
Starting with the 24.04 release, the Triton Server TensorRT-LLM container comes with the TensorRT-LLM package pre-installed, which allows users to build engines inside the Triton container.
My intention is to run the container (deploy the pod to the K8s cluster) first without launching Triton Server directly, and then do the following inside the container (a rough sketch of these steps is shown after the list):
1. Create engines for Llama 3
2. Copy over the inflight batcher model repository
3. Modify config.pbtxt
4. Launch Triton Server
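For reference, this is roughly what I plan to run inside the container once it is up. The paths, the checkpoint location, and the engine options are assumptions based on the tensorrtllm_backend Llama examples, not the exact commands from my environment:

```bash
# Assumed layout: the TensorRT-LLM and tensorrtllm_backend repos are cloned (or
# mounted) inside the container, and the NFS share is mounted at /models.

# 1. Build TensorRT-LLM engines for Llama 3 (TensorRT-LLM is pre-installed in 24.04)
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /models/Meta-Llama-3-8B-Instruct \
    --output_dir /models/llama3_ckpt \
    --dtype float16
trtllm-build --checkpoint_dir /models/llama3_ckpt \
    --output_dir /models/llama3_engine \
    --gemm_plugin float16

# 2. Copy over the inflight batcher model repository
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm /models/triton_repo

# 3. Modify config.pbtxt (by hand, or with tensorrtllm_backend/tools/fill_template.py)
#    to point at /models/llama3_engine, the tokenizer, batch size, etc.

# 4. Launch Triton Server against that repository
tritonserver --model-repository=/models/triton_repo
```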
The default deployment.yaml HERE passes `tritonserver` (plus its flags) as the container args, so I changed the file from the original version to one without them (see the sketch below).
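Roughly, the intent of the edit is the following. This is a sketch, not my exact template; apart from the standard Kubernetes fields, the names and values are assumptions about the k8s-onprem chart:

```yaml
# Before: the chart template launches tritonserver via the container args,
# e.g. something like
#   args: ["tritonserver", "--model-repository={{ .Values.image.modelRepositoryPath }}"]
#
# After: the args block is removed so nothing starts tritonserver automatically
containers:
  - name: triton-inference-server
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    # no args here on purpose; the plan is to exec into the pod and run the
    # engine-build and launch steps manually
```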
I also changed the values.yaml file from the original version to the following (again, sketched below).
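Approximately like this (field names other than serverArgs are my guesses at the chart's values layout, not the verbatim file):

```yaml
image:
  imageName: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  pullPolicy: IfNotPresent
  numGpus: 1
  # left empty so the template does not pass tritonserver or any flags to the container
  serverArgs: []
```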
So, no "serverargs" will be passed to container and triton server will not be launched directly.
After the pod is scheduled, its status is CrashLoopBackOff.
kubectl describe pod shows "Back-off restarting failed container":
ubuntu@master:~/server/deploy/k8s-onprem$ kubectl describe pod example-triton-inference-server-769cd78c5c-5cmfd
Name: example-triton-inference-server-769cd78c5c-5cmfd
Namespace: default
Priority: 0
Service Account: default
Node: worker2/33.66.0.10
Start Time: Sat, 25 May 2024 15:41:02 +0000
Labels: app=triton-inference-server
pod-template-hash=769cd78c5c
release=example
Annotations: cni.projectcalico.org/containerID: 79b63e39a67b3eca50a3432df7d28e14a6e5f30c130b841033c0abbb9b148a7e
cni.projectcalico.org/podIP: 192.168.189.99/32
cni.projectcalico.org/podIPs: 192.168.189.99/32
Status: Running
IP: 192.168.189.99
IPs:
IP: 192.168.189.99
Controlled By: ReplicaSet/example-triton-inference-server-769cd78c5c
Containers:
triton-inference-server:
Container ID: containerd://f892b8aa997ebd47c2d33cadb2a44c2f0267569a07e6912c40136554e35fb6c0
Image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
Image ID: nvcr.io/nvidia/tritonserver@sha256:5450495c5274c106ceb167026008072fb675c2ff131375bc02a5d5358d6ba7ff
Ports: 8000/TCP, 8001/TCP, 8002/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 25 May 2024 15:42:32 +0000
Finished: Sat, 25 May 2024 15:42:32 +0000
Ready: False
Restart Count: 4
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Liveness: http-get http://:http/v2/health/live delay=15s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http/v2/health/ready delay=5s timeout=1s period=5s #success=1 #failure=3
Startup: http-get http://:http/v2/health/ready delay=0s timeout=1s period=10s #success=1 #failure=30
Environment: <none>
Mounts:
/models from models (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4z2j (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
models:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 33.66.0.18
Path: /srv
ReadOnly: false
kube-api-access-j4z2j:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 98s default-scheduler Successfully assigned default/example-triton-inference-server-769cd78c5c-5cmfd to worker2
Normal Pulled 54s (x4 over 98s) kubelet Container image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3" already present on machine
Normal Created 54s (x4 over 98s) kubelet Created container triton-inference-server
Normal Started 54s (x4 over 98s) kubelet Started container triton-inference-server
Warning BackOff 20s (x13 over 95s) kubelet Back-off restarting failed container triton-inference-server in pod example-triton-inference-server-769cd78c5c-5cmfd_default(70a18973-02d1-48a0-8a2c-96c0773f4af4)
ubuntu@master:~/server/deploy/k8s-onprem$
kubectl logs shows no error message...
ubuntu@master:~/server/deploy/k8s-onprem$ kubectl logs example-triton-inference-server-769cd78c5c-5cmfd -n default
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 24.04 (build 90085495)
Triton Server Version 2.45.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.161.07.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
ubuntu@master:~/server/deploy/k8s-onprem$
Has anyone encountered a similar issue?