triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

On server/deploy/oci, running "helm install example ." to deploy the Inference Server: the pod never reaches Running because the Liveness and Readiness probes fail #7154

Closed by aviv12825 1 month ago

aviv12825 commented 5 months ago

On server/deploy/oci, I ran "helm install example ." to deploy the Inference Server, but the pod never reaches Running because the Liveness and Readiness probes fail.

The describe log details are below. I tried adding initialDelaySeconds: 180 to templates/deployment.yaml, which didn't help. Can someone please advise?

```
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  4m11s                default-scheduler  Successfully assigned default/example-triton-inference-server-9c5d9f79-74rt4 to 10.0.10.95
  Warning  Unhealthy  41s (x3 over 61s)    kubelet            Liveness probe failed: Get "http://10.0.10.177:8000/v2/health/live": dial tcp 10.0.10.177:8000: connect: connection refused
  Normal   Killing    41s                  kubelet            Container triton-inference-server failed liveness probe, will be restarted
  Normal   Pulled     11s (x2 over 4m10s)  kubelet            Container image "nvcr.io/nvidia/tritonserver:24.03-py3" already present on machine
  Warning  Unhealthy  11s (x13 over 66s)   kubelet            Readiness probe failed: Get "http://10.0.10.177:8000/v2/health/ready": dial tcp 10.0.10.177:8000: connect: connection refused
  Normal   Created    10s (x2 over 4m10s)  kubelet            Created container triton-inference-server
  Normal   Started    10s (x2 over 4m10s)  kubelet            Started container triton-inference-server
```
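For context, the probe section I modified in templates/deployment.yaml looks roughly like this. This is a sketch using the standard Kubernetes probe fields and the Triton health endpoints from the events above; the exact structure and values in the chart may differ:

```yaml
# Sketch of the container probe settings (standard Kubernetes probe spec).
# initialDelaySeconds: 180 is the value I added; the rest are illustrative.
livenessProbe:
  httpGet:
    path: /v2/health/live
    port: 8000
  initialDelaySeconds: 180
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /v2/health/ready
    port: 8000
  initialDelaySeconds: 180
  periodSeconds: 5
```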

rmccorm4 commented 5 months ago

Hi @aviv12825,

I see the errors returned involve "connection refused". Have you confirmed from the pod logs that the server itself started up successfully to expose these endpoints?
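A quick way to check, assuming the pod name from the events above (these are standard kubectl commands, shown as a sketch):

```shell
# Inspect the Triton container's own logs for startup errors
# (model load failures, port binding problems, CUDA/driver issues, etc.)
kubectl logs example-triton-inference-server-9c5d9f79-74rt4

# If the liveness probe has already restarted the container,
# the interesting output is in the previous instance's logs
kubectl logs example-triton-inference-server-9c5d9f79-74rt4 --previous

# Once the server is actually up, the readiness endpoint should return 200
kubectl exec example-triton-inference-server-9c5d9f79-74rt4 -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```

"connection refused" on both probes usually means nothing is listening on port 8000 at all, i.e. the server process exited or never finished starting, so the pod logs are the place to look first.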

krishung5 commented 1 month ago

Closing due to inactivity. Please re-open if you would like to follow up on this issue.