triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Pods Receiving Traffic Too Early When Scaling with HPA Causes 'Socket Closed' Errors on Triton Inference Server #7264

Open patriksabol opened 4 months ago

patriksabol commented 4 months ago

I have deployed the Triton Inference Server to Google Cloud Platform and I am using a Horizontal Pod Autoscaler (HPA). During normal operation with one or three pods running, the LoadBalancer distributes requests across all pods correctly. However, while the HPA scales up from one to three pods, some requests fail with a "Socket closed" error while the new pods are still in the PodInitializing state. This issue is likely not Triton-specific, but related to how traffic is routed to pods that are not fully ready. I have configured startup, readiness, and liveness probes.

Traceback (most recent call last):
  File "/home/user/projects/vertex-triton-server/models/cartographer_model/client-grpc.py", line 50, in <module>
    response = client.infer('cartographer_model', [input0, input1], outputs=outputs, model_version="1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda/envs/vertex-triton-server/lib/python3.11/site-packages/tritonclient/grpc/_client.py", line 1572, in infer
    raise_error_grpc(rpc_error)
  File "/home/user/miniconda/envs/vertex-triton-server/lib/python3.11/site-packages/tritonclient/grpc/_utils.py", line 77, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
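
The HPA definition itself is not included here. For context, a minimal sketch of what such an autoscaler might look like (the name, metric, target utilization, and replica bounds are assumptions, not my actual values):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vertex-triton-server-hpa   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vertex-triton-server
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70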

I am testing this scenario with a script that sends 50 requests in parallel (a sketch of such a harness is included after the deployment manifest below). Here is my current configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vertex-triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vertex-triton-server
  template:
    metadata:
      labels:
        app: vertex-triton-server
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
      containers:
      - name: vertex-triton-server
        image: image/path
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 8000
            name: http-triton
          - containerPort: 8001
            name: grpc-triton
          - containerPort: 8002
            name: metrics-triton
        resources:
          limits:
            cpu: "11"
            memory: "38Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "11"
            memory: "38Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - name: model-volume
          mountPath: /models
        startupProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          periodSeconds: 10
          initialDelaySeconds: 80
          timeoutSeconds: 5
          failureThreshold: 60
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          periodSeconds: 10
          initialDelaySeconds: 20
          timeoutSeconds: 5
          failureThreshold: 60
        livenessProbe:
          failureThreshold: 60
          initialDelaySeconds: 100
          periodSeconds: 5
          httpGet:
            path: /v2/health/live
            port: 8000
        lifecycle:
          preStop:
            exec:
              command: [ "/bin/sh", "-c", "sleep 30" ]
      initContainers:
        - name: clone-models
          image: alpine/git
          command: [ 'sh', '-c' ]
          args:
            - |
              apk add --no-cache curl &&
              echo "Cloning repository..." &&
              cd /tmp &&
              git clone https://oauth2:$(GITLAB_TOKEN)@gitlab &&
              cp -r vertex-triton-server/models/* /models
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
            - name: GITLAB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: gitlab-token
                  key: token
          volumeMounts:
            - name: model-volume
              mountPath: /models
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: model-volume
        emptyDir: {}
      imagePullSecrets: # Add this section
        - name: gcp-artifact-registery
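
For reference, a minimal sketch of such a parallel-request harness, reusing the same gRPC call shown in my client code further down (the dummy tile size and the placeholder model name are assumptions, not my real inputs):

import numpy as np
import tritonclient.grpc as grpcclient
from concurrent.futures import ThreadPoolExecutor

def send_request(i):
    # Each worker opens its own connection and sends a single inference request.
    with grpcclient.InferenceServerClient("IP_ADDRESS:8001") as client:
        tile = np.zeros((1, 3, 512, 512), dtype=np.uint8)  # dummy tile; real size varies
        input0 = grpcclient.InferInput("tile", list(tile.shape), "UINT8")
        input0.set_data_from_numpy(tile)
        name = np.array([[b"MODEL_NAME"]], dtype=object)  # placeholder model name
        input1 = grpcclient.InferInput("model_name", [1, 1], "BYTES")
        input1.set_data_from_numpy(name)
        outputs = [grpcclient.InferRequestedOutput("geojson_output")]
        try:
            client.infer("cartographer_model", [input0, input1],
                         outputs=outputs, model_version="1")
            return i, "ok"
        except Exception as e:
            return i, f"error: {e}"

# Fire 50 requests in parallel and print the per-request outcome.
with ThreadPoolExecutor(max_workers=50) as pool:
    for i, status in pool.map(send_request, range(50)):
        print(i, status)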
patriksabol commented 3 months ago

I have observed that the issue is that a pod is added to the service endpoints prematurely, while it is still in its initialization phase, specifically as soon as the init container starts. This premature addition leads to client errors because the pod is not yet ready to handle requests.

Logs:

2024-05-29 11:29:00 vertex-triton-server-6d64b9586d-pjfl9   0/1     Pending           0          2m17s   <none>     gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>
2024-05-29 11:29:01 vertex-triton-server-6d64b9586d-pjfl9   0/1     Init:0/1          0          2m18s   <none>     gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>
2024-05-29 11:29:03 vertex-triton-server-6d64b9586d-pjfl9   0/1     Init:0/1          0          2m20s   10.4.4.4   gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>
2024-05-29 11:29:10 vertex-triton-server-6d64b9586d-pjfl9   0/1     PodInitializing   0          2m27s   10.4.4.4   gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>

During scale-up, a new pod is already assigned an IP (10.4.4.4) while it is still in the Init:0/1 state. This results in the client errors mentioned above:

tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed

However, the pods do report their READY status correctly, so the issue is probably not related to the readiness probe.
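
To confirm that the new pod's IP is added to the Service endpoints before the pod is ready, the endpoints can be watched during scale-up; a not-ready pod should only appear under notReadyAddresses, never under addresses:

kubectl get endpoints vertex-triton-server-service -o yaml --watch

On clusters that use EndpointSlices, the per-endpoint ready condition can be inspected the same way:

kubectl get endpointslices -l kubernetes.io/service-name=vertex-triton-server-service -o yaml --watch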

whoisj commented 3 months ago

@patriksabol very interesting problem. Pods should not be selected by a Service until they are running and have passed their startup and readiness probes.

In your second post it appears that none of the pods are past the init container stage, yet you're seeing their readiness probe succeeding? Is that correct?

Given the specific error of tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed, I'd like to see the definition of your service as well. There could be something in that which is leading to the problem.

In the meantime, I'll review the Triton code to see whether we've somehow introduced any timing issues w.r.t. readiness/liveness probes.
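
One way to separate probe behavior from routing is to hit Triton's readiness endpoint directly against the pod IP while a new pod is still initializing (10.4.4.4 is the pod IP from your logs; the throwaway debug pod name is arbitrary). It should fail to connect, or return a non-200 status, until the server is actually ready:

kubectl run curl-debug --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://10.4.4.4:8000/v2/health/ready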

patriksabol commented 3 months ago

@whoisj In my second post, those log lines show a single pod at four different points in time. I wanted to show that an IP address was already assigned while the pod was in the init state.

Meanwhile, I have removed the initContainer, and now the IP address is assigned to a running pod:

2024-05-29 15:13:45 vertex-triton-server-74c9fcf77f-q9gmf   0/1     ContainerCreating   0          4m45s   <none>      gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>
2024-05-29 15:15:07 vertex-triton-server-74c9fcf77f-q9gmf   0/1     Running             0          6m7s    10.96.5.4   gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>
2024-05-29 15:15:46 vertex-triton-server-74c9fcf77f-q9gmf   0/1     Running             0          6m46s   10.96.5.4   gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>
2024-05-29 15:15:46 vertex-triton-server-74c9fcf77f-q9gmf   1/1     Running             0          6m46s   10.96.5.4   gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>

I do see the READY condition reported correctly for the pods (i.e., only once READY is 1/1), which I check with this command:

kubectl get pods -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")]}'

However, the "Socket closed" errors still persist during scale-up. Once all pods are in the READY state, there are no "Socket closed" errors.

This is my client code:

# Imports assumed by this snippet; `image` is an HxWxC uint8 numpy array and
# `model_name` a string, both prepared earlier in the script.
from time import strftime

import numpy as np
import tritonclient.grpc as grpcclient

with grpcclient.InferenceServerClient("IP_ADDRESS:8001") as client:  # Note the change to grpcclient and typically a different port
    input0 = grpcclient.InferInput("tile", [1, 3, image.shape[0], image.shape[1]], "UINT8")
    input0.set_data_from_numpy(np.expand_dims(np.moveaxis(image, -1, 0), axis=0))

    # Prepare the model_name input as an array of bytes
    model_name_bytes = np.array([model_name.encode('utf-8')])
    model_name_bytes = np.expand_dims(model_name_bytes, axis=0)
    try:
        input1 = grpcclient.InferInput("model_name", [1, 1], "BYTES")
        input1.set_data_from_numpy(model_name_bytes)

        outputs = [
            grpcclient.InferRequestedOutput("geojson_output")
        ]

        response = client.infer('cartographer_model', [input0, input1], outputs=outputs, model_version="1")

        geojson_result = response.as_numpy("geojson_output").tobytes().decode('utf-8')
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [INFO] Received GeoJSON response')
    except Exception as e:
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [ERROR] {e}')
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [ERROR] Failed to receive GeoJSON response')
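
Because the failures are transient and only occur while new pods are coming up, a client-side retry around the inference call can mask them while the routing issue is investigated. A minimal sketch (the retry count and backoff are arbitrary assumptions, not part of my actual client):

import time
from tritonclient.utils import InferenceServerException

def infer_with_retry(client, *args, retries=3, backoff_s=1.0, **kwargs):
    # Retry transient gRPC failures (e.g. UNAVAILABLE / "Socket closed")
    # that occur while new pods are still coming up behind the Service.
    for attempt in range(retries):
        try:
            return client.infer(*args, **kwargs)
        except InferenceServerException as e:
            if attempt == retries - 1:
                raise
            print(f'[WARN] infer failed ({e}), retrying in {backoff_s}s')
            time.sleep(backoff_s)

# usage:
# response = infer_with_retry(client, 'cartographer_model', [input0, input1],
#                             outputs=outputs, model_version="1")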

This is the service definition:

apiVersion: v1
kind: Service
metadata:
  name: vertex-triton-server-service
  labels:
    app: vertex-triton-server
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: 8000
      name: http
    - port: 8001
      targetPort: 8001
      name: grpc
    - port: 8002
      targetPort: 8002
      name: metrics
  selector:
    app: vertex-triton-server
whoisj commented 3 months ago

In your service definition, I believe targetPort should be the name of the port in the target container.

apiVersion: v1
kind: Service
metadata:
  name: vertex-triton-server-service
  labels:
    app: vertex-triton-server
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: http-triton
      name: http
    - port: 8001
      targetPort: grpc-triton
      name: grpc
    - port: 8002
      targetPort: metrics-triton
      name: metrics
  selector:
    app: vertex-triton-server

By specifying the numeric port number, you could somehow be bypassing the service's selector. I am not 100% sure, but I think it's worth trying the port names instead to see if it resolves the issue or not. Let me know.

patriksabol commented 3 months ago

Unfortunately, that does not help. It seems that referring to ports by name is only a matter of configuration clarity; the behavior is the same.

whoisj commented 3 months ago

As I mentioned, given the error, it appears to be a problem with the service and not with Triton Server.

Perhaps you could check the Triton Server logs to see whether any inference requests are even being sent to the pods in question.
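
For example, tailing the Triton pods behind the Service shows whether the failing requests ever reach the server (the label selector below matches your Deployment); if more detail is needed, tritonserver's --log-verbose flag can also be raised:

# Tail logs from all Triton pods selected by the Service
kubectl logs -l app=vertex-triton-server -c vertex-triton-server -f --prefix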