Open patriksabol opened 4 months ago
I have observed that the issue is that a pod is added to the service endpoints prematurely, while it is still in the initialization phase, specifically once the init container starts. Because the pod is not yet ready to handle requests, clients receive errors.
Logs:
2024-05-29 11:29:00 vertex-triton-server-6d64b9586d-pjfl9 0/1 Pending 0 2m17s <none> gke-vertex-serving-cluster-gpupool-92732217-22df <none> <none>
2024-05-29 11:29:01 vertex-triton-server-6d64b9586d-pjfl9 0/1 Init:0/1 0 2m18s <none> gke-vertex-serving-cluster-gpupool-92732217-22df <none> <none>
2024-05-29 11:29:03 vertex-triton-server-6d64b9586d-pjfl9 0/1 Init:0/1 0 2m20s 10.4.4.4 gke-vertex-serving-cluster-gpupool-92732217-22df <none> <none>
2024-05-29 11:29:10 vertex-triton-server-6d64b9586d-pjfl9 0/1 PodInitializing 0 2m27s 10.4.4.4 gke-vertex-serving-cluster-gpupool-92732217-22df <none> <none>
During scale-up, while the new pod is still in the Init:0/1 state, it is already assigned an IP (10.4.4.4). This results in the client errors mentioned above:
tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
However, the pods do report their READY status correctly, so the issue is probably not related to the readiness probe.
@patriksabol very interesting problem. Pods should not be selected by a service until they're running and have passed their startup and readiness probes.
In your second post it appears that none of the pods are past the init container stage, yet you're seeing their readiness probe succeeding? Is that correct?
Given the specific error tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed, I'd like to see the definition of your service as well. There could be something in it that is leading to the problem.
In the meantime, I'll review the Triton code to see whether we've somehow introduced any timing issues w.r.t. readiness/liveness probes.
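For reference, readiness and liveness probes for Triton usually point at its standard HTTP health endpoints. A minimal sketch of what I would expect (assuming the container serves HTTP on port 8000; the delay/period values below are placeholders, not taken from your deployment):

# Sketch only: probes against Triton's standard health endpoints,
# assuming HTTP is served on containerPort 8000; timings are placeholders.
readinessProbe:
  httpGet:
    path: /v2/health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /v2/health/live
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 10

If your probes look roughly like this, Kubernetes should not add the pod's address to the service's ready endpoints until /v2/health/ready starts returning success.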
@whoisj In my second post, there is only one pod at four different times. I wanted to show that an IP address was assigned during the init state.
Meanwhile, I have removed the initContainer, and now the IP address is assigned to a running pod:
2024-05-29 15:13:45 vertex-triton-server-74c9fcf77f-q9gmf 0/1 ContainerCreating 0 4m45s <none> gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9 <none> <none>
2024-05-29 15:15:07 vertex-triton-server-74c9fcf77f-q9gmf 0/1 Running 0 6m7s 10.96.5.4 gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9 <none> <none>
2024-05-29 15:15:46 vertex-triton-server-74c9fcf77f-q9gmf 0/1 Running 0 6m46s 10.96.5.4 gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9 <none> <none>
2024-05-29 15:15:46 vertex-triton-server-74c9fcf77f-q9gmf 1/1 Running 0 6m46s 10.96.5.4 gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9 <none> <none>
But I am seeing the READY status correctly for the pods, i.e. when READY is 1/1, using this command:
kubectl get pods -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")]}'
However, the issue with the "Socket closed" error still persists during scale-up. Once all pods are in the READY state, there are no "Socket closed" errors.
This is my client code:
import numpy as np
import tritonclient.grpc as grpcclient
from time import strftime

# image (a numpy HWC uint8 array) and model_name are defined earlier in the script
with grpcclient.InferenceServerClient("IP_ADDRESS:8001") as client:  # Note the change to grpcclient and typically a different port
    input0 = grpcclient.InferInput("tile", [1, 3, image.shape[0], image.shape[1]], "UINT8")
    input0.set_data_from_numpy(np.expand_dims(np.moveaxis(image, -1, 0), axis=0))

    # Prepare the model_name input as an array of bytes
    model_name_bytes = np.array([model_name.encode('utf-8')])
    model_name_bytes = np.expand_dims(model_name_bytes, axis=0)

    try:
        input1 = grpcclient.InferInput("model_name", [1, 1], "BYTES")
        input1.set_data_from_numpy(model_name_bytes)
        outputs = [
            grpcclient.InferRequestedOutput("geojson_output")
        ]
        response = client.infer('cartographer_model', [input0, input1], outputs=outputs, model_version="1")
        geojson_result = response.as_numpy("geojson_output").tobytes().decode('utf-8')
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [INFO] Received GeoJSON response')
    except Exception as e:
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [ERROR] {e}')
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [ERROR] Failed to receive GeoJSON response')
This is the service definition:
apiVersion: v1
kind: Service
metadata:
  name: vertex-triton-server-service
  labels:
    app: vertex-triton-server
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: 8000
      name: http
    - port: 8001
      targetPort: 8001
      name: grpc
    - port: 8002
      targetPort: 8002
      name: metrics
  selector:
    app: vertex-triton-server
In your service definition, I believe targetPort should be the name of the port in the target container:
apiVersion: v1
kind: Service
metadata:
  name: vertex-triton-server-service
  labels:
    app: vertex-triton-server
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: http-triton
      name: http
    - port: 8001
      targetPort: grpc-triton
      name: grpc
    - port: 8002
      targetPort: metrics-triton
      name: metrics
  selector:
    app: vertex-triton-server
By specifying the numeric port number, you could somehow be bypassing the service's selector. I am not 100% sure, but I think it's worth trying the port names instead to see whether that resolves the issue. Let me know.
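One thing to note: for named targetPorts to resolve, the Deployment's pod template has to declare matching port names on the Triton container. A rough sketch of that fragment (the names here are assumptions chosen to match the service above, not taken from your actual deployment):

# Fragment of the Deployment's pod template; the port names are assumptions
# that must match the targetPort names used in the Service sketch above.
spec:
  containers:
    - name: triton
      ports:
        - name: http-triton
          containerPort: 8000
        - name: grpc-triton
          containerPort: 8001
        - name: metrics-triton
          containerPort: 8002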
Unfortunately, that does not work. It seems that referencing the ports by name is just for clarity of configuration.
As I mentioned, given the error, it appears to be a problem with the service and not with Triton Server.
Perhaps you could check the Triton Server logs to see whether any inference requests are even being sent to the pods in question.
I have deployed Triton Inference Server to Google Cloud Platform and I am using a Horizontal Pod Autoscaler (HPA). During normal operation, with one or three pods running, the LoadBalancer distributes requests across all pods correctly. However, while scaling up from one to three pods, some requests fail with a "Socket closed" error while the new pods are in the PodInitializing state. This issue likely isn't Triton-specific but rather related to how traffic is routed to pods that aren't fully ready. I am using all of the probes (startup, readiness, and liveness).
I am testing this scenario using a script that sends multiple requests (50) in parallel. Here is my current configuration: