opendatahub-io / caikit-tgis-serving


GRPC endpoint not responding properly after the InferenceService reports as `Loaded` #146

Open · kpouget opened this issue 10 months ago

kpouget commented 10 months ago

As part of my automated scale test, I observe that the InferenceService sometimes reports as `Loaded`, but calls to its gRPC endpoint still fail.

Examples:

<command>
set -o pipefail;
i=0;

GRPCURL_DATA=$(cat "subprojects/llm-load-test/openorca-subset-006.json" | jq .dataset[$i].input )

grpcurl \
    -insecure \
    -d "$GRPCURL_DATA" \
    -H "mm-model-id: flan-t5-small-caikit" \
    u0-m7-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 \
    caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
</command>

<stderr> ERROR:
<stderr>   Code: Unavailable
<stderr>   Message: connections to all backends failing; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused
<command>
set -o pipefail;
set -e;
dest=/mnt/logs/016__watsonx_serving__validate_model_all/u0-m6/answers.json
queries=/mnt/logs/016__watsonx_serving__validate_model_all/u0-m6/questions.json
rm -f "$dest" "$queries"

for i in $(seq 10); do
  GRPCURL_DATA=$(cat "subprojects/llm-load-test/openorca-subset-006.json" | jq .dataset[$i].input )
  echo $GRPCURL_DATA >> "$queries"
  grpcurl \
      -insecure \
      -d "$GRPCURL_DATA" \
      -H "mm-model-id: flan-t5-small-caikit" \
      u0-m6-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 \
      caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict \
      >> "$dest"
  echo "Call $i/10 passed"
done
</command>

<stdout> Call 1/10 passed
<stdout> Call 2/10 passed
<stdout> Call 3/10 passed
<stdout> Call 4/10 passed
<stdout> Call 5/10 passed
<stdout> Call 6/10 passed
<stdout> Call 7/10 passed
<stdout> Call 8/10 passed
<stdout> Call 9/10 passed
<stderr> ERROR:
<stderr>   Code: Unavailable
<stderr>   Message: error reading from server: EOF
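A minimal client-side retry sketch for telling a transient failure right after the model turns `Loaded` apart from a persistent one. The helper name, retry count, and delay below are arbitrary choices of mine; the endpoint, header, and payload are the same as in the examples above.

grpcurl_with_retry() {
  # Retry a single TextGenerationTaskPredict call a few times before giving up,
  # so a short-lived "Unavailable" does not abort the whole scale test.
  local data="$1"
  local max_attempts=5
  local attempt=1
  until grpcurl \
          -insecure \
          -d "$data" \
          -H "mm-model-id: flan-t5-small-caikit" \
          u0-m7-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 \
          caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still failing after $max_attempts attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 5
  done
}

GRPCURL_DATA=$(jq '.dataset[0].input' "subprojects/llm-load-test/openorca-subset-006.json")
grpcurl_with_retry "$GRPCURL_DATA"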

Versions

NAME                          DISPLAY                                          VERSION    REPLACES                                   PHASE
jaeger-operator.v1.47.1-5     Red Hat OpenShift distributed tracing platform   1.47.1-5   jaeger-operator.v1.47.0-2-0.1696814090.p   Succeeded
kiali-operator.v1.65.9        Kiali Operator                                   1.65.9     kiali-operator.v1.65.8                     Succeeded
rhods-operator.2.3.0          Red Hat OpenShift Data Science                   2.3.0      rhods-operator.2.2.0                       Succeeded
serverless-operator.v1.30.1   Red Hat OpenShift Serverless                     1.30.1     serverless-operator.v1.30.0                Succeeded
servicemeshoperator.v2.4.4    Red Hat OpenShift Service Mesh                   2.4.4-0    servicemeshoperator.v2.4.3                 Succeeded
Images:
quay.io/opendatahub/text-generation-inference@sha256:0e3d00961fed95a8f8b12ed7ce50305acbbfe37ee33d37e81ba9e7ed71c73b69
quay.io/opendatahub/caikit-tgis-serving@sha256:ed920d21a4ba24643c725a96b762b114b50f580e6fee198f7ccd0bc73a95a6ab
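For the record, the operator versions above were presumably gathered with `oc get csv`; the image digests can also be read straight from a running predictor pod. A sketch (the namespace is a placeholder):

# Installed operators (ClusterServiceVersions)
oc get csv -A

# Image digests of the containers running in the predictor pods (namespace is a placeholder)
oc get pods -n watsonx-serving-scale-test-u0 \
  -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' | tr ' ' '\n' | sort -u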
kpouget commented 10 months ago

I could work around the issue by increasing the memory limit of the Istio egress/ingress Pods (to 4GB, to be safe):

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: minimal
  namespace: istio-system
spec:
  gateways:
    egress:
      runtime:
        container:
          resources:
            limits:
              memory: 4Gi
    ingress:
      runtime:
        container:
          resources:
            limits:
              memory: 4Gi
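For completeness, a sketch of how that manifest could be applied and verified. The SMCP name `minimal` matches the manifest above (adjust it if your control plane is named differently), and `smcp-memory-patch.yaml` is just a placeholder file name:

# Save the manifest above as smcp-memory-patch.yaml, then merge it into the existing control plane
kubectl patch smcp/minimal -n istio-system --type=merge --patch-file smcp-memory-patch.yaml

# Verify the new limits landed on the gateway deployments
kubectl -n istio-system get deploy istio-ingressgateway istio-egressgateway \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].resources.limits.memory}{"\n"}{end}'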


but this wasn't happening a few weeks ago with RHOAI 2.1.0 and 300 models (that test ran on AWS with 35 nodes, whereas this bug occurred on a single-node OpenShift)


Can this be a regression, or is it somehow expected?

bartoszmajsak commented 9 months ago

@kpouget I am wondering if we can get some insight into these metrics as well.
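One way to snapshot the relevant numbers at test time, as a sketch: it assumes metrics-server is available and that the scale-test namespaces carry the `topsail.scale-test` label used by the tooling.

# Gateway and control-plane resource usage
kubectl top pods -n istio-system

# Size of the Envoy endpoint table the ingress gateway carries
istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l

# Per-container usage in the scale-test namespaces
for ns in $(kubectl get ns -l topsail.scale-test -o name | cut -d'/' -f2); do
  kubectl top pods -n "$ns" --containers
done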

bartoszmajsak commented 9 months ago

> but this wasn't happening a few weeks ago with RHOAI 2.1.0 and 300 models (that test ran on AWS with 35 nodes, whereas this bug occurred on a single-node OpenShift)

@kpouget Was it also running on Istio underneath? If so, how was it configured?

kpouget commented 9 months ago

> @kpouget Was it also running on Istio underneath? If so, how was it configured?

Yes, it was. Istio was configured using these files (pinned to the commit I was using at the time of the test).

bartoszmajsak commented 9 months ago

I managed to cut resource consumption roughly in half. Here's the script you can apply.

In short, this script:

- patches the `data-science-smcp` ServiceMeshControlPlane with explicit resource requests/limits for the gateways and istiod, and sets `PILOT_FILTER_GATEWAY_CLUSTER_CONFIG=true` so the gateways only receive Envoy clusters referenced by their attached routes;
- creates a default `Sidecar` resource in every scale-test namespace, limiting each sidecar's egress configuration to its own namespace and `istio-system`;
- deletes all pods in those namespaces and in `istio-system` so the changes take effect and the Envoy service registries are rebuilt.

#!/bin/bash

cat <<EOF > smcp-patch.yaml 
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:  
  name: data-science-smcp
  namespace: istio-system  
spec:
  gateways:
    egress:
      runtime:
        container:
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1G
    ingress:
      runtime:
        container:
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1G
  runtime:
    components:
      pilot:
        container:
          env:
            PILOT_FILTER_GATEWAY_CLUSTER_CONFIG: "true"
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1024Mi

EOF

trap '{ rm -rf -- smcp-patch.yaml; }' EXIT

kubectl patch smcp/data-science-smcp -n istio-system --type=merge --patch-file smcp-patch.yaml 

namespaces=$(kubectl get ns -ltopsail.scale-test -o name | cut -d'/' -f 2)

# limit sidecarproxy endpoints to its own ns and istio-system
for ns in $namespaces; do
    cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: $ns
spec:
  egress:
  - hosts:
    - "./*"
    - "istio-system/*"
EOF
done

# force changes to take effect
for ns in $namespaces; do
    kubectl delete pods --all -n "${ns}"
done

# force re-creation of all pods with envoy service registry rebuilt
kubectl delete pods --all -n istio-system
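After the mass pod deletion the mesh needs a little while to settle; a sanity check along these lines (deployment names match the `kubectl top` output below) can be run before restarting the load test:

# Wait for the control plane and gateways to come back up
kubectl -n istio-system rollout status deploy/istiod-data-science-smcp --timeout=300s
kubectl -n istio-system rollout status deploy/istio-ingressgateway --timeout=300s
kubectl -n istio-system rollout status deploy/istio-egressgateway --timeout=300s

# Confirm the per-namespace Sidecar resources were created
kubectl get sidecars.networking.istio.io -A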

Initial state

❯ istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l
1052

❯ istioctl proxy-config endpoint $(kubectl get pods -o name -n watsonx-scale-test-u1) -n watsonx-scale-test-u1 | wc -l
1065

❯ kubectl top pods -n istio-system
NAME                                        CPU(cores)   MEMORY(bytes)   
istio-egressgateway-6b7fdb6cb9-lh5jg        100m         2519Mi          
istio-ingressgateway-7dbdc66dd7-nkxxq       91m          2320Mi          
istiod-data-science-smcp-65f4877fff-tndf4   82m          1392Mi 

❯ kubectl top pods -n watsonx-scale-test-u0 --containers
POD                                               NAME                    CPU(cores)   MEMORY(bytes)   
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   POD                     0m           0Mi             
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   istio-proxy             14m          372Mi           
...

After modifications

❯ istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l
1052   # unchanged: the ingress gateway still knows about every endpoint in the mesh

❯ istioctl proxy-config endpoint $(kubectl get pods -o name -n watsonx-scale-test-u1) -n watsonx-scale-test-u1 | wc -l
34

❯ kubectl top pods -n istio-system
NAME                                        CPU(cores)   MEMORY(bytes)   
istio-egressgateway-5778df8594-j869r        83m          444Mi           
istio-ingressgateway-6847d4b974-sk25z       77m          946Mi           
istiod-data-science-smcp-5568884d7d-45zkz   36m          950Mi 

❯ kubectl top pods -n watsonx-scale-test-u0 --containers
POD                                               NAME                    CPU(cores)   MEMORY(bytes)   
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   POD                     0m           0Mi             
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   istio-proxy             6m           136Mi           
...