opendatahub-io / caikit-tgis-serving


GRPC endpoint not responding properly after the InferenceService reports as `Loaded` #146

Open · kpouget opened this issue 10 months ago

kpouget commented 10 months ago

As part of my automated scale test, I observe that the InferenceService sometimes reports as `Loaded`, but calls to its gRPC endpoint still fail.

Examples:

<command>
set -o pipefail;
i=0;

GRPCURL_DATA=$(cat "subprojects/llm-load-test/openorca-subset-006.json" | jq .dataset[$i].input )

grpcurl \
    -insecure \
    -d "$GRPCURL_DATA" \
    -H "mm-model-id: flan-t5-small-caikit" \
    u0-m7-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 \
    caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
</command>

<stderr> ERROR:
<stderr>   Code: Unavailable
<stderr>   Message: connections to all backends failing; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused
<command>
set -o pipefail;
set -e;
dest=/mnt/logs/016__watsonx_serving__validate_model_all/u0-m6/answers.json
queries=/mnt/logs/016__watsonx_serving__validate_model_all/u0-m6/questions.json
rm -f "$dest" "$queries"

for i in $(seq 10); do
  GRPCURL_DATA=$(cat "subprojects/llm-load-test/openorca-subset-006.json" | jq .dataset[$i].input )
  echo $GRPCURL_DATA >> "$queries"
  grpcurl \
      -insecure \
      -d "$GRPCURL_DATA" \
      -H "mm-model-id: flan-t5-small-caikit" \
      u0-m6-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 \
      caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict \
      >> "$dest"
  echo "Call $i/10 passed"
done
</command>

<stdout> Call 1/10 passed
<stdout> Call 2/10 passed
<stdout> Call 3/10 passed
<stdout> Call 4/10 passed
<stdout> Call 5/10 passed
<stdout> Call 6/10 passed
<stdout> Call 7/10 passed
<stdout> Call 8/10 passed
<stdout> Call 9/10 passed
<stderr> ERROR:
<stderr>   Code: Unavailable
<stderr>   Message: error reading from server: EOF
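A minimal client-side retry sketch for telling a transient failure right after the model turns `Loaded` apart from a persistent one. The helper name, retry count, and delay below are arbitrary choices of mine; the endpoint, header, and payload are the same as in the examples above.

grpcurl_with_retry() {
  # Retry a single TextGenerationTaskPredict call a few times before giving up,
  # so a short-lived "Unavailable" does not abort the whole scale test.
  local data="$1"
  local max_attempts=5
  local attempt=1
  until grpcurl \
          -insecure \
          -d "$data" \
          -H "mm-model-id: flan-t5-small-caikit" \
          u0-m7-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 \
          caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still failing after $max_attempts attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 5
  done
}

GRPCURL_DATA=$(jq '.dataset[0].input' "subprojects/llm-load-test/openorca-subset-006.json")
grpcurl_with_retry "$GRPCURL_DATA"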

Versions

NAME                          DISPLAY                                          VERSION    REPLACES                                   PHASE
jaeger-operator.v1.47.1-5     Red Hat OpenShift distributed tracing platform   1.47.1-5   jaeger-operator.v1.47.0-2-0.1696814090.p   Succeeded
kiali-operator.v1.65.9        Kiali Operator                                   1.65.9     kiali-operator.v1.65.8                     Succeeded
rhods-operator.2.3.0          Red Hat OpenShift Data Science                   2.3.0      rhods-operator.2.2.0                       Succeeded
serverless-operator.v1.30.1   Red Hat OpenShift Serverless                     1.30.1     serverless-operator.v1.30.0                Succeeded
servicemeshoperator.v2.4.4    Red Hat OpenShift Service Mesh                   2.4.4-0    servicemeshoperator.v2.4.3                 Succeeded
Images:
quay.io/opendatahub/text-generation-inference@sha256:0e3d00961fed95a8f8b12ed7ce50305acbbfe37ee33d37e81ba9e7ed71c73b69
quay.io/opendatahub/caikit-tgis-serving@sha256:ed920d21a4ba24643c725a96b762b114b50f580e6fee198f7ccd0bc73a95a6ab
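For the record, the operator versions above were presumably gathered with `oc get csv`; the image digests can also be read straight from a running predictor pod. A sketch (the namespace is a placeholder):

# Installed operators (ClusterServiceVersions)
oc get csv -A

# Image digests of the containers running in the predictor pods (namespace is a placeholder)
oc get pods -n watsonx-serving-scale-test-u0 \
  -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' | tr ' ' '\n' | sort -u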
kpouget commented 10 months ago

I could work around the issue by increasing the memory limit of the Istio egress/ingress Pods (to 4GB, to be safe):

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: minimal
  namespace: istio-system
spec:
  gateways:
    egress:
      runtime:
        container:
          resources:
            limits:
              memory: 4Gi
    ingress:
      runtime:
        container:
          resources:
            limits:
              memory: 4Gi
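For completeness, a sketch of how that manifest could be applied and verified. The SMCP name `minimal` matches the manifest above (adjust it if your control plane is named differently), and `smcp-memory-patch.yaml` is just a placeholder file name:

# Save the manifest above as smcp-memory-patch.yaml, then merge it into the existing control plane
kubectl patch smcp/minimal -n istio-system --type=merge --patch-file smcp-memory-patch.yaml

# Verify the new limits landed on the gateway deployments
kubectl -n istio-system get deploy istio-ingressgateway istio-egressgateway \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].resources.limits.memory}{"\n"}{end}'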


but this wasn't happening a few weeks ago with RHOAI 2.1.0 and 300 models (that test ran on AWS with 35 nodes, whereas this bug occurred on a single-node OpenShift)


Can this be a regression, or is it somehow expected?

bartoszmajsak commented 9 months ago

@kpouget I am wondering if we can get some insight into these metrics as well.
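One way to snapshot the relevant numbers at test time, as a sketch: it assumes metrics-server is available and that the scale-test namespaces carry the `topsail.scale-test` label used by the tooling.

# Gateway and control-plane resource usage
kubectl top pods -n istio-system

# Size of the Envoy endpoint table the ingress gateway carries
istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l

# Per-container usage in the scale-test namespaces
for ns in $(kubectl get ns -l topsail.scale-test -o name | cut -d'/' -f2); do
  kubectl top pods -n "$ns" --containers
done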

bartoszmajsak commented 9 months ago

> but this wasn't happening a few weeks ago with RHOAI 2.1.0 and 300 models (that test ran on AWS with 35 nodes, whereas this bug occurred on a single-node OpenShift)

@kpouget Was it also running on Istio underneath? If so, how was it configured?

kpouget commented 9 months ago

> @kpouget Was it also running on Istio underneath? If so, how was it configured?

Yes, it was. Istio was configured using these files (pinned to the commit I was using at the time of the test).

bartoszmajsak commented 9 months ago

I managed to cut resource consumption roughly in half. Here's the script you can apply.

In short, this script:

- patches the `data-science-smcp` ServiceMeshControlPlane with explicit resource requests/limits for the gateways and istiod, and sets `PILOT_FILTER_GATEWAY_CLUSTER_CONFIG=true` so the gateways only receive Envoy clusters referenced by their attached routes;
- creates a default `Sidecar` resource in every scale-test namespace, limiting each sidecar's egress configuration to its own namespace and `istio-system`;
- deletes all pods in those namespaces and in `istio-system` so the changes take effect and the Envoy service registries are rebuilt.

#!/bin/bash

cat <<EOF > smcp-patch.yaml 
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:  
  name: data-science-smcp
  namespace: istio-system  
spec:
  gateways:
    egress:
      runtime:
        container:
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1G
    ingress:
      runtime:
        container:
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1G
  runtime:
    components:
      pilot:
        container:
          env:
            PILOT_FILTER_GATEWAY_CLUSTER_CONFIG: "true"
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1024Mi

EOF

trap '{ rm -rf -- smcp-patch.yaml; }' EXIT

kubectl patch smcp/data-science-smcp -n istio-system --type=merge --patch-file smcp-patch.yaml 

namespaces=$(kubectl get ns -ltopsail.scale-test -o name | cut -d'/' -f 2)

# limit sidecarproxy endpoints to its own ns and istio-system
for ns in $namespaces; do
    cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: $ns
spec:
  egress:
  - hosts:
    - "./*"
    - "istio-system/*"
EOF
done

# force changes to take effect
for ns in $namespaces; do
    kubectl delete pods --all -n "${ns}"
done

# force re-creation of all pods with envoy service registry rebuilt
kubectl delete pods --all -n istio-system
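After the mass pod deletion the mesh needs a little while to settle; a sanity check along these lines (deployment names match the `kubectl top` output below) can be run before restarting the load test:

# Wait for the control plane and gateways to come back up
kubectl -n istio-system rollout status deploy/istiod-data-science-smcp --timeout=300s
kubectl -n istio-system rollout status deploy/istio-ingressgateway --timeout=300s
kubectl -n istio-system rollout status deploy/istio-egressgateway --timeout=300s

# Confirm the per-namespace Sidecar resources were created
kubectl get sidecars.networking.istio.io -A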

Initial state

❯ istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l
1052

❯ istioctl proxy-config endpoint $(kubectl get pods -o name -n watsonx-scale-test-u1) -n watsonx-scale-test-u1 | wc -l
1065

❯ kubectl top pods -n istio-system
NAME                                        CPU(cores)   MEMORY(bytes)   
istio-egressgateway-6b7fdb6cb9-lh5jg        100m         2519Mi          
istio-ingressgateway-7dbdc66dd7-nkxxq       91m          2320Mi          
istiod-data-science-smcp-65f4877fff-tndf4   82m          1392Mi 

❯ kubectl top pods -n watsonx-scale-test-u0 --containers
POD                                               NAME                    CPU(cores)   MEMORY(bytes)   
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   POD                     0m           0Mi             
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   istio-proxy             14m          372Mi           
...

After modifications

❯ istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l
1052   # unchanged: the ingress gateway still knows about every endpoint in the mesh

❯ istioctl proxy-config endpoint $(kubectl get pods -o name -n watsonx-scale-test-u1) -n watsonx-scale-test-u1 | wc -l
34

❯ kubectl top pods -n istio-system
NAME                                        CPU(cores)   MEMORY(bytes)   
istio-egressgateway-5778df8594-j869r        83m          444Mi           
istio-ingressgateway-6847d4b974-sk25z       77m          946Mi           
istiod-data-science-smcp-5568884d7d-45zkz   36m          950Mi 

❯ kubectl top pods -n watsonx-scale-test-u0 --containers
POD                                               NAME                    CPU(cores)   MEMORY(bytes)   
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   POD                     0m           0Mi             
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   istio-proxy             6m           136Mi           
...