heyselbi opened 1 year ago
PR for creation of the gRPC passthrough route: https://github.com/opendatahub-io/odh-model-controller/pull/35/files
We are currently blocked on this issue. I have followed all the instructions in [1] and [2]. While grpcurl connects and inference works as expected, one of the two `rest-proxy` containers is showing connection issues.
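For context, the gRPC test that succeeds looks roughly like the sketch below; the route host, model name, and input tensor are placeholders, and `grpc_predict_v2.proto` is the KServe v2 inference protobuf definition.

```sh
# Hedged sketch of the working grpcurl call against the passthrough route.
# <route-host>, the model name, and the input tensor are placeholders.
grpcurl \
  -insecure \
  -proto grpc_predict_v2.proto \
  -d '{"model_name": "example-model", "inputs": [{"name": "input-0", "shape": [1], "datatype": "FP32", "contents": {"fp32_contents": [0.0]}}]}' \
  <route-host>:443 \
  inference.GRPCInferenceService.ModelInfer
```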
Cluster details:
- OpenShift 4.13.0
- Open Data Hub 1.7.0
- Modelmesh version: v0.11.0-alpha (ref)
- Controller namespace: opendatahub
- User/isvc namespace: modelmesh-serving
Custom ConfigMap:
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: model-serving-config
  namespace: opendatahub
data:
  config.yaml: |
    tls:
      secretName: mm-new
```
Secret `mm-new`:
```yaml
kind: Secret
apiVersion: v1
metadata:
  name: mm-new
  namespace: opendatahub
data:
  tls.crt: <hidden>
  tls.key: <hidden>
type: kubernetes.io/tls
```
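For reference, the key pair in this secret was generated following [2]; below is a minimal sketch of the equivalent commands. The subject and SAN values here are illustrative, not the exact ones used.

```sh
# Illustrative self-signed cert generation per [2]; CN/SAN values are placeholders.
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout tls.key -out tls.crt \
  -subj "/CN=modelmesh-serving" \
  -addext "subjectAltName=DNS:modelmesh-serving.modelmesh-serving.svc"
# Package it as the kubernetes.io/tls secret referenced by the ConfigMap above.
oc create secret tls mm-new --cert=tls.crt --key=tls.key -n opendatahub
```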
`rest-proxy` container with failing logs:
{"level":"info","ts":"2023-07-10T15:31:14Z","msg":"Starting REST Proxy..."}
{"level":"info","ts":"2023-07-10T15:31:14Z","msg":"Using TLS"}
{"level":"info","ts":"2023-07-10T15:31:14Z","msg":"Registering gRPC Inference Service Handler","Host":"localhost","Port":8033,"MaxCallRecvMsgSize":16777216}
{"level":"info","ts":"2023-07-10T15:31:19Z","msg":"Listening on port 8008 with TLS"}
2023/07/10 15:31:23 http: TLS handshake error from <IP1>:50510: read tcp <IP3>:8008-><IP1>:50510: read: connection reset by peer
2023/07/10 15:31:23 http: TLS handshake error from <IP2>:47526: read tcp <IP3>:8008-><IP2>:47526: read: connection reset by peer
2023/07/10 15:31:28 http: TLS handshake error from <IP1>:50518: read tcp <IP3>:8008-><IP1>:50518: read: connection reset by peer
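The error keeps repeating, roughly every five seconds, which looks more like periodic health checking than real client traffic. Two hedged diagnostics (pod and service names follow this deployment; `<IP1>`/`<IP2>` are the redacted addresses from the log above):

```sh
# Find which pods (if any) own the peer IPs seen in the handshake errors.
oc get pods -A -o wide | grep -E '<IP1>|<IP2>'
# Probe the rest-proxy TLS listener directly from inside the cluster
# (assumes the modelmesh-serving Service exposes port 8008 and that the
# chosen image ships an openssl binary).
oc run tls-check --rm -it --image=registry.access.redhat.com/ubi9/ubi \
  -n modelmesh-serving -- openssl s_client -connect modelmesh-serving:8008
```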
The second `rest-proxy` container isn't showing failing logs:
{"level":"info","ts":"2023-07-10T13:20:00Z","msg":"Starting REST Proxy..."}
{"level":"info","ts":"2023-07-10T13:20:00Z","msg":"Using TLS"}
{"level":"info","ts":"2023-07-10T13:20:00Z","msg":"Registering gRPC Inference Service Handler","Host":"localhost","Port":8033,"MaxCallRecvMsgSize":16777216}
{"level":"info","ts":"2023-07-10T13:20:05Z","msg":"Listening on port 8008 with TLS"}
Deployment YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '2'
  namespace: modelmesh-serving
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    modelmesh-service: modelmesh-serving
  name: modelmesh-serving-ovms-1.x
spec:
  replicas: 2
  selector:
    matchLabels:
      modelmesh-service: modelmesh-serving
      name: modelmesh-serving-ovms-1.x
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: modelmesh-controller
        app.kubernetes.io/managed-by: modelmesh-controller
        app.kubernetes.io/name: modelmesh-controller
        modelmesh-service: modelmesh-serving
        name: modelmesh-serving-ovms-1.x
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '2112'
        prometheus.io/scheme: https
        prometheus.io/scrape: 'true'
    spec:
      restartPolicy: Always
      serviceAccountName: modelmesh-serving-sa
      schedulerName: default-scheduler
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
      terminationGracePeriodSeconds: 90
      securityContext: {}
      containers:
        - resources:
            limits:
              cpu: '1'
              memory: 512Mi
            requests:
              cpu: 50m
              memory: 96Mi
          terminationMessagePath: /dev/termination-log
          name: rest-proxy
          env:
            - name: REST_PROXY_LISTEN_PORT
              value: '8008'
            - name: REST_PROXY_GRPC_PORT
              value: '8033'
            - name: REST_PROXY_USE_TLS
              value: 'true'
            - name: REST_PROXY_GRPC_MAX_MSG_SIZE_BYTES
              value: '16777216'
            - name: MM_TLS_KEY_CERT_PATH
              value: /opt/kserve/mmesh/tls/tls.crt
            - name: MM_TLS_PRIVATE_KEY_PATH
              value: /opt/kserve/mmesh/tls/tls.key
          ports:
            - name: http
              containerPort: 8008
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: tls-certs
              readOnly: true
              mountPath: /opt/kserve/mmesh/tls
          terminationMessagePolicy: File
          image: 'quay.io/opendatahub/rest-proxy:v0.10.0'
        - resources:
            limits:
              cpu: 100m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /oauth/healthz
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: oauth-proxy
          livenessProbe:
            httpGet:
              path: /oauth/healthz
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          ports:
            - name: https
              containerPort: 8443
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: proxy-tls
              mountPath: /etc/tls/private
          terminationMessagePolicy: File
          image: >-
            registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46
          args:
            - '--https-address=:8443'
            - '--provider=openshift'
            - '--openshift-service-account="modelmesh-serving-sa"'
            - '--upstream=http://localhost:8008'
            - '--tls-cert=/etc/tls/private/tls.crt'
            - '--tls-key=/etc/tls/private/tls.key'
            - '--cookie-secret=SECRET'
            - >-
              --openshift-delegate-urls={"/": {"namespace": "modelmesh-serving",
              "resource": "services", "verb": "get"}}
            - >-
              --openshift-sar={"namespace": "modelmesh-serving", "resource":
              "services", "verb": "get"}
            - '--skip-auth-regex=''(^/metrics|^/apis/v1beta1/healthz)'''
        - resources:
            limits:
              cpu: '5'
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 1Gi
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              httpGet:
                path: /prestop
                port: 8090
                scheme: HTTP
          name: ovms
          securityContext:
            capabilities:
              drop:
                - ALL
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: models-dir
              mountPath: /models
          terminationMessagePolicy: File
          image: 'quay.io/opendatahub/openvino_model_server:2022.3-release'
          args:
            - '--port=8001'
            - '--rest_port=8888'
            - '--config_path=/models/model_config_list.json'
            - '--file_system_poll_wait_seconds=0'
            - '--grpc_bind_address=127.0.0.1'
            - '--rest_bind_address=127.0.0.1'
        - resources:
            limits:
              cpu: '2'
              memory: 512Mi
            requests:
              cpu: 50m
              memory: 96Mi
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              httpGet:
                path: /prestop
                port: 8090
                scheme: HTTP
          name: ovms-adapter
          command:
            - /opt/app/ovms-adapter
          env:
            - name: ADAPTER_PORT
              value: '8085'
            - name: RUNTIME_PORT
              value: '8888'
            - name: RUNTIME_DATA_ENDPOINT
              value: 'port:8001'
            - name: CONTAINER_MEM_REQ_BYTES
              valueFrom:
                resourceFieldRef:
                  containerName: ovms
                  resource: requests.memory
                  divisor: '0'
            - name: MEM_BUFFER_BYTES
              value: '134217728'
            - name: LOADTIME_TIMEOUT
              value: '90000'
            - name: USE_EMBEDDED_PULLER
              value: 'true'
            - name: RUNTIME_VERSION
              value: 2022.3-release
          securityContext:
            capabilities:
              drop:
                - ALL
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: models-dir
              mountPath: /models
            - name: storage-config
              readOnly: true
              mountPath: /storage-config
          terminationMessagePolicy: File
          image: 'quay.io/opendatahub/modelmesh-runtime-adapter:v0.11.0-alpha'
        - resources:
            limits:
              cpu: '3'
              memory: 448Mi
            requests:
              cpu: 300m
              memory: 448Mi
          readinessProbe:
            httpGet:
              path: /ready
              port: 8089
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /opt/kserve/mmesh/stop.sh
                  - wait
          name: mm
          livenessProbe:
            httpGet:
              path: /live
              port: 8089
              scheme: HTTP
            initialDelaySeconds: 90
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 2
          env:
            - name: MM_SERVICE_NAME
              value: modelmesh-serving
            - name: MM_SVC_GRPC_PORT
              value: '8033'
            - name: WKUBE_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: WKUBE_POD_IPADDR
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: MM_LOCATION
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
            - name: KV_STORE
              value: 'etcd:/opt/kserve/mmesh/etcd/etcd_connection'
            - name: MM_METRICS
              value: 'prometheus:port=2112;scheme=https'
            - name: SHUTDOWN_TIMEOUT_MS
              value: '90000'
            - name: INTERNAL_SERVING_GRPC_PORT
              value: '8001'
            - name: INTERNAL_GRPC_PORT
              value: '8085'
            - name: MM_SVC_GRPC_MAX_MSG_SIZE
              value: '16777216'
            - name: MM_KVSTORE_PREFIX
              value: mm
            - name: MM_DEFAULT_VMODEL_OWNER
              value: ksp
            - name: MM_LABELS
              value: 'mt:openvino_ir,mt:openvino_ir:opset1,pv:grpc-v1,rt:ovms-1.x'
            - name: MM_TYPE_CONSTRAINTS_PATH
              value: /etc/watson/mmesh/config/type_constraints
            - name: MM_DATAPLANE_CONFIG_PATH
              value: /etc/watson/mmesh/config/dataplane_api_config
            - name: MM_TLS_KEY_CERT_PATH
              value: /opt/kserve/mmesh/tls/tls.crt
            - name: MM_TLS_PRIVATE_KEY_PATH
              value: /opt/kserve/mmesh/tls/tls.key
          securityContext:
            capabilities:
              drop:
                - ALL
          ports:
            - name: grpc
              containerPort: 8033
              protocol: TCP
            - name: prometheus
              containerPort: 2112
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: tc-config
              mountPath: /etc/watson/mmesh/config
            - name: etcd-config
              readOnly: true
              mountPath: /opt/kserve/mmesh/etcd
            - name: tls-certs
              readOnly: true
              mountPath: /opt/kserve/mmesh/tls
          terminationMessagePolicy: File
          image: 'quay.io/opendatahub/modelmesh:v0.11.0-alpha'
      serviceAccount: modelmesh-serving-sa
      volumes:
        - name: proxy-tls
          secret:
            secretName: model-serving-proxy-tls
            defaultMode: 420
        - name: models-dir
          emptyDir:
            sizeLimit: 1536Mi
        - name: storage-config
          secret:
            secretName: storage-config
            defaultMode: 420
        - name: tc-config
          configMap:
            name: tc-config
            defaultMode: 420
        - name: etcd-config
          secret:
            secretName: model-serving-etcd
            defaultMode: 420
        - name: tls-certs
          secret:
            secretName: mm-new
            defaultMode: 420
      dnsPolicy: ClusterFirst
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 15%
      maxSurge: 75%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 6
  replicas: 2
  updatedReplicas: 2
  readyReplicas: 2
  availableReplicas: 2
  conditions:
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2023-07-07T17:49:05Z'
      lastTransitionTime: '2023-07-07T16:17:16Z'
      reason: NewReplicaSetAvailable
      message: >-
        ReplicaSet "modelmesh-serving-ovms-1.x-6cdbbbbc79" has successfully
        progressed.
    - type: Available
      status: 'True'
      lastUpdateTime: '2023-07-10T15:41:34Z'
      lastTransitionTime: '2023-07-10T15:41:34Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
```
I have opened the issue upstream in kserve/modelmesh-serving as well: https://github.com/kserve/modelmesh-serving/issues/401
While waiting for a response from upstream, my next steps are:
We would like to enable gRPC inferencing in modelmesh-serving by exposing a route.
Relevant docs:
[1] Exposing a route: https://github.com/kserve/modelmesh-serving/tree/main/docs/configuration#exposing-an-external-endpoint-using-an-openshift-route
[2] Self-signed TLS: https://github.com/kserve/modelmesh-serving/blob/main/docs/configuration/tls.md#generating-tls-certificates-for-devtest-using-openssl
The route created is passthrough. Re-encrypt could be an option too, but I haven't yet been able to get the grpcurl tests to succeed with it.
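For completeness, a one-liner that produces an equivalent passthrough route; the service and port names here are assumptions based on the Deployment above, and the actual manifest is in the PR linked at the top.

```sh
# Hedged equivalent of the passthrough route from the PR; --port refers to the
# "grpc" containerPort (8033) assumed to be exposed by the modelmesh-serving Service.
oc create route passthrough modelmesh-serving-grpc \
  --service=modelmesh-serving --port=grpc -n modelmesh-serving
```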