kserve / modelmesh-serving

Controller for ModelMesh
Apache License 2.0
206 stars 115 forks source link

Quickstart broken: `modelmesh-serving` svc not created in `--namespace-scope-mode`, ISVCs never get to `READY` state #433

Closed ckadner closed 1 year ago

ckadner commented 1 year ago

Describe the bug

The quickstart install instructions no longer work correctly. After deploying a model, the InferenceService does not get into READY state. Inference requests cannot be performed.

To Reproduce

  1. Follow the Quickstart guide the point of deploying a model and wait for the runtime pods to be Running.
$ kubectl get pods

NAME                                              READY   STATUS    RESTARTS   AGE
etcd-558fb445c-ht596                              1/1     Running   0          9h
minio-774d48cf7-nbkd7                             1/1     Running   0          9h
modelmesh-controller-86dd54cc54-xhnvv             1/1     Running   0          9h
modelmesh-serving-mlserver-1.x-647d4f8f9c-h5b49   4/4     Running   0          26m
modelmesh-serving-mlserver-1.x-647d4f8f9c-hv6ht   4/4     Running   0          26m
  1. Wait a few minutes (or hours) and check the status of the newly created InferenceService.
$ kubectl get isvc

NAME                   URL   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
example-sklearn-isvc                                                                               26m
  1. Describe the ISVC, find no events:
$ kubectl describe isvc example-sklearn-isvc

Name:         example-sklearn-isvc
Namespace:    modelmesh-serving
Labels:       <none>
Annotations:  serving.kserve.io/deploymentMode: ModelMesh
              serving.kserve.io/secretKey: localMinIO
API Version:  serving.kserve.io/v1beta1
Kind:         InferenceService
Metadata:
  Creation Timestamp:  2023-09-20T03:21:13Z
  Generation:          2
  Managed Fields:
    API Version:  serving.kserve.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
          f:serving.kserve.io/deploymentMode:
          f:serving.kserve.io/secretKey:
      f:spec:
        .:
        f:predictor:
          .:
          f:model:
            .:
            f:modelFormat:
              .:
              f:name:
            f:storageUri:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2023-09-20T03:46:54Z
  Resource Version:  61292
  UID:               2d2acd78-e0ce-4864-bd3f-54944d02ae85
Spec:
  Predictor:
    Model:
      Model Format:
        Name:       sklearn
      Storage Uri:  s3://modelmesh-example-models/sklearn/mnist-svm.joblib
Events:             <none>
  1. Ignore all that for a minute and go on to create the port-forward:
$ kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8033 -n modelmesh-serving

Error from server (NotFound): services "modelmesh-serving" not found
  1. Check the service list:

$ kubectl get svc

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
etcd                               ClusterIP   172.21.141.59    <none>        2379/TCP   9h
minio                              ClusterIP   172.21.11.102    <none>        9000/TCP   9h
modelmesh-webhook-server-service   ClusterIP   172.21.232.100   <none>        9443/TCP   9h

Expected behavior

Up until release v0.11.0 when following the instructions in the Quickstart guide the inference service was created successfully, port-forward worked and inference requests returned expected response.

Additional context

How do the FVT test cases work? Curiously, when using the FVT install instructions using the --fvt flag (cluster-scoped install), the modelmesh-serving service does get created and the ISVC transitions to READY state okay.

$ kubectl get svc

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
etcd                               ClusterIP   172.21.33.247    <none>        2379/TCP                     2m7s
minio                              ClusterIP   172.21.215.204   <none>        9000/TCP                     2m6s
modelmesh-serving                  ClusterIP   None             <none>        8033/TCP,8008/TCP,2112/TCP   75s
modelmesh-webhook-server-service   ClusterIP   172.21.82.140    <none>        9443/TCP                     104s

Environment (please complete the following information):


UPDATE 2023-09-21:

FYI @Jooho

I went through some of the recent changes and it looks like PR #397 introduced this changed behavior. Could it be that the way your logic changes on how namespaces are reconciled affects how/when/whether the modelmesh-serving service gets created?

W0921   1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: 
  failed to list *v1.Namespace: namespaces is forbidden: 
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list 
      resource "namespaces" in API group "" at the cluster scope

E0921   1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
  Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
      resource "namespaces" in API group "" at the cluster scope
ckadner commented 1 year ago

@rafvasq just pointed out that in cluster-scope mode (FVT) it did work

ckadner commented 1 year ago

@Jooho

I went through some of the recent changes and it looks like PR #397 introduced this changed behavior. Could it be that the way your logic changes on how namespaces are reconciled affects how/when/whether the modelmesh-serving service gets created?

ckadner commented 1 year ago

I tagged some recent commits on main and triggered builds using the GH actions.

```shell git tag -a pr-422 0b39182 -m "PR #422" git push upstream pr-422 ​ git tag -a pr-397 dd7277a -m "PR #397" git push upstream pr-397 ​ git tag -a pr-428 f47cd7b -m "PR #428" git push upstream pr-428 ```

The new pr-xxx images on Dockerhub help to "go back in time" and to deploy older versions of the modelmesh-controller right after a PR was merged.

ckadner commented 1 year ago

Before PR #397

git checkout pr-428

HEAD is now at f47cd7b fix: Support devel version of Kustomize in install script (#428)
# replace newTag for controller image
git diff | grep -E "^\+"

+++ b/config/manager/kustomization.yaml
+    newTag: pr-428
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca

Successfully installed ModelMesh Serving!
kubectl get svc

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
etcd                               ClusterIP   172.21.171.73    <none>        2379/TCP                     2m7s
minio                              ClusterIP   172.21.190.108   <none>        9000/TCP                     2m7s
modelmesh-serving                  ClusterIP   None             <none>        8033/TCP,8008/TCP,2112/TCP   99s
modelmesh-webhook-server-service   ClusterIP   172.21.99.254    <none>        9443/TCP                     110s

After PR #397

git checkout pr-397

HEAD is now at dd7277a fix: Goroutine memory leak (#397)
# replace newTag for controller image
git diff | grep -E "^\+"

+++ b/config/manager/kustomization.yaml
+    newTag: pr-397
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca

Successfully installed ModelMesh Serving!
kubectl get svc

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
etcd                               ClusterIP   172.21.219.47   <none>        2379/TCP   6m37s
minio                              ClusterIP   172.21.177.41   <none>        9000/TCP   6m36s
modelmesh-webhook-server-service   ClusterIP   172.21.249.37   <none>        9443/TCP   6m14s

Missing service modelmesh-serving.

ckadner commented 1 year ago

Some differences in the controller logs, before and after PR #397


Unique log messages before PR #397 (don't happen after PR #397)

D   KubeResolver            Built new resolver                 {"target": {"Scheme":"kube","Authority":"","URL":{"Scheme":"kube","Opaque":"","User":null,"Host":"","Path":"/modelmesh-serving.modelmesh-serving:8033","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}, "name": "modelmesh-serving/modelmesh-serving"}
D   KubeResolver            Failed to update resolver due to bad state, requeuing endpoint reconciliation
D   KubeResolver.Reconcile  Ignoring event for Endpoints with no resolver   {"endpoints": "modelmesh-serving/modelmesh-serving"}
D   ModelMeshEventStream    Etcd config secret changed. Creating a new etcd client and restarting watchers. {"oldSecretName": "", "newSecretName": "model-serving-etcd"}
D   ModelMeshEventStream    ModelMesh Model Event              {"namespace": "modelmesh-serving", "modelId": "", "event": "INITIALIZED"}
D   ModelMeshEventStream    ModelMesh VModel Event             {"namespace": "modelmesh-serving", "vModelId": "", "event": "INITIALIZED"}
I   MMService               Established new MM gRPC connection {"namespace": "modelmesh-serving", "endpoint": "kube:///modelmesh-serving.modelmesh-serving:8033", "TLS": false}
I   ModelMeshEventStream    Initialize Model Event Stream      {"namespace": "modelmesh-serving", "servicePrefix": "modelmesh-serving/mm/modelmesh-serving"}
I   ModelMeshEventStream    EtcdRangeWatcher starting          {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/registry/"}
I   ModelMeshEventStream    EtcdRangeWatcher starting          {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/vmodels/"}
I   controllers.Service     Updated Kube Service               {"namespace": "modelmesh-serving", "name": "modelmesh-serving", }

Unique log messages after PR #397 (did not happen before PR #397)

E0921   1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
  Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
      resource "namespaces" in API group "" at the cluster scope

W0921   1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
  failed to list *v1.Namespace: namespaces is forbidden:
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
      resource "namespaces" in API group "" at the cluster scope

Different log messages (changed after PR `#397`) - before PR `#397`: ``` I0921 1 request.go:690] Waited for 1s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/coordination.k8s.io/v1?timeout=32s ``` - after PR `#397`: ``` I0921 1 request.go:690] Waited for 1s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1?timeout=32s ```
ckadner commented 1 year ago

FYI @Jooho @tjohnson31415

ckadner commented 1 year ago

We should create a separate FVT workflow to test the namespace-scope mode -- to run in parallel to the existing cluster-scope FVT