Closed ckadner closed 1 year ago
@rafvasq just pointed out that in cluster-scope mode (FVT) it did work
@Jooho
I went through some of the recent changes and it looks like PR #397 introduced this changed behavior. Could it be that your changes to how namespaces are reconciled affect how/when/whether the modelmesh-serving service gets created?
I tagged some recent commits on main and triggered builds using GitHub Actions. The new pr-xxx images on Docker Hub help to "go back in time" and deploy the modelmesh-controller as it was right after a given PR was merged.
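To pin the controller image to one of these pr-xxx tags before running the install script, something like the following could work. It is a minimal sketch: the helper name set_controller_tag is hypothetical, and it assumes the tag is set through a single newTag: line in config/manager/kustomization.yaml, as the diffs below suggest.

```shell
# Hypothetical helper: rewrite the "newTag:" value in a kustomization file.
# Usage: set_controller_tag <tag> [file]
# Assumes the tag contains no "/" or "&" characters (pr-428 style tags are fine).
set_controller_tag() {
  tag="$1"
  file="${2:-config/manager/kustomization.yaml}"
  tmp="$(mktemp)"
  # Keep the original indentation and replace only the value after "newTag:".
  sed -E "s/^([[:space:]]*newTag:).*/\1 ${tag}/" "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For example, `git checkout pr-428 && set_controller_tag pr-428` would leave the working tree with the same one-line diff shown in the steps below.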
Before PR #397
git checkout pr-428
HEAD is now at f47cd7b fix: Support devel version of Kustomize in install script (#428)
# replace newTag for controller image
git diff | grep -E "^\+"
+++ b/config/manager/kustomization.yaml
+ newTag: pr-428
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca
Successfully installed ModelMesh Serving!
kubectl get svc
NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
etcd                               ClusterIP   172.21.171.73    <none>        2379/TCP                     2m7s
minio                              ClusterIP   172.21.190.108   <none>        9000/TCP                     2m7s
modelmesh-serving                  ClusterIP   None             <none>        8033/TCP,8008/TCP,2112/TCP   99s
modelmesh-webhook-server-service   ClusterIP   172.21.99.254    <none>        9443/TCP                     110s
After PR #397
git checkout pr-397
HEAD is now at dd7277a fix: Goroutine memory leak (#397)
# replace newTag for controller image
git diff | grep -E "^\+"
+++ b/config/manager/kustomization.yaml
+ newTag: pr-397
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca
Successfully installed ModelMesh Serving!
kubectl get svc
NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
etcd                               ClusterIP   172.21.219.47    <none>        2379/TCP   6m37s
minio                              ClusterIP   172.21.177.41    <none>        9000/TCP   6m36s
modelmesh-webhook-server-service   ClusterIP   172.21.249.37    <none>        9443/TCP   6m14s
Missing service modelmesh-serving.
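The check above was done by eye. A small helper could make it scriptable when comparing several pr-xxx builds. This is only a sketch: has_service is a hypothetical name, and it assumes the default `kubectl get svc` table layout with the service name in the first column.

```shell
# Hypothetical helper: succeed if the named service appears in
# `kubectl get svc` table output read from stdin.
# Usage: kubectl get svc | has_service <name>
has_service() {
  # Skip the header row (NR > 1) and compare the first column exactly.
  awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}
```

For example: `kubectl get svc -n modelmesh-serving | has_service modelmesh-serving || echo "modelmesh-serving service is missing"`.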
Some differences in the controller logs, before and after PR #397 (D = DEBUG, I = INFO):
Before PR #397 (these no longer appear after PR #397):
D KubeResolver Built new resolver {"target": {"Scheme":"kube","Authority":"","URL":{"Scheme":"kube","Opaque":"","User":null,"Host":"","Path":"/modelmesh-serving.modelmesh-serving:8033","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}, "name": "modelmesh-serving/modelmesh-serving"}
D KubeResolver Failed to update resolver due to bad state, requeuing endpoint reconciliation
D KubeResolver.Reconcile Ignoring event for Endpoints with no resolver {"endpoints": "modelmesh-serving/modelmesh-serving"}
D ModelMeshEventStream Etcd config secret changed. Creating a new etcd client and restarting watchers. {"oldSecretName": "", "newSecretName": "model-serving-etcd"}
D ModelMeshEventStream ModelMesh Model Event {"namespace": "modelmesh-serving", "modelId": "", "event": "INITIALIZED"}
D ModelMeshEventStream ModelMesh VModel Event {"namespace": "modelmesh-serving", "vModelId": "", "event": "INITIALIZED"}
I MMService Established new MM gRPC connection {"namespace": "modelmesh-serving", "endpoint": "kube:///modelmesh-serving.modelmesh-serving:8033", "TLS": false}
I ModelMeshEventStream Initialize Model Event Stream {"namespace": "modelmesh-serving", "servicePrefix": "modelmesh-serving/mm/modelmesh-serving"}
I ModelMeshEventStream EtcdRangeWatcher starting {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/registry/"}
I ModelMeshEventStream EtcdRangeWatcher starting {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/vmodels/"}
I controllers.Service Updated Kube Service {"namespace": "modelmesh-serving", "name": "modelmesh-serving", }
After PR #397 (these did not appear before PR #397):
E0921 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
resource "namespaces" in API group "" at the cluster scope
W0921 1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169:
failed to list *v1.Namespace: namespaces is forbidden:
User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
resource "namespaces" in API group "" at the cluster scope
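One way to confirm whether a given controller build hits this RBAC error is to count the matching lines in its logs. The helper below is a rough sketch: the function name is hypothetical, and it matches on the "namespaces is forbidden" text seen in the reflector messages above.

```shell
# Hypothetical helper: count log lines on stdin that report the
# forbidden namespace list/watch error.
# Usage: kubectl logs <controller-pod> | count_forbidden_namespace_errors
count_forbidden_namespace_errors() {
  # grep -c prints the count; "|| true" keeps a zero count from
  # being treated as a failure.
  grep -c 'namespaces is forbidden' || true
}
```

The permission itself can also be checked directly with `kubectl auth can-i list namespaces --as=system:serviceaccount:modelmesh-serving:modelmesh-controller`, which should answer "no" in a namespace-scoped install if the controller's service account has no cluster-level access to namespaces.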
FYI @Jooho @tjohnson31415
We should create a separate FVT workflow to test the namespace-scope mode -- to run in parallel to the existing cluster-scope FVT
Describe the bug
The quickstart install instructions no longer work correctly. After deploying a model, the InferenceService does not get into the READY state, so inference requests cannot be performed.
To Reproduce
Running
.InferenceService
.
Expected behavior
Up until release v0.11.0, following the instructions in the Quickstart guide created the inference service successfully, port-forwarding worked, and inference requests returned the expected response.
Additional context
How do the FVT test cases work? Curiously, when using the FVT install instructions with the --fvt flag (cluster-scoped install), the modelmesh-serving service does get created and the ISVC transitions to the READY state okay.
Environment (please complete the following information):
main, Kubernetes 1.25, 1.26, 1.27
UPDATE 2023-09-21:
FYI @Jooho
I went through some of the recent changes and it looks like PR #397 introduced this changed behavior. Could it be that your changes to how namespaces are reconciled affect how/when/whether the modelmesh-serving service gets created?