kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

KServe and cert-manager webhooks are failing #2660

biswajit-9776 closed this issue 4 months ago

biswajit-9776 commented 6 months ago

While installing Kubeflow using the command:

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Some webhooks could not be reached:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
[biswa@fedora manifests]$ sudo kubectl get endpoints -n cert-manager cert-manager-webhook
NAME                   ENDPOINTS          AGE
cert-manager-webhook   10.244.0.8:10250   108m

The KServe webhook issue was previously encountered in #2553. Should the changes made in #2627 have prevented this error from recurring? As for the cert-manager webhook, #2585 reported "no route to host", whereas mine fails with "connection refused". It could be a Kubernetes-level issue or a deeper networking-stack issue, as described in https://cert-manager.io/docs/troubleshooting/webhook/#cause-2-eks-on-a-custom-cni
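
Both errors can also appear when resources are applied before cert-manager is fully up: the webhook Service refuses connections until its pod is serving, and a webhook's CA bundle can be empty (hence not parseable as PEM) until the cainjector has populated it. That is only one possible cause, not a confirmed diagnosis for this issue. A minimal sketch of waiting for cert-manager before re-applying, assuming the Deployment names match the pod names reported in this issue; `wait_for_cert_manager` is a hypothetical helper name:

```shell
# Hypothetical helper: block until the cert-manager Deployments report Available.
# Deployment names are an assumption, inferred from the pod names in this issue.
wait_for_cert_manager() {
  timeout="${1:-180s}"
  for deploy in cert-manager cert-manager-cainjector cert-manager-webhook; do
    kubectl wait --namespace cert-manager \
      --for=condition=Available "deployment/${deploy}" \
      --timeout="${timeout}" || return 1
  done
}

# Usage, once a first apply pass has created the cert-manager resources
# (assumes kubectl points at the target cluster):
#   wait_for_cert_manager 300s && kustomize build example | kubectl apply -f -
```

The function stops at the first Deployment that does not become Available within the timeout, so a non-zero exit status points at cert-manager itself rather than at the components that depend on it.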

kustomize version:

v5.3.0

My kubectl pods are:

[biswa@fedora manifests]$ sudo kubectl get pods -A
NAMESPACE            NAME                                                              READY   STATUS              RESTARTS       AGE
auth                 dex-5d8fffb998-qq49q                                              1/1     Running             0              94m
cert-manager         cert-manager-5b8f9b9d96-l7vj7                                     1/1     Running             0              94m
cert-manager         cert-manager-cainjector-54f68bfb64-m6x5f                          1/1     Running             0              94m
cert-manager         cert-manager-webhook-f6c8487d6-9x6x4                              1/1     Running             0              94m
istio-system         cluster-local-gateway-7bd9cffcb5-thdkb                            1/1     Running             0              94m
istio-system         configure-kubernetes-oidc-issuer-jwks-in-requestauthenticasxnfl   0/1     Completed           0              94m
istio-system         istio-ingressgateway-666f789ccb-wcqdc                             1/1     Running             0              94m
istio-system         istiod-6cd8c6c59c-htqzn                                           1/1     Running             0              94m
knative-eventing     eventing-controller-688dc8df9f-9fxpp                              1/1     Running             0              94m
knative-eventing     eventing-webhook-8c6cc5bc7-789xh                                  1/1     Running             0              94m
knative-serving      activator-55cd894f6c-dr9q4                                        1/1     Running             8 (36m ago)    94m
knative-serving      autoscaler-76748895b9-shk8t                                       2/2     Running             0              56m
knative-serving      controller-76dcf67d5-7tb5w                                        2/2     Running             0              56m
knative-serving      domain-mapping-f5d4dbc56-pbz5q                                    2/2     Running             0              56m
knative-serving      domainmapping-webhook-6f67684cd8-nlnsf                            2/2     Running             0              55m
knative-serving      net-istio-controller-7bb6fb5f58-tklxs                             2/2     Running             0              55m
knative-serving      net-istio-webhook-7d8476f6-svcjf                                  2/2     Running             0              55m
knative-serving      webhook-d5cbdf855-bzmsx                                           2/2     Running             0              55m
kube-system          coredns-565d847f94-cd9dp                                          1/1     Running             0              96m
kube-system          coredns-565d847f94-lc62z                                          1/1     Running             0              96m
kube-system          etcd-kubeflow-control-plane                                       1/1     Running             0              96m
kube-system          kindnet-qzthr                                                     1/1     Running             0              96m
kube-system          kube-apiserver-kubeflow-control-plane                             1/1     Running             0              96m
kube-system          kube-controller-manager-kubeflow-control-plane                    1/1     Running             0              96m
kube-system          kube-proxy-9zct2                                                  1/1     Running             0              96m
kube-system          kube-scheduler-kubeflow-control-plane                             1/1     Running             0              96m
kubeflow             admission-webhook-deployment-6cf44ffbdb-5m86s                     0/1     ContainerCreating   0              55m
kubeflow             cache-server-7d94c87787-88m4h                                     0/2     Init:0/1            0              55m
kubeflow             centraldashboard-965564b75-6frpk                                  2/2     Running             0              55m
kubeflow             jupyter-web-app-deployment-757976b798-7ngdb                       0/2     Pending             0              55m
kubeflow             katib-controller-64bf8db8bd-nfn2k                                 0/1     ContainerCreating   0              55m
kubeflow             katib-db-manager-6d6885765-tqldd                                  1/1     Running             7 (40m ago)    55m
kubeflow             katib-mysql-db6dc68c-q7hbt                                        1/1     Running             0              55m
kubeflow             katib-ui-64b8f8d78c-vxttm                                         2/2     Running             0              55m
kubeflow             kserve-controller-manager-6df96f6d7c-wwxct                        0/2     ContainerCreating   0              55m
kubeflow             kserve-models-web-app-99849d9f7-rmfhk                             2/2     Running             0              55m
kubeflow             kubeflow-pipelines-profile-controller-59ccbd47b9-7875s            1/1     Running             0              55m
kubeflow             metacontroller-0                                                  1/1     Running             0              94m
kubeflow             metadata-envoy-deployment-5cbbb86fc9-pwpbw                        1/1     Running             0              55m
kubeflow             metadata-grpc-deployment-784b8b5fb4-rqw94                         1/2     CrashLoopBackOff    10 (49s ago)   55m
kubeflow             metadata-writer-844bd5d486-nm2j6                                  2/2     Running             4 (69s ago)    55m
kubeflow             minio-65dff76b66-brflp                                            0/2     Pending             0              55m
kubeflow             ml-pipeline-6c7c86f666-qbs65                                      0/2     PodInitializing     0              55m
kubeflow             ml-pipeline-persistenceagent-85c485f86f-j8qwx                     0/2     PodInitializing     0              55m
kubeflow             ml-pipeline-scheduledworkflow-6448c96f4f-98997                    0/2     PodInitializing     0              55m
kubeflow             ml-pipeline-ui-6db56c647b-b6ksz                                   0/2     Pending             0              55m
kubeflow             ml-pipeline-viewer-crd-5df88b6956-kpt68                           0/2     Pending             0              55m
kubeflow             ml-pipeline-visualizationserver-6d49897f85-p9msj                  0/2     Pending             0              55m
kubeflow             mysql-c999c6c8-phg5s                                              0/2     Pending             0              55m
kubeflow             notebook-controller-deployment-9ffdf65d7-bsn6b                    0/2     PodInitializing     0              55m
kubeflow             profiles-deployment-cbf679dbd-qwskr                               0/3     PodInitializing     0              55m
kubeflow             pvcviewer-controller-manager-d66667b49-mhn4n                      0/3     Pending             0              55m
kubeflow             tensorboard-controller-deployment-7444dc8fcd-gxvfr                0/3     Pending             0              55m
kubeflow             tensorboards-web-app-deployment-78f7c694bf-tp8z9                  0/2     Pending             0              55m
kubeflow             training-operator-69575765df-v9hl4                                1/1     Running             0              55m
kubeflow             volumes-web-app-deployment-6dfccd897d-xklf7                       0/2     Pending             0              55m
kubeflow             workflow-controller-f65c9d9b4-m4f9k                               0/2     PodInitializing     0              55m
local-path-storage   local-path-provisioner-684f458cdd-nvs75                           1/1     Running             0              96m
oauth2-proxy         oauth2-proxy-58d95869bf-5n6l5                                     1/1     Running             0              94m
oauth2-proxy         oauth2-proxy-58d95869bf-684pn                                     1/1     Running             0              94m
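
A listing this long is easier to read when reduced to the pods that are not fully ready. A small sketch that filters `kubectl get pods -A` output, assuming the default column order shown above (NAMESPACE, NAME, READY, STATUS); `not_ready_pods` is a hypothetical helper name:

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# pods whose READY count is incomplete, skipping Completed jobs.
not_ready_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")                      # READY column, e.g. "1/2"
    if (ready[1] != ready[2] && $4 != "Completed")
      print $1 "/" $2, $3, $4                  # namespace/name READY STATUS
  }'
}

# Usage (assumes kubectl points at the target cluster):
#   kubectl get pods -A | not_ready_pods
```

Applied to the listing above, this would leave only the kubeflow-namespace pods stuck in states such as Pending, ContainerCreating, PodInitializing, Init:0/1, and CrashLoopBackOff.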

juliusvonkohout commented 6 months ago

Can you try with the master branch as well? Please also check whether your install command is up to date in the master branch readme.md and follow the installation instructions with Kind as close as possible.

dnapier commented 6 months ago

I was able to resolve this by increasing the resources allocated to the machine. Was getting capped out by CPU, maybe you're facing similar?

biswajit-9776 commented 6 months ago

Can you try with the master branch as well? Please also check whether your install command is up to date in the master branch readme.md and follow the installation instructions with Kind as close as possible.

Hey @juliusvonkohout, yes my local machine's master branch is up to date.

biswajit-9776 commented 6 months ago

@dnapier Hi, I tried to increase CPU resources in the --kubeconfig file but it says there is no resources field in v1alpha4.Node. Could you please tell me what you tried?

dnapier commented 6 months ago

When I ran kubectl describe nodes, the cpu resources were maxed out. This was being done in a VM, so I simply added more cores to the machine. If you're doing the same and the core speeds are being limited by the host, you could raise the limit as well, but that was not the case for me.

I encountered another issue following this which was the activator of knative-serving crashing, but I do not believe that is related to the error you're seeing here.
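
The `kubectl describe nodes` check mentioned above can be narrowed to the CPU request percentage. A rough sketch, assuming the usual "Allocated resources" table layout in the describe output, where the `cpu` row looks like `cpu  7950m (99%)  ...` (that sample value is illustrative, not taken from this thread); `cpu_request_percent` is a hypothetical helper name:

```shell
# Hypothetical helper: read `kubectl describe nodes` output on stdin and print
# the CPU request percentage from each node's "Allocated resources" table.
cpu_request_percent() {
  awk '
    /^Allocated resources:/ { in_block = 1; next }
    in_block && $1 == "cpu" {
      pct = $3                 # third field looks like "(99%)"
      gsub(/[()%]/, "", pct)   # strip the parentheses and percent sign
      print pct
      in_block = 0
    }
  '
}

# Usage (assumes kubectl points at the target cluster):
#   kubectl describe nodes | cpu_request_percent
```

A value close to 100 means the scheduler has little CPU left to place new pods, which is consistent with pods staying in Pending.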

juliusvonkohout commented 5 months ago

@dnapier Hi, I tried to increase CPU resources in the --kubeconfig file but it says there is no resources field in v1alpha4.Node. Could you please tell me what you tried?

CC @diegolovison then

diegolovison commented 5 months ago

Are you using kind with Docker?

ALPHA-1503 commented 5 months ago

Hello guys, I'm facing the same issue. I have to deploy Kubeflow for an internship project and I have the same problem with Kubeflow v1.8.

kustomize version: v5.3.0
cert-manager version: v0.12.1

After running:

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

I get this error:

Screenshot 2024-04-09 151931

My Kubernetes cluster is running with Tanzu.

juliusvonkohout commented 5 months ago

Please just test with Kind as explained in the readme.md in the master branch, to make sure that it is not a Kubernetes issue of your own cluster.

dnapier commented 5 months ago

Are you using kind with Docker?

Sorry, I didn't catch that this was addressed to me. Yes in my case, I am using kind with docker. Debian 12 host.

diegolovison commented 5 months ago

How much CPU and memory do you have available? Were you strictly following https://github.com/kubeflow/manifests/#installation ?

dnapier commented 5 months ago

12GB of memory on the system, 8 core processor (Intel(R) Xeon(R) E5-2620).

And yes I was strictly following the installation instructions.

ALPHA-1503 commented 5 months ago

Please just test with Kind as explained in the readme.md in the master branch, to make sure that it is not a Kubernetes issue of your own cluster.

I already tested the v1.8 on minikube and I'm facing the same issue...

diegolovison commented 5 months ago

12GB of memory on the system, 8 core processor (Intel(R) Xeon(R) E5-2620).

I believe you will need more resources. I have 20 cores and 36 GB of memory.

minikube and I'm facing the same issue...

I wasn't able to make it work on Minikube. Only with kind

ALPHA-1503 commented 5 months ago

I've just attempted to install it using a local kind cluster, but it didn't work. I'm encountering another issue... [image: issue-kind-kf]

dnapier commented 5 months ago

I've just attempted to install it using a local kind cluster, but it didn't work. I'm encountering another issue... [image: issue-kind-kf]

That's the exact issue I'm facing, which @diegolovison suggests is caused by a lack of available resources. I'm working on doubling my memory to 24 GB to test whether that resolves it. Will update asap.

ALPHA-1503 commented 5 months ago

Interesting.... I managed to install v1.8 on Minikube just now. I'm curious why it's working now. My suspicion is that I might encounter issues installing it on my Tanzu Cluster, perhaps due to a cluster-related problem.

dnapier commented 5 months ago

Interesting.... I managed to install v1.8 on Minikube just now. I'm curious why it's working now. My suspicion is that I might encounter issues installing it on my Tanzu Cluster, perhaps due to a cluster-related problem.

Do you mind sharing your cpu/memory for comparison?

ALPHA-1503 commented 5 months ago

8 Cores/16G

juliusvonkohout commented 5 months ago

Minikube with Podman worked for me with 16 GB if you strip the example distribution down a bit. Otherwise you might need 32 GB. @diegolovison, we should add the memory and core requirements at the top of the installation instructions for kind.

diegolovison commented 5 months ago

Do you believe that 32 GB and 20 cores?

juliusvonkohout commented 5 months ago

Do you believe that 32 GB and 20 cores?

I do not understand your question.

diegolovison commented 5 months ago

Should we document that 32 GB of RAM and 20 CPU cores are the minimum to install Kubeflow locally?

dnapier commented 5 months ago

Should we document that 32 GB of RAM and 20 CPU cores are the minimum to install Kubeflow locally?

Not that I have a say here, but I think that's a great idea.

juliusvonkohout commented 5 months ago

I would go with 16 cores and 32 GB of memory as the recommendation. Or are you sure that 16 cores are not enough? It is possible to do it with far less, but that is then left up to the end user.
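
The recommendation above can be turned into a quick pre-flight check. A minimal sketch, treating 16 cores and 32 GB as a recommendation rather than a hard minimum, as stated in this thread; `meets_recommendation` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed when the given core count and memory (in GB)
# meet the recommendation from this thread (16 cores, 32 GB).
meets_recommendation() {
  cores="$1"
  mem_gb="$2"
  [ "$cores" -ge 16 ] && [ "$mem_gb" -ge 32 ]
}

# Figures reported by participants in this thread:
meets_recommendation 8 12  || echo "8 cores / 12 GB: below the recommendation"
meets_recommendation 20 36 && echo "20 cores / 36 GB: meets the recommendation"
```

On Linux the core count could be taken from `nproc`; memory is left as an explicit argument here because the total reported by the kernel is slightly lower than the installed amount.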

diegolovison commented 5 months ago

Ok. Sounds good

juliusvonkohout commented 4 months ago

@biswajit-9776 Please retry with the latest master branch and readme. If you still encounter problems, please open a new issue with our new template.