Applying cluster.yaml on v1.13.8: failed calling webhook "": connect: connection refused #14116

Closed maon-fp closed 2 weeks ago

maon-fp commented 3 weeks ago

I've upgraded Rook from v1.10.11 to v1.13.8 step by step (v1.10.11 -> v1.11.11 -> v1.12.11 -> v1.13.8). I've read that the admission controller is gone (it was enabled in my setup via ROOK_DISABLE_ADMISSION_CONTROLLER: "false"), so I changed this to ROOK_DISABLE_ADMISSION_CONTROLLER: "true" while still running v1.12.11. The upgrade to v1.13.8 went smoothly. Now I want to upgrade to Reef and am trying to apply the cluster.yaml, but this gives me:

rook $ kaf 04-cluster-prod.yaml
Error from server (InternalError): error when applying patch:
Resource: ", Resource=cephclusters", GroupVersionKind: ", Kind=CephCluster"
Name: "rook-ceph", Namespace: "rook-ceph"
for: "04-cluster-prod.yaml": error when patching "04-cluster-prod.yaml": Internal error occurred: failed calling webhook "": failed to call webhook: Post "https://rook-ceph-admission-controller.rook-ceph.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s": dial tcp connect: connection refused
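The URL in the error already names the Service the API server is trying to reach, in the form https://<service>.<namespace>.svc. As a minimal sketch (the function name webhook_service_from_error is made up for illustration), the namespace/service pair can be pulled out of a pasted error like this:

```shell
# Hypothetical helper: read a kubectl error on stdin and print
# "<namespace>/<service>" for the webhook URL it mentions.
# Assumes the URL has the in-cluster form https://<service>.<namespace>.svc...
webhook_service_from_error() {
  sed -n 's#.*Post "https://\([^./]*\)\.\([^./]*\)\.svc[:/].*#\2/\1#p'
}
```

Usage: pipe the error text into the function, e.g. `kubectl apply -f 04-cluster-prod.yaml 2>&1 | webhook_service_from_error`. Lines without such a URL produce no output.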


maon-fp commented 3 weeks ago

My cluster.yaml: 04-cluster-prod.txt

maon-fp commented 3 weeks ago

Operator shows no errors or warnings.

travisn commented 3 weeks ago

@subhamkrai What are the steps to manually disable the admission controller? I can't seem to find it from previous issues.

subhamkrai commented 3 weeks ago

maon-fp commented 3 weeks ago

Thank you for your replies.

Are those supposed to be pods? I don't have any of those. I'm currently at v1.13.8: there is no ROOK_DISABLE_ADMISSION_CONTROLLER anymore. How can I set it to true now?

subhamkrai commented 3 weeks ago

It was there until 1.12; we removed it in 1.13.

The validating webhook rook-ceph-webhook is not a pod, it is a Kubernetes resource. Try kubectl get validatingwebhookconfigurations rook-ceph-webhook

maon-fp commented 3 weeks ago

@subhamkrai Thank you for pointing me in the right direction. I can see those resources:

$ kubectl api-resources --verbs=list -n rook-ceph | grep hook
mutatingwebhookconfigurations                false        MutatingWebhookConfiguration
validatingwebhookconfigurations              false        ValidatingWebhookConfiguration
$ kubectl api-resources --verbs=list -n rook-ceph | grep val
validatingwebhookconfigurations              false        ValidatingWebhookConfiguration

So none of the ones you mentioned, right?

"it was there till 1.12 and in 1.13 we removed it."

So there is no chance to set it to true now?

subhamkrai commented 3 weeks ago

@maon-fp could you also share the svc list in the rook-ceph namespace?

subhamkrai commented 3 weeks ago

Also, could you share the first 10 lines of the Rook operator pod's logs?

maon-fp commented 3 weeks ago

Yes, of course.

List of services:

$ kgs
NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
csi-rbdplugin-metrics            ClusterIP    <none>        8080/TCP,8081/TCP   3y104d
rook-ceph-admission-controller   ClusterIP    <none>        443/TCP             2y2d
rook-ceph-mgr                    ClusterIP    <none>        9283/TCP            3y104d
rook-ceph-mgr-dashboard          ClusterIP   <none>        8443/TCP            3y104d
rook-ceph-mon-a                  ClusterIP    <none>        6789/TCP,3300/TCP   3y104d
rook-ceph-mon-c                  ClusterIP   <none>        6789/TCP,3300/TCP   3y104d
rook-ceph-mon-d                  ClusterIP    <none>        6789/TCP,3300/TCP   3y104d

First lines of operator log:

$ kl rook-ceph-operator-9f688fcc5-v2q6j | head -n 10
2024/04/23 14:00:19 maxprocs: Leaving GOMAXPROCS=24: CPU quota undefined
2024-04-23 14:00:19.215493 I | rookcmd: starting Rook v1.13.8 with arguments '/usr/local/bin/rook ceph operator'
2024-04-23 14:00:19.215514 I | rookcmd: flag values: --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-level=INFO
2024-04-23 14:00:19.215519 I | cephcmd: starting Rook-Ceph operator
2024-04-23 14:00:19.322061 I | cephcmd: base ceph version inside the rook operator image is "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)"
2024-04-23 14:00:19.332548 I | op-k8sutil: ROOK_CURRENT_NAMESPACE_ONLY="false" (env var)
2024-04-23 14:00:19.332558 I | operator: watching all namespaces for Ceph CRs
2024-04-23 14:00:19.332604 I | operator: setting up schemes
2024-04-23 14:00:19.335083 I | operator: setting up the controller-runtime manager
2024-04-23 14:00:19.335422 I | ceph-cluster-controller: successfully started

subhamkrai commented 3 weeks ago

The logs didn't help much, but yes: delete the following resources, probably in the rook-ceph namespace:

Certificate "rook-admission-controller-cert"
Issuer "selfsigned-issuer"
Service "rook-ceph-admission-controller"

Also, could you share the -o yaml output of the certificate and issuer mentioned above, to make sure you are deleting the right resources? But yes, we need to clean up the three resources above.
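For reference, a minimal sketch of that review-then-delete step. The resource names are the ones listed above. The function names, the rook-ceph default namespace, and the fully qualified cert-manager resource types are assumptions, and the sketch assumes nothing else in the namespace still uses selfsigned-issuer.

```shell
# The three leftovers named in this thread, as type/name pairs.
admission_leftovers() {
  printf '%s\n' \
    'certificates.cert-manager.io/rook-admission-controller-cert' \
    'issuers.cert-manager.io/selfsigned-issuer' \
    'service/rook-ceph-admission-controller'
}

# Print each leftover as YAML for review; objects that are already gone are skipped.
show_admission_leftovers() {
  ns="${1:-rook-ceph}"
  admission_leftovers | while read -r res; do
    kubectl -n "$ns" get "$res" -o yaml --ignore-not-found
  done
}

# Delete the leftovers. An optional second argument is passed through to
# kubectl, e.g. --dry-run=client to preview without changing anything.
delete_admission_leftovers() {
  ns="${1:-rook-ceph}"
  admission_leftovers | while read -r res; do
    kubectl -n "$ns" delete "$res" --ignore-not-found ${2:+"$2"}
  done
}
```

Running `show_admission_leftovers rook-ceph` first and then `delete_admission_leftovers rook-ceph --dry-run=client` gives a chance to check the targets before the real delete.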

maon-fp commented 3 weeks ago


kind: Certificate
metadata:
  creationTimestamp: "2022-04-23T18:45:33Z"
  generation: 1
  name: rook-admission-controller-cert
  namespace: rook-ceph
  resourceVersion: "301286319"
  uid: 22aa348f-e223-4f98-870e-aab4ef1f71a9
spec:
  dnsNames:
  - rook-ceph-admission-controller
  - rook-ceph-admission-controller.rook-ceph.svc
  - rook-ceph-admission-controller.rook-ceph.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer
  secretName: rook-ceph-admission-controller
status:
  conditions:
  - lastTransitionTime: "2022-04-23T18:45:34Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2024-07-11T18:45:34Z"
  notBefore: "2024-04-12T18:45:34Z"
  renewalTime: "2024-06-11T18:45:34Z"
  revision: 13


kind: Issuer
metadata:
  creationTimestamp: "2022-04-23T18:45:32Z"
  generation: 1
  name: selfsigned-issuer
  namespace: rook-ceph
  resourceVersion: "138597982"
  uid: 68162730-aade-4670-b830-1cf97005ef5c
spec:
  selfSigned: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-23T18:45:32Z"
    observedGeneration: 1
    reason: IsReady
    status: "True"
    type: Ready


apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-04-23T18:45:34Z"
  name: rook-ceph-admission-controller
  namespace: rook-ceph
  resourceVersion: "214711462"
  uid: b62cac4d-ce0c-4f3d-aa19-ff2f9d9d553c
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    app: rook-ceph-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

maon-fp commented 3 weeks ago

I deleted those resources but still get (a slightly different) error:

Error from server (InternalError): error when applying patch:
Resource: ", Resource=cephclusters", GroupVersionKind: ", Kind=CephCluster"
Name: "rook-ceph", Namespace: "rook-ceph"
for: "04-cluster-prod.yaml": error when patching "04-cluster-prod.yaml": Internal error occurred: failed calling webhook "": failed to call webhook: Post "https://rook-ceph-admission-controller.rook-ceph.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s": service "rook-ceph-admission-controller" not found

I've also listed all resources in the namespace (list_rook_ceph.txt) and can still find some admission controller resources:

$ grep admission list_rook_ceph.txt
secret/rook-ceph-admission-controller                          3      2y3d
secret/rook-ceph-admission-controller-token-s47d8   3      3y105d
serviceaccount/rook-ceph-admission-controller   1         3y105d

subhamkrai commented 3 weeks ago

try deleting the resources mentioned above

maon-fp commented 3 weeks ago

As stated before: the resources are already deleted. But now it complains about: service "rook-ceph-admission-controller" not found instead of connection refused.

subhamkrai commented 3 weeks ago

Run kubectl get validatingwebhookconfigurations -A (to search across all namespaces). Also, I'm on holiday today, so I will look into it on Monday.

Edit: I hope it's not something blocking you

maon-fp commented 3 weeks ago

Thank you. Enjoy your time off! I'm not really blocked.

$ kubectl get validatingwebhookconfigurations -A
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          3y116d
ingress-nginx-admission         1          432d
metallb-webhook-configuration   7          432d
rook-ceph-webhook               5          2y3d

subhamkrai commented 3 weeks ago

I see the issue: you need to delete rook-ceph-webhook (I forgot that webhook configurations are cluster-scoped resources). Also, here is the code that deletes everything related to the webhook in Rook
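To double-check which webhook configuration still routes to the removed Service before deleting anything, something like the following sketch may help. The function name is made up for illustration, and the per-index matching assumes every webhook in a configuration uses a service reference rather than a URL.

```shell
# Hypothetical helper: print the names of ValidatingWebhookConfigurations
# that have at least one webhook pointing at <namespace>/<service>.
# Usage: webhook_configs_for_service <namespace> <service>
webhook_configs_for_service() {
  want_ns="$1"
  want_svc="$2"
  # One line per configuration: name, TAB, service namespaces, TAB, service names.
  kubectl get validatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].clientConfig.service.namespace}{"\t"}{.webhooks[*].clientConfig.service.name}{"\n"}{end}' |
    awk -F '\t' -v ns="$want_ns" -v svc="$want_svc" '
      {
        n = split($2, nss, " ")
        split($3, names, " ")
        for (i = 1; i <= n; i++) {
          if (nss[i] == ns && names[i] == svc) { print $1; break }
        }
      }'
}
```

For the error in this thread, `webhook_configs_for_service rook-ceph rook-ceph-admission-controller` would be the call to try. Any name it prints is a configuration that still sends admission requests to that Service.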

maon-fp commented 3 weeks ago

Alright. I'm not into Go but I'll figure it out. Thank you for your help!

maon-fp commented 2 weeks ago

Just to be 100% sure. Are you asking to run:

kubectl delete validatingwebhookconfigurations rook-ceph-webhook

? I'm a bit worried as I can see 5 webhooks there.

subhamkrai commented 2 weeks ago

Yes, delete rook-ceph-webhook only.
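For anyone worried about the same thing, a cautious variant is to save a copy of the object before deleting it. This is only a sketch: the function name and the backup file name are assumptions, not anything Rook provides.

```shell
# Hypothetical helper: back up a ValidatingWebhookConfiguration to a file,
# then delete it. The delete is skipped if the backup (kubectl get) fails.
# Usage: remove_stale_webhook [name] [backup-file]
remove_stale_webhook() {
  name="${1:-rook-ceph-webhook}"
  backup="${2:-$name.backup.yaml}"
  kubectl get validatingwebhookconfigurations "$name" -o yaml > "$backup" || return 1
  kubectl delete validatingwebhookconfigurations "$name"
}
```

Calling `remove_stale_webhook` with no arguments targets rook-ceph-webhook only and leaves the other webhook configurations in the cluster untouched.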

maon-fp commented 2 weeks ago

It worked. Thanks a lot for the quick and competent answers! :bow:

subhamkrai commented 2 weeks ago

Good to know it is working now @maon-fp