rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Applying cluster.yaml on v1.13.8: failed calling webhook "cephcluster-wh-rook-ceph-admission-controller-rook-ceph.rook.io": connect: connection refused #14116

Closed · maon-fp closed this issue 2 weeks ago

maon-fp commented 3 weeks ago

I've upgraded Rook from v1.10.11 to v1.13.8 step by step (v1.10.11 -> v1.11.11 -> v1.12.11 -> v1.13.8). At https://rook.github.io/docs/rook/v1.13/Upgrade/rook-upgrade/ I read that the admission controller is gone (it was enabled in my setup via ROOK_DISABLE_ADMISSION_CONTROLLER: "false"), so I changed this to ROOK_DISABLE_ADMISSION_CONTROLLER: "true" while still running v1.12.11. The upgrade to v1.13.8 went smoothly. Now I want to upgrade to Reef and am trying to apply the cluster.yaml, but this gives me:

rook $ kaf 04-cluster-prod.yaml
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"ceph.rook.io/v1\",\"kind\":\"CephCluster\",\"metadata\":{\"annotations\":{},\"name\":\"rook-ceph\",\"namespace\":\"rook-ceph\"},\"spec\":{\"annotations\":null,\"cephVersion\":{\"allowUnsupported\":false,\"image\":\"quay.io/ceph/ceph:v18.2.2\"},\"cleanupPolicy\":{\"allowUninstallWithVolumes\":false,\"confirmation\":\"\",\"sanitizeDisks\":{\"dataSource\":\"zero\",\"iteration\":1,\"method\":\"quick\"}},\"continueUpgradeAfterChecksEvenIfNotHealthy\":false,\"crashCollector\":{\"disable\":false},\"csi\":{\"cephfs\":null,\"readAffinity\":{\"enabled\":false}},\"dashboard\":{\"enabled\":true,\"ssl\":true},\"dataDirHostPath\":\"/var/lib/rook\",\"disruptionManagement\":{\"managePodBudgets\":true,\"osdMaintenanceTimeout\":30,\"pgHealthCheckTimeout\":0},\"healthCheck\":{\"daemonHealth\":{\"mon\":{\"disabled\":false,\"interval\":\"45s\"},\"osd\":{\"disabled\":false,\"interval\":\"60s\"},\"status\":{\"disabled\":false,\"interval\":\"60s\"}},\"livenessProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}},\"startupProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}}},\"labels\":null,\"logCollector\":{\"enabled\":true,\"maxLogSize\":\"500M\",\"periodicity\":\"daily\"},\"mgr\":{\"allowMultiplePerNode\":true,\"count\":2,\"modules\":null},\"mon\":{\"allowMultiplePerNode\":true,\"count\":3},\"monitoring\":{\"enabled\":false,\"metricsDisabled\":false},\"network\":{\"connections\":{\"compression\":{\"enabled\":false},\"encryption\":{\"enabled\":false},\"requireMsgr2\":false}},\"priorityClassNames\":{\"mgr\":\"system-cluster-critical\",\"mon\":\"system-node-critical\",\"osd\":\"system-node-critical\"},\"removeOSDsIfOutAndSafeToRemove\":false,\"resources\":null,\"skipUpgradeChecks\":false,\"storage\":{\"config\":null,\"nodes\":[{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme1n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage1.<redacted>\"},{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme2n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage2.<redacted>\"}],\"onlyApplyOSDPlacement\":false,\"useAllDevices\":false,\"useAllNodes\":false},\"waitTimeoutForHealthyOSDInMinutes\":10}}\n"}},"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:v18.2.2"},"csi":{"cephfs":null,"readAffinity":{"enabled":false}},"mgr":{"modules":null}}}
to:
Resource: "ceph.rook.io/v1, Resource=cephclusters", GroupVersionKind: "ceph.rook.io/v1, Kind=CephCluster"
Name: "rook-ceph", Namespace: "rook-ceph"
for: "04-cluster-prod.yaml": error when patching "04-cluster-prod.yaml": Internal error occurred: failed calling webhook "cephcluster-wh-rook-ceph-admission-controller-rook-ceph.rook.io": failed to call webhook: Post "https://rook-ceph-admission-controller.rook-ceph.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s": dial tcp 10.99.221.127:443: connect: connection refused

Environment:

maon-fp commented 3 weeks ago

My cluster.yaml: 04-cluster-prod.txt

maon-fp commented 3 weeks ago

Operator shows no errors or warnings.

travisn commented 3 weeks ago

@subhamkrai What are the steps to manually disable the admission controller? I can't seem to find it from previous issues.

subhamkrai commented 3 weeks ago

@subhamkrai What are the steps to manually disable the admission controller? I can't seem to find it from previous issues.

I don't remember exactly, but setting the value to true should work. If that is not working, try deleting the validating webhook rook-ceph-webhook.
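
In kubectl terms, that should be something along the lines of:

$ kubectl delete validatingwebhookconfigurations rook-ceph-webhook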

@maon-fp

maon-fp commented 3 weeks ago

Thank you for your replies.

I don't remember exactly, but setting the value to true should work. If that is not working, try deleting the validating webhook rook-ceph-webhook.

Are those supposed to be pods? I don't have any of those. I'm currently on v1.13.8, where there is no ROOK_DISABLE_ADMISSION_CONTROLLER anymore. How can I set it to true now?

subhamkrai commented 3 weeks ago

It was there until v1.12 (https://github.com/rook/rook/blob/release-1.12/deploy/examples/operator.yaml#L509), and in v1.13 we removed it.

The validating webhook rook-ceph-webhook is not a pod; it is a Kubernetes resource. Try kubectl get validatingwebhookconfigurations rook-ceph-webhook.
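
For example, to list them and inspect the Rook one (validating webhook configurations are cluster-scoped, so no namespace is involved):

$ kubectl get validatingwebhookconfigurations
$ kubectl get validatingwebhookconfiguration rook-ceph-webhook -o yaml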

maon-fp commented 3 weeks ago

@subhamkrai Thank you for pointing me in the right direction. I can see those resources:

$ kubectl api-resources --verbs=list -n rook-ceph | grep hook
mutatingwebhookconfigurations                       admissionregistration.k8s.io/v1   false        MutatingWebhookConfiguration
validatingwebhookconfigurations                     admissionregistration.k8s.io/v1   false        ValidatingWebhookConfiguration
$ kubectl api-resources --verbs=list -n rook-ceph | grep val
validatingwebhookconfigurations                     admissionregistration.k8s.io/v1   false        ValidatingWebhookConfiguration

So none of the ones you mentioned, right?

It was there until v1.12 (https://github.com/rook/rook/blob/release-1.12/deploy/examples/operator.yaml#L509), and in v1.13 we removed it.

So no chance to set it to true now?

subhamkrai commented 3 weeks ago

@maon-fp could you also share the list of services in the rook-ceph namespace?

subhamkrai commented 3 weeks ago

Also, could you share the first 10 lines of the Rook operator pod's logs?

maon-fp commented 3 weeks ago

Yes, of course.

List of services:

$ kgs
NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
csi-rbdplugin-metrics            ClusterIP   10.104.212.46    <none>        8080/TCP,8081/TCP   3y104d
rook-ceph-admission-controller   ClusterIP   10.99.221.127    <none>        443/TCP             2y2d
rook-ceph-mgr                    ClusterIP   10.109.30.124    <none>        9283/TCP            3y104d
rook-ceph-mgr-dashboard          ClusterIP   10.107.242.106   <none>        8443/TCP            3y104d
rook-ceph-mon-a                  ClusterIP   10.101.39.245    <none>        6789/TCP,3300/TCP   3y104d
rook-ceph-mon-c                  ClusterIP   10.110.130.143   <none>        6789/TCP,3300/TCP   3y104d
rook-ceph-mon-d                  ClusterIP   10.110.86.107    <none>        6789/TCP,3300/TCP   3y104d

First lines of operator log:


$ kl rook-ceph-operator-9f688fcc5-v2q6j | head -n 10
2024/04/23 14:00:19 maxprocs: Leaving GOMAXPROCS=24: CPU quota undefined
2024-04-23 14:00:19.215493 I | rookcmd: starting Rook v1.13.8 with arguments '/usr/local/bin/rook ceph operator'
2024-04-23 14:00:19.215514 I | rookcmd: flag values: --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-level=INFO
2024-04-23 14:00:19.215519 I | cephcmd: starting Rook-Ceph operator
2024-04-23 14:00:19.322061 I | cephcmd: base ceph version inside the rook operator image is "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)"
2024-04-23 14:00:19.332548 I | op-k8sutil: ROOK_CURRENT_NAMESPACE_ONLY="false" (env var)
2024-04-23 14:00:19.332558 I | operator: watching all namespaces for Ceph CRs
2024-04-23 14:00:19.332604 I | operator: setting up schemes
2024-04-23 14:00:19.335083 I | operator: setting up the controller-runtime manager
2024-04-23 14:00:19.335422 I | ceph-cluster-controller: successfully started

subhamkrai commented 3 weeks ago

The logs didn't help much, but yes, delete the following resources (probably in the rook-ceph namespace):

Certificate "rook-admission-controller-cert"
Issuer "selfsigned-issuer"
Service "rook-ceph-admission-controller"

Also, could you share the -o yaml output of the Certificate and Issuer mentioned above, to make sure you are deleting the right resources? But yes, we need to clean up the three resources above.
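
In concrete terms, something like this (a sketch assuming cert-manager's Certificate and Issuer kinds; check each -o yaml output before deleting):

$ kubectl -n rook-ceph get certificate rook-admission-controller-cert -o yaml
$ kubectl -n rook-ceph get issuer selfsigned-issuer -o yaml
$ kubectl -n rook-ceph get service rook-ceph-admission-controller -o yaml
$ kubectl -n rook-ceph delete certificate rook-admission-controller-cert
$ kubectl -n rook-ceph delete issuer selfsigned-issuer
$ kubectl -n rook-ceph delete service rook-ceph-admission-controller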

maon-fp commented 3 weeks ago

rook-admission-controller-cert:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  creationTimestamp: "2022-04-23T18:45:33Z"
  generation: 1
  name: rook-admission-controller-cert
  namespace: rook-ceph
  resourceVersion: "301286319"
  uid: 22aa348f-e223-4f98-870e-aab4ef1f71a9
spec:
  dnsNames:
  - rook-ceph-admission-controller
  - rook-ceph-admission-controller.rook-ceph.svc
  - rook-ceph-admission-controller.rook-ceph.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer
  secretName: rook-ceph-admission-controller
status:
  conditions:
  - lastTransitionTime: "2022-04-23T18:45:34Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2024-07-11T18:45:34Z"
  notBefore: "2024-04-12T18:45:34Z"
  renewalTime: "2024-06-11T18:45:34Z"
  revision: 13

selfsigned-issuer:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  creationTimestamp: "2022-04-23T18:45:32Z"
  generation: 1
  name: selfsigned-issuer
  namespace: rook-ceph
  resourceVersion: "138597982"
  uid: 68162730-aade-4670-b830-1cf97005ef5c
spec:
  selfSigned: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-23T18:45:32Z"
    observedGeneration: 1
    reason: IsReady
    status: "True"
    type: Ready

rook-ceph-admission-controller:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-04-23T18:45:34Z"
  name: rook-ceph-admission-controller
  namespace: rook-ceph
  resourceVersion: "214711462"
  uid: b62cac4d-ce0c-4f3d-aa19-ff2f9d9d553c
spec:
  clusterIP: 10.99.221.127
  clusterIPs:
  - 10.99.221.127
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    app: rook-ceph-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

maon-fp commented 3 weeks ago

I deleted those resources but still get (a slightly different) error:

Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"ceph.rook.io/v1\",\"kind\":\"CephCluster\",\"metadata\":{\"annotations\":{},\"name\":\"rook-ceph\",\"namespace\":\"rook-ceph\"},\"spec\":{\"annotations\":null,\"cephVersion\":{\"allowUnsupported\":false,\"image\":\"quay.io/ceph/ceph:v18.2.2\"},\"cleanupPolicy\":{\"allowUninstallWithVolumes\":false,\"confirmation\":\"\",\"sanitizeDisks\":{\"dataSource\":\"zero\",\"iteration\":1,\"method\":\"quick\"}},\"continueUpgradeAfterChecksEvenIfNotHealthy\":false,\"crashCollector\":{\"disable\":false},\"csi\":{\"cephfs\":null,\"readAffinity\":{\"enabled\":false}},\"dashboard\":{\"enabled\":true,\"ssl\":true},\"dataDirHostPath\":\"/var/lib/rook\",\"disruptionManagement\":{\"managePodBudgets\":true,\"osdMaintenanceTimeout\":30,\"pgHealthCheckTimeout\":0},\"healthCheck\":{\"daemonHealth\":{\"mon\":{\"disabled\":false,\"interval\":\"45s\"},\"osd\":{\"disabled\":false,\"interval\":\"60s\"},\"status\":{\"disabled\":false,\"interval\":\"60s\"}},\"livenessProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}},\"startupProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}}},\"labels\":null,\"logCollector\":{\"enabled\":true,\"maxLogSize\":\"500M\",\"periodicity\":\"daily\"},\"mgr\":{\"allowMultiplePerNode\":true,\"count\":2,\"modules\":null},\"mon\":{\"allowMultiplePerNode\":true,\"count\":3},\"monitoring\":{\"enabled\":false,\"metricsDisabled\":false},\"network\":{\"connections\":{\"compression\":{\"enabled\":false},\"encryption\":{\"enabled\":false},\"requireMsgr2\":false}},\"priorityClassNames\":{\"mgr\":\"system-cluster-critical\",\"mon\":\"system-node-critical\",\"osd\":\"system-node-critical\"},\"removeOSDsIfOutAndSafeToRemove\":false,\"resources\":null,\"skipUpgradeChecks\":false,\"storage\":{\"config\":null,\"nodes\":[{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme1n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage1.<redacted>\"},{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme2n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage2.<redacted>\"}],\"onlyApplyOSDPlacement\":false,\"useAllDevices\":false,\"useAllNodes\":false},\"waitTimeoutForHealthyOSDInMinutes\":10}}\n"}},"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:v18.2.2"},"csi":{"cephfs":null,"readAffinity":{"enabled":false}},"mgr":{"modules":null}}}
to:
Resource: "ceph.rook.io/v1, Resource=cephclusters", GroupVersionKind: "ceph.rook.io/v1, Kind=CephCluster"
Name: "rook-ceph", Namespace: "rook-ceph"
for: "04-cluster-prod.yaml": error when patching "04-cluster-prod.yaml": Internal error occurred: failed calling webhook "cephcluster-wh-rook-ceph-admission-controller-rook-ceph.rook.io": failed to call webhook: Post "https://rook-ceph-admission-controller.rook-ceph.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s": service "rook-ceph-admission-controller" not found

I've also listed all resources in the namespace (list_rook_ceph.txt) and can still find some admission controller resources:

$ grep admission list_rook_ceph.txt
secret/rook-ceph-admission-controller               kubernetes.io/tls                     3      2y3d
secret/rook-ceph-admission-controller-token-s47d8   kubernetes.io/service-account-token   3      3y105d
serviceaccount/rook-ceph-admission-controller   1         3y105d

subhamkrai commented 3 weeks ago

Try deleting those resources as well.
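
For the leftovers from the grep output, that would be something like (the token secret is normally garbage-collected once its ServiceAccount is deleted):

$ kubectl -n rook-ceph delete secret rook-ceph-admission-controller
$ kubectl -n rook-ceph delete serviceaccount rook-ceph-admission-controller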

maon-fp commented 3 weeks ago

As stated before: those resources are already deleted. But now it complains about service "rook-ceph-admission-controller" not found instead of connection refused.

subhamkrai commented 3 weeks ago

kubectl get validatingwebhookconfigurations -A (run this once to search across all namespaces). Also, I'm on holiday today, so I will look on Monday.

Edit: I hope this isn't blocking you.

maon-fp commented 3 weeks ago

Thank you. Enjoy your time off! I'm not really blocked.

$ kubectl get validatingwebhookconfigurations -A
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          3y116d
ingress-nginx-admission         1          432d
metallb-webhook-configuration   7          432d
rook-ceph-webhook               5          2y3d
subhamkrai commented 3 weeks ago

Thank you. Enjoy your time off! I'm not really blocked.

$ kubectl get validatingwebhookconfigurations -A
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          3y116d
ingress-nginx-admission         1          432d
metallb-webhook-configuration   7          432d
rook-ceph-webhook               5          2y3d

I see the issue: you need to delete rook-ceph-webhook (I forgot that webhook configurations are cluster-scoped resources). Here is the code that deletes everything related to the webhook in Rook: https://github.com/rook/rook/blob/b32948c314d64f6b48e40f32d5df656b33d921d1/pkg/operator/ceph/webhook-config.go#L258-L282

maon-fp commented 3 weeks ago

Alright. I'm not into Go, but I'll figure it out. Thank you for your help!

maon-fp commented 2 weeks ago

Just to be 100% sure, are you asking me to run:

kubectl delete validatingwebhookconfigurations rook-ceph-webhook

? I'm a bit worried as I can see 5 webhooks there.

subhamkrai commented 2 weeks ago

Just to be 100% sure, are you asking me to run:

kubectl delete validatingwebhookconfigurations rook-ceph-webhook

? I'm a bit worried as I can see 5 webhooks there.

Yes, delete rook-ceph-webhook only.
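
All five entries in that configuration presumably point at the admission controller service you already removed, so deleting the whole configuration is safe. After that, re-applying the manifest should go through:

$ kubectl delete validatingwebhookconfigurations rook-ceph-webhook
$ kubectl apply -f 04-cluster-prod.yaml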

maon-fp commented 2 weeks ago

It worked. Thanks a lot for the quick and competent answers! :bow:

subhamkrai commented 2 weeks ago

Good to know it is working now, @maon-fp.