kubernetes / kubeadm

Aggregator for issues filed against kubeadm

Facing "kubeadm:cluster-admins already exists" error while running "kubeadm init phase mark-control-plane" step in K8s 1.29.3 version #3081

Open dhruvapg opened 2 weeks ago

dhruvapg commented 2 weeks ago

What keywords did you search in kubeadm issues before filing this one?

"kubeadm:cluster-admins" already exists unable to create the kubeadm:cluster-admins ClusterRoleBinding by using super-admin.conf

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): v1.29.3

Environment:

What happened?

I'm hitting "clusterrolebindings.rbac.authorization.k8s.io kubeadm:cluster-admins already exists" error during kubeadm init phase mark-control-plane in K8s 1.29.3 version. Even if I delete clusterrolebinding manually, it gets created automatically by kubeadm in some sync loop and then kubeadm init phase mark-control-plane step fails with "clusterrolebinding already exists" error and is stuck in this error state. I0627 08:12:10.094815 538426 kubeconfig.go:606] ensuring that the ClusterRoleBinding for the kubeadm:cluster-admins Group exists I0627 08:12:10.102894 538426 kubeconfig.go:682] creating the ClusterRoleBinding for the kubeadm:cluster-admins Group by using super-admin.conf clusterrolebindings.rbac.authorization.k8s.io "kubeadm:cluster-admins" already exists unable to create the kubeadm:cluster-admins ClusterRoleBinding by using super-admin.conf k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.EnsureAdminClusterRoleBindingImpl cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:708 k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.EnsureAdminClusterRoleBinding cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:595 k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*initData).Client cmd/kubeadm/app/cmd/init.go:526 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runMarkControlPlane cmd/kubeadm/app/cmd/phases/init/markcontrolplane.go:60 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1 cmd/kubeadm/app/cmd/phases/workflow/runner.go:259 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll cmd/kubeadm/app/cmd/phases/workflow/runner.go:446 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run cmd/kubeadm/app/cmd/phases/workflow/runner.go:232 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).BindToCommand.func1.1 cmd/kubeadm/app/cmd/phases/workflow/runner.go:372 github.com/spf13/cobra.(*Command).execute vendor/github.com/spf13/cobra/command.go:940 github.com/spf13/cobra.(*Command).ExecuteC vendor/github.com/spf13/cobra/command.go:1068 github.com/spf13/cobra.(*Command).Execute vendor/github.com/spf13/cobra/command.go:992 k8s.io/kubernetes/cmd/kubeadm/app.Run cmd/kubeadm/app/kubeadm.go:50 main.main cmd/kubeadm/kubeadm.go:25 runtime.main /usr/local/go/src/runtime/proc.go:267 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1650 could not bootstrap the admin user in file admin.conf k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*initData).Client cmd/kubeadm/app/cmd/init.go:528 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runMarkControlPlane cmd/kubeadm/app/cmd/phases/init/markcontrolplane.go:60 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1 cmd/kubeadm/app/cmd/phases/workflow/runner.go:259 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll cmd/kubeadm/app/cmd/phases/workflow/runner.go:446 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run cmd/kubeadm/app/cmd/phases/workflow/runner.go:232 k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).BindToCommand.func1.1 cmd/kubeadm/app/cmd/phases/workflow/runner.go:372 github.com/spf13/cobra.(*Command).execute vendor/github.com/spf13/cobra/command.go:940 github.com/spf13/cobra.(*Command).ExecuteC vendor/github.com/spf13/cobra/command.go:1068 github.com/spf13/cobra.(*Command).Execute vendor/github.com/spf13/cobra/command.go:992 k8s.io/kubernetes/cmd/kubeadm/app.Run cmd/kubeadm/app/kubeadm.go:50 main.main cmd/kubeadm/kubeadm.go:25 runtime.main

The kube-apiserver audit logs show that the kubeadm:cluster-admins ClusterRoleBinding is created automatically by kubeadm running on the new control plane node:

```
audit/kube-apiserver.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"RequestResponse","auditID":"81d9e046-37ae-4814-9e4b-56e87cc05c56","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?timeout=10s","verb":"create","user":{"username":"kubernetes-super-admin","groups":["system:masters","system:authenticated"]},"userAgent":"kubeadm/v1.29.3+(linux/amd64) kubernetes/4ab1a82","objectRef":{"resource":"clusterrolebindings","name":"kubeadm:cluster-admins","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestObject":{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"kubeadm:cluster-admins","creationTimestamp":null},"subjects":[{"kind":"Group","apiGroup":"rbac.authorization.k8s.io","name":"kubeadm:cluster-admins"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"cluster-admin"}},"responseObject":{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"kubeadm:cluster-admins","uid":"629da920-2bd3-4a98-9348-86708ccf6e4e","resourceVersion":"65240","creationTimestamp":"2024-06-27T06:54:24Z","managedFields":[{"manager":"kubeadm","operation":"Update","apiVersion":"rbac.authorization.k8s.io/v1","time":"2024-06-27T06:54:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:roleRef":{},"f:subjects":{}}}]},"subjects":[{"kind":"Group","apiGroup":"rbac.authorization.k8s.io","name":"kubeadm:cluster-admins"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"cluster-admin"}},"requestReceivedTimestamp":"2024-06-27T06:54:24.611747Z","stageTimestamp":"2024-06-27T06:54:24.617174Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
```

What you expected to happen?

I expected the mark-control-plane phase to handle the "clusterrolebinding already exists" error gracefully and not return an error. This is already fixed in 1.30 but not backported to 1.29.

Is there a way to work around this error until https://github.com/kubernetes/kubernetes/commit/ec1516b45dc3d50fcfce87d7169a6d23e388c1b1 is backported to 1.29?

How to reproduce it (as minimally and precisely as possible)?

  1. Deploy a Kubernetes cluster at version 1.29.3.
  2. Take an etcd snapshot backup using etcdctl (the backup and restore commands are sketched after this list).
  3. Delete the control plane nodes.
  4. Restore the etcd snapshot using etcdctl.
  5. Invoke the kubeadm init phases.
  6. The "kubeadm init phase mark-control-plane" step fails with the "clusterrolebinding already exists" error.

Anything else we need to know?

When I faced the same error during "kubeadm init phase upload-config all", I added a "kubectl delete clusterrolebinding kubeadm:cluster-admins" command to delete it before that step, which resolved the error and let me move on to the next step. However, the same workaround does not help during the mark-control-plane phase (sketched below).
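Roughly what that workaround looked like (a sketch; the kubeadm config path is the one from my setup):

```bash
# Deleting the binding before upload-config let that phase proceed:
kubectl delete clusterrolebinding kubeadm:cluster-admins
kubeadm init phase upload-config all --config=/etc/k8s/kubeadm.yaml

# The same delete before mark-control-plane does not help: kubeadm recreates the
# binding via super-admin.conf and the phase still hits the "already exists" error.
kubectl delete clusterrolebinding kubeadm:cluster-admins
kubeadm init phase mark-control-plane --config=/etc/k8s/kubeadm.yaml
```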

neolit123 commented 2 weeks ago

Deploy a Kubernetes cluster at version 1.29.3; take an etcd snapshot backup using etcdctl; delete the control plane nodes; restore the etcd snapshot using etcdctl; invoke the kubeadm init phases; the "kubeadm init phase mark-control-plane" step fails with the "clusterrolebinding already exists" error

calling kubeadm init or join on an existing etcd data dir in /var/lib/etcd is not really supported or tested. so you might have to skip the mark-control-plane phase and manually apply what it does to work around the problem (see the sketch below).
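for illustration, that could look roughly like this (a sketch, not an endorsed procedure; the node name and the exact label/taint set are assumptions based on what the phase normally applies):

```bash
# run init without the failing phase (or simply don't invoke that phase when running phases one by one)
kubeadm init --config=/etc/k8s/kubeadm.yaml --skip-phases=mark-control-plane

# then apply roughly what mark-control-plane would have done (NODE_NAME = the control plane node name)
kubectl label node "${NODE_NAME}" node-role.kubernetes.io/control-plane=
kubectl label node "${NODE_NAME}" node.kubernetes.io/exclude-from-external-load-balancers=
kubectl taint node "${NODE_NAME}" node-role.kubernetes.io/control-plane:NoSchedule
```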

the correct way to do this type of restore is to:

in terms of why it's failing, i'm a bit confused. in 1.29 we already check if the CRB exists: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L653C1-L657C13

and then we exit without an error: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L669-L672

can you show the output of kubeadm init phase mark-control-plane --v=10 when the error is happening?

use pastebin or github gists to share the full output.

dhruvapg commented 2 weeks ago

can you show the output of kubeadm init phase mark-control-plane --v=10 when the error is happening?

https://gist.github.com/dhruvapg/84d2d3b8cd0c81c114bf57db0b634281

Also attaching the output of the same command invoked with the kubeadm config file:

```
root@422f37e8f2e2d83c5f4d6fd98e049586 [ ~ ]# kubeadm init phase mark-control-plane --config=/etc/k8s/kubeadm.yaml --rootfs=/ --v=10
```

https://gist.github.com/dhruvapg/cd403364523e0e300875450b3cfe6337

in terms of why it's failing, i'm a bit confused. in 1.29 we already check if the CRB exists: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L653C1-L657C13

and then we exit without an error: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L669-L672

In my case, during the etcd restore, the apiserver is configured to deny all APIs (except from privileged users/groups). Since RBAC is disabled, cluster-admin.conf cannot be used, and kubeadm fails when creating the clusterrolebinding with the super-admin.conf client here: https://github.com/kubernetes/kubernetes/blob/v1.29.6/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L697 The error is then returned from here: https://github.com/kubernetes/kubernetes/blob/v1.29.6/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L708
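For reference, a quick way to see which credentials are actually honored in that state (a diagnostic sketch; the kubeconfig paths under /etc/kubernetes are assumptions about this setup):

```bash
# Can the regular admin.conf (bound to kubeadm:cluster-admins via RBAC) do anything?
kubectl --kubeconfig=/etc/kubernetes/admin.conf auth can-i create clusterrolebindings

# super-admin.conf (kubernetes-super-admin, in system:masters) should still be allowed
kubectl --kubeconfig=/etc/kubernetes/super-admin.conf auth can-i create clusterrolebindings

# The binding kubeadm keeps tripping over
kubectl --kubeconfig=/etc/kubernetes/super-admin.conf get clusterrolebinding kubeadm:cluster-admins
```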

The "CRB exists" error for super-admin.conf is already fixed in 1.30 but not backported to 1.29.

neolit123 commented 2 weeks ago

thanks for the info.

In my case, during the etcd restore, the apiserver is configured to deny all APIs (except from privileged users/groups). Since RBAC is disabled, cluster-admin.conf cannot be used, and kubeadm fails when creating the clusterrolebinding with the super-admin.conf client:

that's not a supported or tested scenario, but i can imagine users are doing similar things. could you explain why you are restoring from backup in this way? is that an automated process supported in your stack or are you doing this as a one-off?

The "CRB exists" error for super-admin.conf is already fixed in 1.30 but not backported to 1.29.

so if we backport that PR to 1.29 it would be a fix for you?

neolit123 commented 2 weeks ago

here is the fix backport https://github.com/kubernetes/kubernetes/pull/125821

that should be available in the next 1.29.x patch release if the release managers do not miss it.

dhruvapg commented 2 weeks ago

Thanks for backporting the fix to 1.29.x

could you explain why are you restoring from backup in a similar way? is that an automated process supported in your stack or are you doing this as a one-off?

Restoring the control plane to a backed-up k8s version is a feature we began supporting; we disable webhooks and RBAC for the apiserver until the restore operation is completed, to avoid any undefined states. This was working fine until 1.28, since the default admin.conf was bound to the system:masters Group, which can bypass RBAC; it started breaking in 1.29 with the separation of cluster-admin.conf and super-admin.conf.
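For context, "RBAC disabled" here means the apiserver temporarily runs without RBAC among its authorization modes. An illustration of how that can be inspected (the static pod manifest path is the kubeadm default; the exact mechanics of our setup may differ):

```bash
# show the authorization modes configured on the kube-apiserver static pod
grep -- '--authorization-mode' /etc/kubernetes/manifests/kube-apiserver.yaml
# kubeadm's default is "--authorization-mode=Node,RBAC"; during the restore
# window RBAC is removed from that list.
```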

so if we backport that PR to 1.29 it would be a fix for you?

I haven't verified it in 1.30, but I think it would fix the issue. Earlier, I ran into the same error during kubeadm init phase upload-config all and could work around it by deleting the clusterrolebinding, but the same hack didn't work for kubeadm init phase mark-control-plane. So hopefully this backported fix will handle the apierrors.IsAlreadyExists error gracefully in these scenarios.

neolit123 commented 2 weeks ago

This was working fine until 1.28, since the default admin.conf was bound to the system:masters Group, which can bypass RBAC; it started breaking in 1.29 with the separation of cluster-admin.conf and super-admin.conf.

if the feature you support requires the system:masters group, which bypasses RBAC, you might have to maintain an admin.conf that continues to bind to system:masters. if you populate such an admin.conf, kubeadm init/join will respect it, but kubeadm's cert rotation (e.g. on upgrade) will convert it back to the cluster-admin role.
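a minimal sketch of minting such a file (assuming the kubeadm kubeconfig user helper and the existing kubeadm config file; treat it as an illustration only, since cert rotation will rewrite the file again as noted above):

```bash
# generate a kubeconfig whose client cert carries the system:masters organization
# (bypasses RBAC); back up the existing admin.conf before overwriting it.
cp /etc/kubernetes/admin.conf /etc/kubernetes/admin.conf.bak
kubeadm kubeconfig user \
  --config=/etc/k8s/kubeadm.yaml \
  --client-name=kubernetes-admin \
  --org=system:masters > /etc/kubernetes/admin.conf
```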