Open dhruvapg opened 2 weeks ago
Deployed a Kubernetes cluster at version 1.29.3. Took an etcd snapshot backup using etcdctl. Deleted the control plane nodes. Restored the etcd snapshot using etcdctl. Invoked kubeadm init phases; the "kubeadm init phase mark-control-plane" step fails with a "clusterrolebinding already exists" error.
calling kubeadm init or join on an existing etcd data dir from /var/lib/etcd is not really supported or tested, so you might have to skip the mark-control-plane phase and manually apply what it does to work around the problem.
the correct way to do this type of restore is to:
kubeadm join a new CP node.

in terms of why it's failing, i'm a bit confused. in 1.29 we already check if the CRB exists: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L653C1-L657C13
and then we exit without an error: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L669-L672
can you show the output of kubeadm init phase mark-control-plane --v=10
when the error is happening?
use pastebin or github gists to share the full output.
can you show the output of kubeadm init phase mark-control-plane --v=10 when the error is happening?
https://gist.github.com/dhruvapg/84d2d3b8cd0c81c114bf57db0b634281
Also attaching the output of the same command invoked with kubeadm config file:
root@422f37e8f2e2d83c5f4d6fd98e049586 [ ~ ]# kubeadm init phase mark-control-plane --config=/etc/k8s/kubeadm.yaml --rootfs=/ --v=10
https://gist.github.com/dhruvapg/cd403364523e0e300875450b3cfe6337
in terms of why it's failing, i'm a bit confused. in 1.29 we already check if the CRB exists: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L653C1-L657C13
and then we exit without an error: https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L669-L672
In my case, during etcd restore, the apiserver is configured to deny all APIs (except from privileged users/groups). Since RBAC is disabled for cluster-admin.conf, it's failing when creating the clusterrolebinding with the super-admin.conf client: https://github.com/kubernetes/kubernetes/blob/v1.29.6/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L697 and returning an error from here: https://github.com/kubernetes/kubernetes/blob/v1.29.6/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L708
The "CRB exists" error for super-admin.conf is already fixed in 1.30 but not backported to 1.29.
thanks for the info.
In my case, during etcd restore, apiserver is configured to deny all APIs (except from privileged users/groups), since RBAC is disabled for cluster-admin.conf, it's failing when creating clusterolebinding with super-admin.conf client:
that's not a supported or tested scenario, but i can imagine users are doing similar actions. could you explain why you are restoring from backup in this way? is that an automated process supported in your stack, or are you doing this as a one-off?
The "CRB exists" error for super-admin.conf is already fixed in 1.30 but not backported to 1.29.
so if we backport that PR to 1.29 it would be a fix for you?
here is the fix backport https://github.com/kubernetes/kubernetes/pull/125821
that can be available in the next 1.29.x if the release managers do not miss it.
Thanks for backporting the fix to 1.29.x
could you explain why you are restoring from backup in this way? is that an automated process supported in your stack, or are you doing this as a one-off?
Restoring the control plane to a backed-up k8s version is a feature we began supporting. We disable webhooks and RBAC for the apiserver until the restore operation is completed, to avoid any undefined states. This was working fine until 1.28, since the default admin.conf was bound to the system:masters Group, which could bypass RBAC; it started breaking in 1.29 with the separation of cluster-admin.conf and super-admin.conf.
so if we backport that PR to 1.29 it would be a fix for you?
I haven't verified in 1.30, but I think it would fix the issue.
Earlier, I ran into the same error during kubeadm init phase upload-config all. I could work around it by deleting the clusterrolebinding, but the same hack didn't work for kubeadm init phase mark-control-plane.
So hopefully this backported fix will handle the apierrors.IsAlreadyExists error gracefully in these scenarios.
This was working fine until 1.28, since the default admin.conf was bound to the system:masters Group, which could bypass RBAC; it started breaking in 1.29 with the separation of cluster-admin.conf and super-admin.conf.
if the feature you support requires the system:masters group, which bypasses RBAC, you might have to maintain an admin.conf that continues to bind to system:masters. if you populate an admin.conf, kubeadm init/join will respect it, but kubeadm's cert rotation (e.g. on upgrade) will convert it back to a cluster-admin role.
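for context, the system:masters group comes from the Organization field of the client certificate embedded in admin.conf. a stdlib-only Go sketch of issuing such a client cert (self-signed here purely for illustration; a real admin.conf cert must be signed by the cluster CA, which kubeadm normally does for you):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newMastersClientCert issues a client certificate whose Subject Organization
// is system:masters, which the apiserver maps to the group that bypasses RBAC.
// Self-signed for illustration only; kubeadm signs admin.conf certs with the
// cluster CA (/etc/kubernetes/pki/ca.crt).
func newMastersClientCert() (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject: pkix.Name{
			CommonName:   "kubernetes-admin",          // becomes the username
			Organization: []string{"system:masters"},  // becomes the group
		},
		NotBefore:   time.Now(),
		NotAfter:    time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:    x509.KeyUsageDigitalSignature,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newMastersClientCert()
	if err != nil {
		panic(err)
	}
	fmt.Println("cert group:", cert.Subject.Organization[0])
}
```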
What keywords did you search in kubeadm issues before filing this one?
"kubeadm:cluster-admins" already exists
unable to create the kubeadm:cluster-admins ClusterRoleBinding by using super-admin.conf
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version): v1.29.3

Environment:
kubectl version: v1.29.3
uname -a: Linux 422f37e8f2e2d83c5f4d6fd98e049586 5.10.216-1.ph4-esx

What happened?
I'm hitting a "clusterrolebindings.rbac.authorization.k8s.io kubeadm:cluster-admins already exists" error during kubeadm init phase mark-control-plane in K8s version 1.29.3. Even if I delete the clusterrolebinding manually, it gets created automatically by kubeadm in some sync loop, and then the kubeadm init phase mark-control-plane step fails with the "clusterrolebinding already exists" error and is stuck in this error state.

I0627 08:12:10.094815  538426 kubeconfig.go:606] ensuring that the ClusterRoleBinding for the kubeadm:cluster-admins Group exists
I0627 08:12:10.102894  538426 kubeconfig.go:682] creating the ClusterRoleBinding for the kubeadm:cluster-admins Group by using super-admin.conf
clusterrolebindings.rbac.authorization.k8s.io "kubeadm:cluster-admins" already exists
unable to create the kubeadm:cluster-admins ClusterRoleBinding by using super-admin.conf
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.EnsureAdminClusterRoleBindingImpl
        cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:708
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.EnsureAdminClusterRoleBinding
        cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:595
k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*initData).Client
        cmd/kubeadm/app/cmd/init.go:526
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runMarkControlPlane
        cmd/kubeadm/app/cmd/phases/init/markcontrolplane.go:60
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).BindToCommand.func1.1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:372
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:267
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1650
could not bootstrap the admin user in file admin.conf
k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*initData).Client
        cmd/kubeadm/app/cmd/init.go:528
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runMarkControlPlane
        cmd/kubeadm/app/cmd/phases/init/markcontrolplane.go:60
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).BindToCommand.func1.1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:372
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
kube-apiserver audit logs show that the kubeadm:cluster-admins clusterrolebinding is automatically created by kubeadm running on the new control plane node.
audit/kube-apiserver.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"RequestResponse","auditID":"81d9e046-37ae-4814-9e4b-56e87cc05c56","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?timeout=10s","verb":"create","user":{"username":"kubernetes-super-admin","groups":["system:masters","system:authenticated"]},"userAgent":"kubeadm/v1.29.3+(linux/amd64) kubernetes/4ab1a82","objectRef":{"resource":"clusterrolebindings","name":"kubeadm:cluster-admins","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestObject":{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"kubeadm:cluster-admins","creationTimestamp":null},"subjects":[{"kind":"Group","apiGroup":"rbac.authorization.k8s.io","name":"kubeadm:cluster-admins"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"cluster-admin"}},"responseObject":{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"kubeadm:cluster-admins","uid":"629da920-2bd3-4a98-9348-86708ccf6e4e","resourceVersion":"65240","creationTimestamp":"2024-06-27T06:54:24Z","managedFields":[{"manager":"kubeadm","operation":"Update","apiVersion":"rbac.authorization.k8s.io/v1","time":"2024-06-27T06:54:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:roleRef":{},"f:subjects":{}}}]},"subjects":[{"kind":"Group","apiGroup":"rbac.authorization.k8s.io","name":"kubeadm:cluster-admins"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"cluster-admin"}},"requestReceivedTimestamp":"2024-06-27T06:54:24.611747Z","stageTimestamp":"2024-06-27T06:54:24.617174Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
What you expected to happen?
I expected the mark-control-plane phase to handle the "clusterrolebinding already exists" error gracefully and not return an error. This issue is already fixed in 1.30 but not backported to 1.29.
Is there a way to work around this error before https://github.com/kubernetes/kubernetes/commit/ec1516b45dc3d50fcfce87d7169a6d23e388c1b1 is backported to 1.29?
How to reproduce it (as minimally and precisely as possible)?
kubeadm init phase

Anything else we need to know?
When I faced the same error during "kubeadm init phase upload-config all", I added a "kubectl delete clusterrolebinding kubeadm:cluster-admins" command before that step; that resolved the error and let me move to the next step. However, the same workaround is not helpful during the mark-control-plane phase.