openshift / cluster-kube-controller-manager-operator

The kube-controller-manager operator installs and maintains the kube-controller-manager on a cluster
Apache License 2.0

kcm rootCA missing apiserver ca leads to kube-root-ca problem #702

Open lance5890 opened 1 year ago

lance5890 commented 1 year ago

1 Bug symptoms

  1. The kcm (kube-controller-manager) rootCA is generated by manageServiceAccountCABundle in targetconfigcontroller. This function first gets the kube-apiserver-server-ca ConfigMap and then combines it with two other ConfigMaps to generate the kcm rootCA.
  2. Occasionally (this is very unlikely; I have hit it only once), the kube-apiserver-server-ca ConfigMap is missing. manageServiceAccountCABundle then generates the rootCA without kube-apiserver-server-ca, and the kcm leader ends up holding the wrong rootCA. This causes the kube-root-ca problem in every pod, and the OCP release does not work until I stop the faulty kcm leader.

2 Bug fix

2.1 This bug can be resolved by adding a kube-apiserver-server-ca check to manageServiceAccountCABundle in targetconfigcontroller, at https://github.com/openshift/cluster-kube-controller-manager-operator/blob/4ca346ef97def3697f1aa0368c9b35459b9b2f59/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L706-L720

[screenshot]

2.2 Maybe something like this, though it is not the best way to resolve the problem:

[screenshot]

2.3 This can also be resolved by modifying the library-go function CombineCABundleConfigMaps in resourcesynccontroller, at https://github.com/openshift/cluster-kube-controller-manager-operator/blob/4ca346ef97def3697f1aa0368c9b35459b9b2f59/vendor/github.com/openshift/library-go/pkg/operator/resourcesynccontroller/core.go#L17-L67

[screenshot]
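A rough Go sketch of the library-level change suggested in 2.3, assuming the combine function gains a way to mark some inputs as required. The `caInput` type with its `Required` field and `combineCABundles` are hypothetical: the real CombineCABundleConfigMaps takes ResourceLocation values and a ConfigMap lister and has no such flag. Required inputs fail fast when missing; optional inputs keep the skip-if-missing behaviour, so existing callers would not change.

```go
package main

import (
	"fmt"
	"strings"
)

// caInput is a hypothetical input descriptor for this sketch.
type caInput struct {
	Name     string
	Required bool
}

// configMaps is a hypothetical stand-in for a ConfigMap lister:
// ConfigMap name -> value of its "ca-bundle.crt" key.
type configMaps map[string]string

// combineCABundles concatenates the input bundles in order. A missing or
// empty input is an error if it is Required and is skipped otherwise.
func combineCABundles(store configMaps, inputs ...caInput) (string, error) {
	var parts []string
	for _, in := range inputs {
		bundle := strings.TrimSpace(store[in.Name]) // "" if the key is absent
		if bundle == "" {
			if in.Required {
				return "", fmt.Errorf("required CA bundle configmap %q is missing or empty", in.Name)
			}
			continue
		}
		parts = append(parts, bundle)
	}
	return strings.Join(parts, "\n"), nil
}
```

An opt-in flag is one possible design. Making every missing input an error would change behaviour for all callers of the shared library, which may be why a targeted check (2.1) is the less invasive option.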

3 This is related to openshift/library-go issue #1472 (missing key configmap).

lance5890 commented 1 year ago

@ingvagabund @atiratree @deads2k @soltysh

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

lance5890 commented 1 year ago

/remove-lifecycle stale

lance5890 commented 1 year ago

/lifecycle frozen