redhat-cop / cert-utils-operator

Set of functionalities around certificates packaged in a Kubernetes operator
Apache License 2.0
95 stars 35 forks source link

Manager Container restarts due to slow API #110

Closed jdziedzic closed 2 years ago

jdziedzic commented 2 years ago

We are seeing an issue where our manager containers restarts due to slow API response:

E0114 17:33:16.025717       1 leaderelection.go:325] error retrieving resource lock openshift-operators/b7831733.redhat.io: Get "https://redacted_ip:443/api/v1/namespaces/openshift-operators/configmaps/b7831733.redhat.io": context deadline exceeded

I0114 17:33:16.025982       1 leaderelection.go:278] failed to renew lease openshift-operators/b7831733.redhat.io: timed out waiting for the condition

This creates a downstream problem for us as cert-utils reconciles resources on startup and makes changes to the secrets which causes reloader (https://github.com/stakater/Reloader) to think the cert has changed, when it hasn't. This causes mass app pod restarts.

By default, the operator runs in a single instance so there really is no need for a leader-election process.

Would it be possible to do one of the following:

  1. Allow the timeout to be configurable through a CR (Give API server more time to respond)
  2. Disable leader-election through CR
  3. Is there a way to modify the behavior of cert-utils from changing secrets when it reconciles on startup?
raffaelespazzoli commented 2 years ago
  1. using the comfigmap as a leader election mechanism is part of the operator sdk (although now there is a way to upgrade to using Leases). The leader election configuration is captured in this structure: https://pkg.go.dev/k8s.io/component-base/config/v1alpha1#LeaderElectionConfiguration. In the code that structure is initialized here: https://github.com/redhat-cop/cert-utils-operator/blob/e2b8c74c7a15d32d18e589a5e56fac286a44a1db/main.go#L79

So at the moment the timeout is not configurable.

No one ever had an issue with the defaults. So while we should perhaps make the timeout configurable, you should also investigate why your api server is so slow.

  1. this is not an option

  2. I would like to better understand what happens in this situation cert-utils will set the secret how it's configured to do, but if they are already correctly set, it should not do anything. So perhaps you found a bug there, can you supply a reproducer?

jdziedzic commented 2 years ago

I've narrowed it down to what specifically is changing each time.

This particular cert is taking advantage of the feature cert-utils-operator.redhat-cop.io/generate-java-keystores: 'true'

It appears that on every restart, the keystore.jks value is being re-generated which causes not only that value to change, but the resourceVersion value as well. This is why reloader is restarting the pods. To recreate this you would just need a cert secret with the additional annotations:

cert-utils-operator.redhat-cop.io/java-keystore-password: password
cert-utils-operator.redhat-cop.io/generate-java-keystores: 'true'

Then restart the cert-utils pod.

raffaelespazzoli commented 2 years ago

ok, I looked at the code and I believe I understand where on create it updates the object even if nothing is changed. If I change the update to server side patch, this particular issue should go away. As a workaround, are you aware that cert-manager can generate truststores and keystores directly?

raffaelespazzoli commented 2 years ago

can you test if this branch fixes your issue: https://github.com/redhat-cop/cert-utils-operator/tree/fix%23110

jdziedzic commented 2 years ago

As a workaround, are you aware that cert-manager can generate truststores and keystores directly?

We're quite a bit behind on cert-manager versions as we can't migrate to a new version until all of our app teams update their certificate specs to the new apiVersion. Do you know what version this is available in?

jdziedzic commented 2 years ago

can you test if this branch fixes your issue: https://github.com/redhat-cop/cert-utils-operator/tree/fix%23110

Would it be possible to push an image to quay.io that I could update the CSV to, to test?

jdziedzic commented 2 years ago

I tried to run through the build, but all of our internal firewall and security mechanisms make it impossible to run the makefile successfully. Any help in building the image and publishing to quay.io would be greatly appreciated. Then I can test.

raffaelespazzoli commented 2 years ago

I did some more investigation on this problem. When you calculate a keystore starting from PEM-formatted keys, you get a different value every time (maybe there is timestamp somewhere in it). So the operator behaves correctly as it sees that it needs to update those values... So the solution is not that simple. I need some time to figure out a clean way to solve this. In the mean time, disregard the PR, it's not correct.