open-policy-agent / cert-controller

Apache License 2.0

Delay when the certs are mounted and available for use #35

Open aramase opened 3 years ago

aramase commented 3 years ago

I'm using cert-controller in one of my projects to bootstrap a mutating webhook. I've configured the rotator using the example provided in the docs. In most CI runs and local tests, I'm seeing a delay before the certs are available in the mount. In a few instances it takes up to 1m30s before the certs are ready in the mount path. The delay could be because the Kubernetes secret update is delayed and the first mount republish misses it.

Is this known behavior? Is that why the struct has a RestartOnSecretRefresh property?

    github.com/open-policy-agent/cert-controller v0.2.0
    k8s.io/kubernetes v1.21.2
    sigs.k8s.io/controller-runtime v0.9.2

Usage:

    // Make sure certs are generated and valid if cert rotation is enabled.
    setupFinished := make(chan struct{})
    if !disableCertRotation {
        entryLog.Info("setting up cert rotation")
        if err := rotator.AddRotator(mgr, &rotator.CertRotator{
            SecretKey: types.NamespacedName{
                Namespace: util.GetNamespace(),
                Name:      secretName,
            },
            CertDir:        webhookCertDir,
            CAName:         caName,
            CAOrganization: caOrganization,
            DNSName:        dnsName,
            IsReady:        setupFinished,
            Webhooks:       webhooks,
        }); err != nil {
            entryLog.Error(err, "unable to set up cert rotation")
            os.Exit(1)
        }
    } else {
        close(setupFinished)
    }

aramase commented 3 years ago

cc @maxsmythe @adrianludwin

adrianludwin commented 3 years ago

Yes, this is exactly why RestartOnSecretRefresh exists. Kubelets only check for new secrets occasionally and it can take a significant amount of time for them to become visible to the pod, but a pod restart fixes the problem almost instantly. I did think it would take less than 90s, but I had at least 30s in my mind. Max might have a better idea than me.

aramase commented 3 years ago

@adrianludwin thanks for the response. I haven't been able to repro this in Gatekeeper. There, the behavior is mostly deterministic and the certs are available within a few seconds after startup. I wonder why the behavior differs for other projects that take this dependency.

adrianludwin commented 3 years ago

IIRC Gatekeeper always uses the restart strategy but I'm not sure. Sorry, this is the limit of my knowledge!

maxsmythe commented 3 years ago

G8r doesn't currently use the restart strategy.

One potential thing that helps g8r: there are multiple pods invoking cert rotator, so it's possible that the secret is only missing for the first pod, while other pods only start after the first pod has written out a cert.

adrianludwin commented 3 years ago

TIL. Thanks Max!

aramase commented 3 years ago

> One potential thing that helps g8r: there are multiple pods invoking cert rotator, so it's possible that the secret is only missing for the first pod, while other pods only start after the first pod has written out a cert.

Thanks Max! I wondered the same and tried with replicas: 1, but I still haven't been able to repro this issue with Gatekeeper.

maxsmythe commented 3 years ago

Weird, not sure why then. Noticing that the secret has been updated, and updating the mounted file to match, is handled by K8s itself, so I'm not sure there is anything we can do to influence that. If you want to see how G8r uses cert-controller, it lives here:

https://github.com/open-policy-agent/gatekeeper/blob/master/main.go