kubernetes-sigs / controller-runtime

Repo for the controller-runtime subproject of kubebuilder (sig-apimachinery)
Apache License 2.0

Cache doesn't start if its config refers to types that are not installed #2456

Closed tiraboschi closed 3 months ago

tiraboschi commented 11 months ago

controller-runtime v0.16.0 fails to start the manager when the operator uses cache.Options with kinds, identified ByObject, that are not present on the specific cluster.

This can happen, for instance, when tuning the cache upfront for monitoring kinds while the Prometheus stack is not deployed, or for OKD/OCP-specific kinds on vanilla Kubernetes.
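
For illustration, a minimal sketch of a cache configuration along these lines (the concrete kinds, import paths and scheme wiring here are illustrative, not the exact operator code):

    // Sketch: a cache config that pre-tunes "optional" kinds. If the matching
    // API groups are not served on the cluster, manager creation fails with
    // controller-runtime v0.16.0.
    package main

    import (
        consolev1 "github.com/openshift/api/console/v1"
        monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
        "k8s.io/apimachinery/pkg/runtime"
        utilruntime "k8s.io/apimachinery/pkg/util/runtime"
        clientgoscheme "k8s.io/client-go/kubernetes/scheme"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func main() {
        scheme := runtime.NewScheme()
        // The optional API types must be registered in the scheme regardless of
        // whether the matching CRDs/APIs are installed on the cluster.
        utilruntime.Must(clientgoscheme.AddToScheme(scheme))
        utilruntime.Must(monitoringv1.AddToScheme(scheme))
        utilruntime.Must(consolev1.AddToScheme(scheme))

        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            Scheme: scheme,
            Cache: cache.Options{
                ByObject: map[client.Object]cache.ByObject{
                    // "Optional" kinds: only served when monitoring / the OCP console are installed.
                    &monitoringv1.PrometheusRule{}:  {},
                    &consolev1.ConsoleQuickStart{}: {},
                },
            },
        })
        if err != nil {
            // With controller-runtime v0.16.0 this is where the
            // "failed to determine if ... is namespaced" error surfaces
            // when one of the optional API groups is missing.
            panic(err)
        }
        _ = mgr
    }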

Now the operator dies with something like:

    failed to determine if *v1.ConsoleQuickStart is namespaced: failed to get restmapping: failed to find API group "console.openshift.io"","stacktrace":"github.com/kubevirt/hyperconverged-cluster-operator/cmd/cmdcommon.HcCmdHelper.ExitOnError
    /go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/cmdcommon/cmdcommon.go:97
main.main
    /go/src/github.com/kubevirt/hyperconverged-cluster-operator/cmd/hyperconverged-cluster-operator/main.go:104
runtime.main
    /usr/local/go/src/runtime/proc.go:250

The same code was working with controller-runtime <= v0.15.0.

This probably changed with https://github.com/kubernetes-sigs/controller-runtime/pull/2421. The error comes from https://github.com/kubernetes-sigs/controller-runtime/blob/c20ea143a236a34fb331e6c04820b75aac444e7d/pkg/cache/cache.go#L399-L405, which uses isNamespaced during cache initialization to ensure that byObject.Namespaces is not set for cluster-scoped kinds.

A simple fix would be to skip that additional check for kinds that do not exist on the cluster, instead of breaking the cache initialization and, with it, the manager's.

tiraboschi commented 11 months ago

@alvaroaleman FYI

alvaroaleman commented 11 months ago

So you have kinds in your cache config that you don't watch, since you say the config worked in 0.15? That doesn't seem correct

troy0820 commented 11 months ago

/kind bug support

alvaroaleman commented 11 months ago

/retitle Cache doesn't start if its config refers to types that are not installed

tiraboschi commented 11 months ago

So you have kinds in your cache config that you don't watch, since you say the config worked in 0.15? That doesn't seem correct

No, not really. During its initialization our operator uses the APIReader client (which is not cache based) to try to get specific "optional" resources (like the OKD/OCP console kinds, since we have a dynamic plugin for the UI, or the monitoring stack, which can be skipped on low-footprint environments), and from the eventual error we detect a missing kind.

Based on those results, we add the "optional" kinds to the list of watched ones: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/ae66846d229c1500aadcf48cb24b2e95b2a0547d/controllers/hyperconverged/hyperconverged_controller.go#L190-L207

This was working in v0.15 and earlier, but it no longer works with v0.16, since manager initialization fails due to the cache sanity checks on the missing kinds.

A possible workaround on our side is to bootstrap a first manager with no custom cache (or to use another kind of client just for that), detect the presence of the optional kinds earlier, and use those results to bootstrap the real manager with a cache configured only for the existing kinds, but this sounds like overkill to me.

alvaroaleman commented 11 months ago

Well if you already have code to detect presence and create watches, you can use that same code to conditionally add cache config for those types depending on presence

tiraboschi commented 11 months ago

Well if you already have code to detect presence and create watches, you can use that same code to conditionally add cache config for those types depending on presence

But we have another issue here: currently our detection code uses the non-caching client we get from mgr.GetAPIReader(), but with v0.16 we cannot reach that stage, since we fail earlier, in mgr, err := manager.New(...), when we try to instantiate the manager with a cache config referring to missing kinds.

Can we instantiate the manager with the default cache and configure a custom one only at a later stage?

alvaroaleman commented 11 months ago

You don't need a manager to get a client. You can use client.New to get a client
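
For example, a minimal sketch of such a presence check, assuming all that's needed is to know whether an API group/kind is served before the manager exists (the helper name optionalKindAvailable and its package are made up):

    // optionalKindAvailable reports whether a given GroupVersionKind is served by
    // the cluster, using a plain client that needs no manager or cache.
    package optionalkinds

    import (
        "k8s.io/apimachinery/pkg/api/meta"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/rest"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func optionalKindAvailable(cfg *rest.Config, gvk schema.GroupVersionKind) (bool, error) {
        // client.New talks to the API server directly; unlike mgr.GetAPIReader(),
        // it can be used before any manager exists.
        c, err := client.New(cfg, client.Options{})
        if err != nil {
            return false, err
        }

        // Ask the client's RESTMapper whether the kind has a REST mapping.
        _, err = c.RESTMapper().RESTMapping(gvk.GroupKind(), gvk.Version)
        switch {
        case err == nil:
            return true, nil
        case meta.IsNoMatchError(err):
            // The API group/kind is not installed on this cluster.
            return false, nil
        default:
            // Depending on the mapper implementation, a "group not found" error
            // may not be a typed no-match error, so callers may prefer to treat
            // any mapping failure as "absent".
            return false, err
        }
    }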

tiraboschi commented 11 months ago

Thanks, that worked correctly for our case.

troy0820 commented 11 months ago

Can this be closed? /remove-kind bug

tiraboschi commented 11 months ago

Can this be closed?

I fear we are not going to be the only ones hitting this. If we cannot really prevent it, I think we should at least document it somewhere.

alvaroaleman commented 11 months ago

I fear we are not going to be the only ones hitting this.

Any data to back that up? IMHO setting up config for types you don't actually use is pretty questionable.

tiraboschi commented 11 months ago

No, honestly, I was just guessing. But the point here is not that we want to configure the cache for kinds we are not going to use; it is that we do not want to fail when we configure the cache for "optional" kinds (like PrometheusRule) that may not be available at runtime because the cluster admin decided not to install a "soft dependency" (like the monitoring stack, to keep the cluster small).

vincepri commented 11 months ago

To clarify, the ask is to handle "optional" reconcilers/controllers that might or might not be installed in the cluster?

tiraboschi commented 11 months ago

Yes, correct: say, for example, that we want to reconcile a set of PrometheusRules related to our software package, but only if the cluster admin decided to also deploy a monitoring stack on the cluster.

vincepri commented 11 months ago

For now controller-runtime requires every CRD to be available and served before any reconciler or cache is running. The ByObject struct might allow a conditional use case like the one you're describing, but extensive changes would be needed if this isn't just a "startup problem" and you instead want to react to changes once the CRDs are installed.

Has this problem been solved somehow downstream yet?

tiraboschi commented 11 months ago

Up to controller-runtime v0.15.0 this was an issue only for the reconciler/watch, not really for the manager cache. So we already had a bit of logic in our operator to detect the optional, missing kinds (only at startup time; a restart is then required to react to the deployment of an initially missing CRD) and avoid watching them. We solved this by simply running that logic earlier, so that we can also use its results to configure the cache accordingly. If fixing this is not an option, I'd suggest at least documenting it as a reference for our users.
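
For reference, a rough sketch of that pattern, reusing a presence check like the client.New-based helper sketched above (PrometheusRule and the prometheus-operator import are just the example kinds from this thread, in the same hypothetical package):

    // Sketch: probe the optional kinds with a plain client first, then build the
    // manager with a cache config that only mentions kinds that are actually served.
    package optionalkinds

    import (
        monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/runtime/schema"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/manager"
    )

    func newManagerWithOptionalCache(scheme *runtime.Scheme) (manager.Manager, error) {
        cfg := ctrl.GetConfigOrDie()

        byObject := map[client.Object]cache.ByObject{}

        prGVK := schema.GroupVersionKind{Group: "monitoring.coreos.com", Version: "v1", Kind: "PrometheusRule"}
        if ok, err := optionalKindAvailable(cfg, prGVK); err == nil && ok {
            // Only tune the cache for PrometheusRule when the monitoring stack is installed.
            byObject[&monitoringv1.PrometheusRule{}] = cache.ByObject{}
        }

        return ctrl.NewManager(cfg, ctrl.Options{
            Scheme: scheme, // the optional types must still be registered in the scheme
            Cache:  cache.Options{ByObject: byObject},
        })
    }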

vincepri commented 11 months ago

What's the error message you're getting now?

tiraboschi commented 11 months ago

failed to determine if *v1.ConsoleQuickStart is namespaced: failed to get restmapping: failed to find API group "console.openshift.io""

vincepri commented 11 months ago

That was introduced in https://github.com/kubernetes-sigs/controller-runtime/commit/3e35cab1ea29145d8699259b079633a2b8cfc116. The checks there sound reasonable to me in general because we expect the CRDs to be installed when configuring the cache.

Allowing the registration of types that aren't yet installed in the cluster, or not having the ability to inspect whether an object is namespaced when configuring the cache, isn't something we should do IMO, given we're trying to avoid potential footguns.

rickysway commented 10 months ago

I fear we are not going to be the only ones hitting this.

Having recently moved up to controller-runtime 0.16.x as well, we are seeing a similar error in our Operator Pod:

failed to get restmapping: no matches for kind \"KindB\" in group \"GroupB\"

Our process:

1. Install Operator A, which has an optional dependency on Operator B.
2. Operator A starts running.
3. Operator B is installed at a later date (which creates the CRD for KindB/GroupB).
4. Operator A sees Operator B is installed and tries to create a KindB/GroupB resource.
5. Operator A errors with the above. (The only way to recover is to bounce the Pod to re-initialise Operator A's Controller Manager.)

Notably, this issue has only occurred for us as of the commit mentioned above in 0.16.0; previous versions would allow us to create KindB/GroupB if the CRDs were loaded in at a later time than Operator A was installed.

Just to confirm, the changes from

    mgr, err := ctrl.NewManager(cfg, ctrl.Options{
        Namespace:              namespace,
        MetricsBindAddress:     fmt.Sprintf("%s:%d", metricsHost, metricsPort),
        HealthProbeBindAddress: fmt.Sprintf("%s:%d", healthzHost, healthzPort),
    })

to

    defaultNamespaces := map[string]cache.Config{
        namespace: {},
    }
    if namespace == "" {
        defaultNamespaces = nil
    }

    mgr, err := ctrl.NewManager(cfg, ctrl.Options{
        Cache: cache.Options{
            DefaultNamespaces: defaultNamespaces,
        },
        Metrics: server.Options{
            BindAddress: fmt.Sprintf("%s:%d", metricsHost, metricsPort),
        },
        HealthProbeBindAddress: fmt.Sprintf("%s:%d", healthzHost, healthzPort),
    })

was indeed the right thing to do.

Happy to move my issue into a new ticket to avoid bloating this one

akalenyu commented 9 months ago

That was introduced in 3e35cab. The checks there sound reasonable to me in general because we expect the CRDs to be installed when configuring the cache.

Allowing the registration of types that aren't yet installed in the cluster, or not having the ability to inspect whether an object is namespaced when configuring the cache, isn't something we should do IMO, given we're trying to avoid potential footguns.

I definitely understand the footgun concern, but if this is just about finding out whether a resource is namespaced, would you consider a patch that lets the user declare upfront whether a resource is namespaced, as an escape hatch?

Honestly, to me it was a bit confusing to learn that the default dynamic & lazy mapper is not really dynamic (which is a change in behavior compared to < 0.16).

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/controller-runtime/issues/2456#issuecomment-2028479830):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.