Closed: tiraboschi closed this issue 8 months ago
@alvaroaleman FYI
So you have kinds in your cache config that you don't watch, since you say the config worked in 0.15? That doesn't seem correct
/kind bug support
/retitle Cache doesn't start if its config refers to types that are not installed
So you have kinds in your cache config that you don't watch, since you say the config worked in 0.15? That doesn't seem correct
No, not really. During its initialization our operator uses the APIReader client (which is not cache based) to try to get specific "optional" resources (like the OKD/OCP console kinds, since we have a dynamic plugin for the UI, or the monitoring stack, which can be skipped on low-footprint environments), and from the eventual error we detect a missing kind.
Based on those results, we add the "optional" kinds to the list of watched ones: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/ae66846d229c1500aadcf48cb24b2e95b2a0547d/controllers/hyperconverged/hyperconverged_controller.go#L190-L207
This was working in v0.15 and earlier, but it is not working with v0.16, since we now fail the initialization of the manager due to the missing kinds in the cache sanity checks.
A possible workaround on our side is bootstrapping a first manager with no custom cache (or using another kind of client just for that), detecting the presence of the optional kinds earlier, and using the results to bootstrap the real manager with a cache configured only for existing kinds, but this sounds like overkill to me.
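For reference, a minimal sketch of the detection pattern described above; the real logic lives in the linked hyperconverged-cluster-operator code, and the helper name and error handling here are hypothetical:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// optionalKindServed (hypothetical helper) probes an "optional" object with a
// non-caching reader and classifies the error: a NotFound still means the kind
// itself is served, while a RESTMapper "no match" error means the CRD or the
// whole API group is not installed on this cluster. Depending on the
// controller-runtime version the no-match error may be wrapped, so a stricter
// check (errors.As against *meta.NoKindMatchError) may be needed.
func optionalKindServed(ctx context.Context, reader client.Reader, key client.ObjectKey, obj client.Object) (bool, error) {
	err := reader.Get(ctx, key, obj)
	switch {
	case err == nil, apierrors.IsNotFound(err):
		return true, nil
	case meta.IsNoMatchError(err):
		return false, nil
	default:
		return false, err
	}
}
```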
Well if you already have code to detect presence and create watches, you can use that same code to conditionally add cache config for those types depending on presence
Well if you already have code to detect presence and create watches, you can use that same code to conditionally add cache config for those types depending on presence
But we have another issue here:
currently our detection code uses the non-caching client we get from mgr.GetAPIReader(), but with v0.16 we cannot even reach that stage, since we fail earlier in mgr, err := manager.New( when we try to instantiate the manager with a cache config referring to missing kinds.
Can we instantiate the manager with the default cache and configure a custom one only at a later stage?
You don't need a manager to get a client. You can use client.New to get a client
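A minimal sketch of that suggestion, assuming consolev1 (github.com/openshift/api/console/v1) stands in for any optional type and that scheme already registers it: the client from client.New exists independently of the manager, so its RESTMapper can be used to decide which optional kinds go into the cache config before the manager is built.

```go
import (
	consolev1 "github.com/openshift/api/console/v1" // stand-in for any "optional" API
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager builds the manager only after probing the cluster with a plain,
// non-cached client created via client.New, so the "optional" kind is added to
// the cache config only when its API group is actually served.
func newManager(scheme *runtime.Scheme) (manager.Manager, error) {
	cfg := ctrl.GetConfigOrDie()

	// Manager-independent, non-caching client used only for the presence probe.
	probe, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, err
	}

	byObject := map[client.Object]cache.ByObject{}

	// A failing RESTMapping lookup means the optional group/kind is not installed,
	// so it is simply left out of the cache configuration.
	gk := schema.GroupKind{Group: "console.openshift.io", Kind: "ConsoleQuickStart"}
	if _, err := probe.RESTMapper().RESTMapping(gk, "v1"); err == nil {
		byObject[&consolev1.ConsoleQuickStart{}] = cache.ByObject{}
	}

	return ctrl.NewManager(cfg, ctrl.Options{
		Scheme: scheme,
		Cache:  cache.Options{ByObject: byObject},
	})
}
```

Any similar presence check (for example a discovery call) would work equally well; the important part is that it happens before the manager is created.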
Thanks, that worked for our case.
Can this be closed? /remove-kind bug
Can this be closed?
I fear we are not going to be the only ones hitting this. If we cannot really prevent it, I think we should at least document it somewhere.
I fear we are not going to be the only ones hitting this.
Any data to back that up? IMHO setting up config for types you don't actually use is pretty questionable.
No, honestly, I was just guessing. But the point here is not that we want to configure the cache for kinds we are not going to use; it's that we do not want to fail when we configure the cache for "optional" kinds (like PrometheusRule) that could potentially not be available at runtime on the cluster, if the cluster admin decided not to install a "soft dependency" (like the monitoring stack, to keep the cluster small).
To clarify, the ask is to handle "optional" reconcilers/controllers whose types might or might not be installed in the cluster?
Yes, correct: let's say, for example, that we want to reconcile a set of PrometheusRules related to our software package, but only if the cluster admin also decided to deploy a monitoring stack on the cluster.
For now controller-runtime requires every CRD to be available and served before any reconciler or cache is running. The ByObject struct might allow a conditional use case like the one you're describing, but extensive changes would be needed if this isn't just a "startup problem" and you also want to react to changes once the CRDs are installed later.
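As a rough illustration of the "optional controller" side (distinct from the cache config), a hedged sketch that wires the optional watch only when the CRD is served at startup. The HyperConvergedReconciler, hcov1beta1, and monitoringv1 names are assumptions, and this does not handle CRDs installed after the manager has started:

```go
import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1" // optional API, assumed
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"

	hcov1beta1 "example.com/my-operator/api/v1beta1" // hypothetical primary API of the operator
)

func (r *HyperConvergedReconciler) SetupWithManager(mgr ctrl.Manager) error {
	b := ctrl.NewControllerManagedBy(mgr).
		For(&hcov1beta1.HyperConverged{})

	// Wire the optional watch only when the monitoring CRDs are served; whether
	// Owns or Watches fits best depends on how the PrometheusRules are created.
	gk := schema.GroupKind{Group: "monitoring.coreos.com", Kind: "PrometheusRule"}
	if _, err := mgr.GetRESTMapper().RESTMapping(gk, "v1"); err == nil {
		b = b.Owns(&monitoringv1.PrometheusRule{})
	}

	return b.Complete(r)
}
```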
Has this problem been solved somehow downstream yet?
Up to controller-runtime v0.15.0 this was an issue only for the reconciler/watch, not really for the manager cache. So we already had a bit of logic in our operator to detect the optional and missing kinds and avoid watching them (only at startup time; a restart is eventually required to react to the deployment of an initially missing CRD). We solved this simply by running that logic earlier, so that we can also configure the cache according to its results. If fixing this is not an option, I'd suggest at least documenting it as a reference for other users.
What's the error message you're getting now?
failed to determine if *v1.ConsoleQuickStart is namespaced: failed to get restmapping: failed to find API group "console.openshift.io""
(for a different instance)
failed to determine if *v1.ConsoleQuickStart is namespaced: failed to get restmapping: failed to find API group "console.openshift.io""
That was introduced in https://github.com/kubernetes-sigs/controller-runtime/commit/3e35cab1ea29145d8699259b079633a2b8cfc116. The checks there sound reasonable to me in general because we expect the CRDs to be installed when configuring the cache.
Allowing types to be registered that aren't yet installed in the cluster, or not being able to inspect whether an object is namespaced when configuring the cache, isn't something we should do imo, given we're trying to avoid potential footguns.
I fear we are not going to be the only ones hitting this.
Having recently moved up to controller-runtime 0.16.x as well, we are seeing a similar error in our Operator Pod:
failed to get restmapping: no matches for kind \"KindB\" in group \"GroupB\"
Our process:
1. Install Operator A, which has an optional dependency on Operator B.
2. Operator A starts running.
3. Operator B is installed at a later date (which creates the CRD for KindB/GroupB).
4. Operator A sees Operator B is installed and tries to create a KindB/GroupB resource.
5. Operator A errors with the above. (The only way to recover is to bounce the Pod to re-initialise Operator A's controller manager.)
Notably, this issue has only occurred for us as of the commit mentioned above in 0.16.0; previous versions would allow us to create KindB/GroupB even if the CRDs were loaded in at a later time than Operator A was installed.
Just to confirm that the change from
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    Namespace:              namespace,
    MetricsBindAddress:     fmt.Sprintf("%s:%d", metricsHost, metricsPort),
    HealthProbeBindAddress: fmt.Sprintf("%s:%d", healthzHost, healthzPort),
})
to
defaultNamespaces := map[string]cache.Config{
    namespace: {},
}
if namespace == "" {
    defaultNamespaces = nil
}
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    Cache: cache.Options{
        DefaultNamespaces: defaultNamespaces,
    },
    Metrics: server.Options{
        BindAddress: fmt.Sprintf("%s:%d", metricsHost, metricsPort),
    },
    HealthProbeBindAddress: fmt.Sprintf("%s:%d", healthzHost, healthzPort),
})
was indeed the right thing to do.
Happy to move my issue into a new ticket to avoid bloating this one
That was introduced in 3e35cab. The checks there sound reasonable to me in general because we expect the CRDs to be installed when configuring the cache.
Allowing types to be registered that aren't yet installed in the cluster, or not being able to inspect whether an object is namespaced when configuring the cache, isn't something we should do imo, given we're trying to avoid potential footguns.
I definitely understand the footgun concern, but if this is just about finding out whether a resource is namespaced or not, would you consider a patch that allows the user to declare upfront whether a resource is namespaced, as an escape hatch?
Honestly, to me it was a bit confusing to learn that the default dynamic & lazy mapper is not really dynamic (which is a change in behavior compared to < 0.16).
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
controller-runtime v0.16.0 fails to start the manager when the operator uses cache.Options with kinds identified ByObject that are not present on the specific cluster. This can happen, for instance, when fine-tuning the cache upfront for monitoring kinds while the Prometheus stack is not deployed, or for OKD/OCP-specific kinds when running on vanilla k8s.
Now the operator dies with an error like the one quoted above; the same code was working with controller-runtime <= v0.15.0.
This probably changed with https://github.com/kubernetes-sigs/controller-runtime/pull/2421. The error comes from https://github.com/kubernetes-sigs/controller-runtime/blob/c20ea143a236a34fb331e6c04820b75aac444e7d/pkg/cache/cache.go#L399-L405, which uses isNamespaced during cache initialization to ensure that byObject.Namespaces is not set for a cluster-scoped kind. A simple fix would be to skip that additional check for non-existing kinds instead of breaking the cache initialization and, with it, the manager initialization.
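For completeness, a minimal sketch (not taken from the operator itself; consolev1 and scheme are assumptions) of the kind of configuration that reproduces the failure on a cluster where the console.openshift.io group is not installed:

```go
import (
	consolev1 "github.com/openshift/api/console/v1" // optional API group, assumed registered in scheme
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManagerThatFailsOnVanillaK8s(scheme *runtime.Scheme) (ctrl.Manager, error) {
	// On a cluster without the console.openshift.io API group, controller-runtime
	// v0.16.0 already fails here with an error like
	//   failed to determine if *v1.ConsoleQuickStart is namespaced: failed to get restmapping: ...
	// while with <= v0.15.0 the missing kind only surfaced once it was actually watched.
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme: scheme,
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&consolev1.ConsoleQuickStart{}: {},
			},
		},
	})
}
```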