Yeah, this is a common-ish problem and something that controller-runtime doesn't help with. On a conceptual level, you likely want the handler itself to be a controller (i.e. have a workqueue and a reconciler that can fail, resulting in retries with exponential backoff).
I've built something like this that uses a source.Channel to pass reconcile.Requests around some time ago: https://github.com/openshift/ci-tools/blob/946582d9de6c25b241fcaa7ffe9f1fe9ff2921c8/pkg/controller/promotionreconciler/reconciler.go#L173
Maybe the easiest would be to build an EnqueueRequestsFromFallibleMapFunc(func(client.Object) ([]reconcile.Request, error)) that internally constructs a controller whose Reconciler just calls the map func until it succeeds?
Thanks for your reply @alvaroaleman.
Maybe the easiest would be to build an EnqueueRequestsFromFallibleMapFunc(func(client.Object) ([]reconcile.Request, error)) that internally constructs a controller whose Reconciler just calls the map func until it succeeds?
My understanding of what you're suggesting is that EnqueueRequestsFromFallibleMapFunc would be an alternative map-handler constructor in handler, and under the hood it would simply call AddRateLimited(obj) on the rate-limited workqueue whenever an error occurs, re-enqueuing that object until the errors go away. This is pretty much the kind of thing I had in mind, so I'm in favor, but is that accurate, or was there more to it than that? If it is accurate and there's consensus among the maintainers that this is something that would be accepted, I or one of my colleagues would like to start work on it.
Hi 👋
As I think about this problem more and more, I believe one more option would be to either introduce an additional API that accounts for errors and allows additional retry logic, something like
ctrl.NewControllerManagedBy(mgr).WatchesWithRetry(
    &source.Kind{Type: &customapi.Tree{}},
    handler.EnqueueRequestsFromMapFuncWithError(r.mapFruitsFromTree),
    ctrl.Retry{
        Max:       3,
        RetryTime: 3 * time.Second,
        // Or even more complex logic
    },
)
(not ideal: yet another method to implement, document, test, and propagate to users) OR make an adapter (or adapters) that encapsulates the retry logic itself, wrapping the calls to EnqueueRequestsFromMapFunc (basically EnqueueRequestsFromFallibleMapFunc with logic to stop once particular conditions are met).
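For the adapter route, a minimal sketch of what that could look like today, assuming the newer MapFunc signature that takes a context. FallibleMapFunc and WithRetries are purely illustrative names, not existing controller-runtime API, and a real implementation would probably want to reuse the workqueue's rate limiter rather than client-go's retry helper:

import (
    "context"

    "k8s.io/client-go/util/retry"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// FallibleMapFunc is a hypothetical map func variant that can report failure.
type FallibleMapFunc func(context.Context, client.Object) ([]reconcile.Request, error)

// WithRetries adapts a FallibleMapFunc into the plain MapFunc accepted by
// handler.EnqueueRequestsFromMapFunc, retrying the mapping with bounded backoff.
func WithRetries(fn FallibleMapFunc) handler.MapFunc {
    return func(ctx context.Context, obj client.Object) []reconcile.Request {
        var recs []reconcile.Request
        _ = retry.OnError(retry.DefaultBackoff, func(error) bool { return true }, func() error {
            var err error
            recs, err = fn(ctx, obj)
            return err
        })
        return recs
    }
}

It would then be wired up as handler.EnqueueRequestsFromMapFunc(WithRetries(r.mapFruitsFromTree)), with the mapping function changed to return its error instead of swallowing it.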
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
@alvaroaleman are you still on board with the solution you and I discussed some months prior? Is anyone opposed to us going ahead and fixing this as we described above?
@shaneutt Open to learning a bit more about what you have in mind; we've recently added a ctx (only on the main branch) as the first parameter in the MapFunc, although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors?
@shaneutt Open to learning a bit more about what you have in mind; we've recently added a ctx (only on the main branch) as the first parameter in the MapFunc
I can see the ctx being helpful in some situations for sure :+1:
although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors?
Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun.
After thinking more about how this could go wrong, and considering that a ctx will be available going forward, I'm actually starting to think that what we need is a re-queuing mechanism that makes the downstream developer responsible for handling errors instead of controller-runtime. It would still give them a way to retry or re-queue the object when problems arise during mapping, but it would be more explicit, and whether or not the reason is an error would be their prerogative. Perhaps something where:
func (r *myReconciler) mapFruitsFromTree(tree client.Object) (recs []reconcile.Request) {
    var fruits customapi.FruitList
    if err := r.client.List(context.Background(), &fruits); err != nil {
        // The error is logged, but the event is effectively dropped: no
        // requests are returned and nothing re-triggers the mapping.
        r.log.WithError(err).Error("error listing fruits")
        return
    }
    for _, fruit := range fruits.Items {
        if isRelated(tree, fruit) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: fruit.Namespace,
                    Name:      fruit.Name,
                },
            })
        }
    }
    return
}
becomes:
func (r *myReconciler) mapFruitsFromTree(tree client.Object) (requeue bool, recs []reconcile.Request) {
    var fruits customapi.FruitList
    if err := r.client.List(context.Background(), &fruits); err != nil {
        r.log.WithError(err).Error("error listing fruits")
        // Signal the caller to re-queue the Tree so the mapping is retried.
        requeue = true
        return
    }
    for _, fruit := range fruits.Items {
        if isRelated(tree, fruit) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: fruit.Namespace,
                    Name:      fruit.Name,
                },
            })
        }
    }
    return
}
Curious as to your thoughts on that :thinking:
I like a combination of adding the ctx and the requeue bool.
We are also running into this issue where we need to list objects in our mapping function. If we wait on context.Background() forever, that's going to block the controller (somewhere? not sure). If we add a timeout and log the error, our system is going to end up in an inconsistent state.
@shaneutt
Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun.

If the client is returning errors consistently, then it's not like the controller would be able to make progress otherwise. Can you explain the footgun more? If these mapper functions need to work correctly to keep the system consistent, then infinite retry seems necessary.
It's worth mentioning that the current implementation is a footgun right now because the common usage results in dropped events and there's not really a way to use it correctly.
Been a while since I wrote that but I think I was just a little cautious of weird ways people might use this. Ultimately however I agree with you that we should move forward with this.
@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued? That would make the behavior similar to Reconcile. It's a bit more ergonomic than having to create a logger, log your error, and return a bool.
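For concreteness, a sketch of the signature being floated (purely illustrative, not an existing controller-runtime type; the ctx parameter assumes the newer MapFunc shape):

// MapFuncWithError is a hypothetical map func whose non-nil error would be
// logged by the handler and the mapping re-attempted, analogous to an error
// returned from Reconcile.
type MapFuncWithError func(ctx context.Context, obj client.Object) ([]reconcile.Request, error)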
That seems entirely reasonable :+1:
@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued? That would make the behavior similar to Reconcile. It's a bit more ergonomic than having to create a logger, log your error, and return a bool.
I like this idea as well. One note though: I'd consider a separate type just for the sake of separation of concerns and not coupling this with reconciliation. WDYT?
I put []reconcile.Request because that's the current type. I don't have a strong opinion on changing it, other than that we should probably keep it the same unless there is a good reason to change it.
@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued?
That isn't possible. We don't know what to requeue when the map function failed, as its purpose is to tell us that :) The only thing we could do is have a handler implementation that internally is a controller and thus has its own workqueue as suggested in https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1241263216
That isn't possible. We don't know what to requeue when the map function failed, as its purpose is to tell us that :)
Sorry for imprecise language. I mean requeuing to call the map function itself again. That sounds like what https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1241263216 is describing.
One thing I noticed today: the newer function signature in v0.15.0 accepts a context.Context:
type MapFunc func(context.Context, client.Object) []reconcile.Request
Any function which accepts a context is inherently fallible or lossy. It doesn't make sense to have a context without returning an error, so an error should probably be added to the existing MapFunc definition, rather than creating a separate fallible MapFunc type.
If we want to have an infallible MapFunc which cannot error, it should not have context.Context.
The caller can still check (Context).Err, though it's possible that the return value could change in between the called function returning and when the caller checks whether (Context).Err now returns non-nil. I take it you're assuming that the called function needs a way to indicate that it had to give up due to the Context before it could complete all of its intended work.

You could do that, but it goes against general Go practices. A function that can cause an error should handle it locally or return it up the stack so it can be handled or logged higher up. Expecting the calling function to check (Context).Err rather than an error is pretty unusual.
It's possible that the return value could change in between the called function returning and when the caller checks whether (Context).Err now returns non-nil.
I suppose even with this race condition, it would be valid for the caller to ignore whatever results came from MapFunc because the caller will fail due to the ctx error. You could get away with this from a technical perspective.
It's still odd. Pretty much any function which the MapFunc could pass that context into will return its own error, which should be passed up the stack. The common usage of that context is for client functions, which can fail.
I'm familiar with Context's motivations and behavior, and I don't see any relationship between it and the fallibility of functions that use it, other than that it introduces another reason why such functions can "fail": namely, cooperative cancellation.

Functions accept a Context either as a means of agreeing to be interrupted (because they do something that would block long enough to warrant respecting a request for interruption) or, less pertinently, because they need to read or bind values with dynamic extent.
+1 from my side to adding a return error, even though this would be a breaking change; before doing so, though, we need to nail down what the behavior is going to look like and how configurable it should be.
Functions accept a Context either as a means of agreeing to be interrupted (because they do something that would block long enough to warrant respecting a request for interruption) or, less pertinently, because they need to read or bind values with dynamic extent.

Even when you need to read or bind values with dynamic extent, that is something that can fail and probably needs error handling (as with logger.FromContext(ctx)). It may not be true in every case: you could handle the missing values by providing a default, or it may be reasonable to panic if your code is structured in a way that it should never happen. But this map func API can't make such assumptions about how it is going to be used; it needs to support standard error handling.
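As a small illustration of that contrast, a sketch using logr's context helpers (this is only an example of "default vs. error" handling, not a proposal for this API):

// logr.FromContext reports a missing logger as an error, leaving the caller
// to decide how to handle it...
logger, err := logr.FromContext(ctx)
if err != nil {
    // ...for example by falling back to a no-op default instead of failing.
    logger = logr.Discard()
}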
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
Reading through this, I understand that the missing change conveyed here is simply allowing the API to return an error: https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1655869605
Then, by looking at the problem:
Now ideally you use metav1.ObjectReference when dealing with relationships between objects, however it can sometimes be the case that you need to use the API client or some other utility that may produce an error to map the related objects,
What it seems you are trying to do (unless I misunderstand it) is: "I want to watch and trigger the reconcile for a CRD B in the controller for my CRD A." That means your CRD A, under certain conditions, requires / depends on your CRD B.
I have an old sample project that does that. You can see here that in a controller I am watching my other Kind in the same way that I watch the Services etc. created and owned by it.
So it seems to me that it should be done as you would with any core API, using the owner refs, so this problem would not be faced. Alternatively, you can trigger any other reconciliation directly, as suggested in https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1586257207.
The referenced controller relies on the owner relationship (https://github.com/dev4devs-com/postgresql-operator/blob/c71c742743270ded079188ada7ced2132b7c820f/pkg/service/watches.go#L13), which is not always available.
For instance: we implement Gateway API's Gateway controller, and in that controller (as one example) we want to watch for changes to GatewayClass so that the Gateways using that class can be enqueued. Something like this:
func (r *Reconciler) Setup() {
    ...
    Watches(
        &gatewayv1.GatewayClass{},
        handler.EnqueueRequestsFromMapFunc(r.listGatewaysForGatewayClass),
        builder.WithPredicates(predicate.NewPredicateFuncs(GatewayClassMatchesController))).
    ...
}

...

func GatewayClassMatchesController(obj client.Object) bool {
    gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
    if !ok {
        return false
    }
    return string(gatewayClass.Spec.ControllerName) == vars.ControllerName()
}

...

func (r *Reconciler) listGatewaysForGatewayClass(ctx context.Context, obj client.Object) (recs []reconcile.Request) {
    gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
    if !ok {
        return
    }
    gateways := new(gatewayv1.GatewayList)
    if err := r.Client.List(ctx, gateways); err != nil {
        // The List error is silently dropped: the mapping returns no
        // requests and nothing retries it.
        return
    }
    for _, gateway := range gateways.Items {
        if gateway.Spec.GatewayClassName == gatewayv1.ObjectName(gatewayClass.Name) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: gateway.Namespace,
                    Name:      gateway.Name,
                },
            })
        }
    }
    return
}
There is no owner relationship between the two, so we cannot use handler.TypedEnqueueRequestForOwner or similar. Neither the predicate type nor the handler's enqueue map func allows handling requeues; as a matter of fact, https://github.com/kubernetes-sigs/controller-runtime/blob/f9d51ba63e205155a0d25efff911673ea201426c/pkg/handler/enqueue_owner.go does not allow requeuing either.
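If the error-returning map func discussed above existed, a sketch of what we would want to write instead (a hypothetical signature; today EnqueueRequestsFromMapFunc does not accept an error return) would look like:

// Hypothetical: the List error is returned so the handler could log it and
// retry the mapping, instead of the event being silently dropped.
func (r *Reconciler) listGatewaysForGatewayClass(ctx context.Context, obj client.Object) ([]reconcile.Request, error) {
    gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
    if !ok {
        return nil, nil
    }
    gateways := new(gatewayv1.GatewayList)
    if err := r.Client.List(ctx, gateways); err != nil {
        return nil, err
    }
    var recs []reconcile.Request
    for _, gateway := range gateways.Items {
        if gateway.Spec.GatewayClassName == gatewayv1.ObjectName(gatewayClass.Name) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: gateway.Namespace,
                    Name:      gateway.Name,
                },
            })
        }
    }
    return recs, nil
}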
Hope that clears up what is (yet another) angle on this, from a user's PoV.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Preface
This relates to a conversation I brought up at the community meeting on 2022-09-08; @camilamacedo86 and @jmrodri suggested I create an issue so we can discuss it further.
Background
Currently, controller-runtime provides tools with which it's possible to express a Watch() on one source.Kind that enqueues objects of another kind by some arbitrary relation.

Ideally you use metav1.ObjectReference when dealing with relationships between objects; however, it can sometimes be the case that you need to use the API client or some other utility that may produce an error in order to map the related objects. In these cases there are effectively no provisions for handling error conditions.
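A minimal sketch of the kind of watch being described (customapi.Tree, customapi.Fruit, and mapFruitsFromTree are the illustrative names used elsewhere in this thread, and the Watches signature shown is the newer one that takes the watched object directly):

// Fruit objects are reconciled whenever a related Tree changes; the
// relationship is resolved by a map func that has to call the API client
// and can therefore fail.
err := ctrl.NewControllerManagedBy(mgr).
    For(&customapi.Fruit{}).
    Watches(
        &customapi.Tree{},
        handler.EnqueueRequestsFromMapFunc(r.mapFruitsFromTree),
    ).
    Complete(r)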
This kind of implementation and use case is not uncommon. I was able to find several examples of notable open source projects which are using it this way:
Problem
The EnqueueRequestsFromMapFunc generator for EventHandlers has no concept of handling errors that may occur while making the mapping. As such, in the example above and in all the linked projects which are doing this, the process is currently "lossy": that is to say, the machinery can effectively end up dropping an enqueued object from the queue if a failure occurs during the mapping process.

For a specific example using the Tree and Fruit sample above: if the object listing produces an error (e.g. performed at a time when the API server is having trouble and the cache can't be used), this failure means that the Tree object is consumed from the queue but the related objects don't get enqueued. This would cause the resource state to become "stuck", and it won't heal without adding some other mechanism to re-queue it or some unrelated action to re-trigger it.

Exploring Solutions
The purpose of this issue is to ask for community guidance on an existing or new provision to support this kind of mapping in a manner that is fault tolerant and can heal from an error.
Some of the workarounds that I have conceived of are:
- A source.Channel where objects are re-enqueued in an implementation-specific manner (clunky)

So far I think the third option there might be the most effective strategy with what we have today, but ideally what I would like to see is the ability to cleanly trigger a re-queue from within the mapping function as part of the function return. I would very much appreciate community feedback on this.