kubernetes-sigs / controller-runtime

Repo for the controller-runtime subproject of kubebuilder (sig-apimachinery)
Apache License 2.0

Lossy map functions #1996

Closed: shaneutt closed this issue 2 days ago

shaneutt commented 2 years ago

Preface

This relates to a conversation I brought up at the community meeting on 2022-09-08, @camilamacedo86 and @jmrodri suggested I create an issue so we can discuss it further.

Background

Currently in controller-runtime we have the following tools:

type MapFunc func(client.Object) []reconcile.Request

func EnqueueRequestsFromMapFunc(fn MapFunc) EventHandler {
    return &enqueueRequestsFromMapFunc{
        toRequests: fn,
    }
}

With these it's possible to express a Watch() on one source.Kind which enqueues objects of another kind by some arbitrary relation:

ctrl.NewControllerManagedBy(mgr).Watches(
    &source.Kind{Type: &customapi.Tree{}},
    handler.EnqueueRequestsFromMapFunc(r.mapFruitsFromTree),
)

Now, ideally you would use metav1.ObjectReference when dealing with relationships between objects; however, it can sometimes be the case that you need to use the API client or some other utility that may produce an error while mapping the related objects, e.g.:

func (r *myReconciler) mapFruitsFromTree(tree client.Object) []reconcile.Request {
    var fruits customapi.FruitList
    if err := r.client.List(context.Background(), &fruits); err != nil {
        r.log.WithError(err).Error("error listing fruits")
        return nil
    }

    var req []reconcile.Request
    for _, fruit := range fruits.Items {
        if isRelated(tree, fruit) {
            req = append(req, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: fruit.Namespace,
                    Name:      fruit.Name,
                },
            })
        }
    }

    return req
}

In these cases it would appear there is effectively no provision for handling error conditions.

This kind of implementation and use case is not uncommon. I was able to find several examples of notable open source projects which are using it this way:

Problem

The EnqueueRequestsFromMapFunc generator for EventHandlers has no concept of handling errors that may occur during the mapping. As such, in the example above and in all the linked projects doing this, the process is currently "lossy": that is to say, the machinery can effectively end up dropping an enqueued object from the queue if a failure occurs during the mapping process.

For a specific example using the Tree and Fruit sample above: if the object listing produces an error (e.g. it is performed at a time when the API server is having trouble and the cache can't be used), the Tree object is consumed from the queue but the related objects don't get enqueued. This would cause the resource state to become "stuck", and it won't heal without adding some other mechanism to re-queue it, or some unrelated action to re-trigger it.

Exploring Solutions

The purpose of this issue is to ask for community guidance on an existing or new provision to support this kind of mapping in a manner that is fault tolerant and can heal from an error.

Some of the workarounds that I have conceived of are:

So far I think the third option there might be the most effective strategy with what we have today, but ideally what I would like to see is the ability to cleanly trigger a re-queue from within the mapping function as part of the function return. I would very much appreciate community feedback on this.

alvaroaleman commented 2 years ago

Yeah, this is a common-ish problem and something that controller-runtime doesn't help with. On a conceptual level, you likely want the handler itself to be a controller (i.e. have a workqueue and a reconciler that can fail, resulting in retries with exponential backoff).

Some time ago I built something like this that uses a source.Channel to pass reconcile.Requests around: https://github.com/openshift/ci-tools/blob/946582d9de6c25b241fcaa7ffe9f1fe9ff2921c8/pkg/controller/promotionreconciler/reconciler.go#L173

Maybe the easiest would be to build an EnqueueRequestsFromFallibleMapFunc(func(client.Object) ([]reconcile.Request, error)) that internally constructs a controller whose Reconciler just calls the mapfunc until it succeeds?

shaneutt commented 2 years ago

Thanks for your reply @alvaroaleman.

Maybe the easiest would be to build an EnqueueRequestsFromFallibleMapFunc(func(client.Object) ([]reconcile.Request, error)) that internally constructs a controller whose Reconciler just calls the mapfunc until it succeeds?

My understanding of what you're suggesting is that EnqueueRequestsFromFallibleMapFunc would be an alternative map creator in handler, and under the hood it would simply call AddRateLimited(obj) on the rate-limited workqueue any time an error occurs, to keep enqueuing that object until the errors go away. This is pretty much the kind of thing I had in mind, so I'm in favor, but is that accurate, or was there more to it than that? If that ends up being accurate and there's consensus amongst maintainers that this is something that would be accepted, I myself or one of my colleagues would like to start work on this.

pmalek commented 2 years ago

Hi đź‘‹

As I think about this problem more and more, I believe one more option would be to either introduce an additional API that accounts for errors and allows additional retry logic, something like:

ctrl.NewControllerManagedBy(mgr).WatchesWithRetry(
    &source.Kind{Type: &customapi.Tree{}},
    handler.EnqueueRequestsFromMapFuncWithError(r.mapFruitsFromTree),
    ctrl.Retry{
        Max: 3,
        RetryTime: 3 * time.Second,
        // Or even more complex logic
    },
)

(not ideal: an additional method to implement, document, test, and propagate to users) OR make an adapter (or adapters) which would encapsulate the retry logic, wrapping the calls to EnqueueRequestsFromMapFunc (basically EnqueueRequestsFromFallibleMapFunc with logic to stop if particular conditions are met).
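
For illustration, here is a minimal sketch of what such an adapter could look like, assuming the pre-ctx MapFunc signature; FallibleMapFunc and RetryingMapFunc are hypothetical names, not controller-runtime API, and the result is still lossy once the backoff is exhausted:

package example

import (
    "k8s.io/client-go/util/retry"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// FallibleMapFunc is the error-returning variant discussed in this thread.
type FallibleMapFunc func(client.Object) ([]reconcile.Request, error)

// RetryingMapFunc adapts a FallibleMapFunc to the existing MapFunc shape by
// retrying every error in place with client-go's default backoff.
func RetryingMapFunc(fn FallibleMapFunc) handler.MapFunc {
    return func(obj client.Object) []reconcile.Request {
        var reqs []reconcile.Request
        err := retry.OnError(
            retry.DefaultBackoff,
            func(error) bool { return true }, // treat every error as retriable
            func() error {
                var innerErr error
                reqs, innerErr = fn(obj)
                return innerErr
            },
        )
        if err != nil {
            return nil // still lossy once the retries are exhausted
        }
        return reqs
    }
}

It would be used as handler.EnqueueRequestsFromMapFunc(RetryingMapFunc(r.mapFruitsFromTree)) once that map function is changed to return an error, with the obvious caveat that it blocks the event handler while retrying.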

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

shaneutt commented 1 year ago

/remove-lifecycle rotten

@alvaroaleman are you still on board with the solution you and I discussed some months prior? Is anyone opposed to us going ahead and fixing this as we described above?

vincepri commented 1 year ago

@shaneutt Open to learning a bit more about what you have in mind. We've recently added a ctx (only on the main branch) as the first parameter in the MapFunc, although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors?

shaneutt commented 1 year ago

@shaneutt Open to learning a bit more about what you have in mind. We've recently added a ctx (only on the main branch) as the first parameter in the MapFunc

I can see the ctx being helpful in some situations for sure :+1:

although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors?

Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun.

After thinking more about how this could go wrong, and considering that a ctx will be available going forward, I'm actually starting to think that what we need is a mechanism to enable re-queuing. That would make the downstream developer responsible for handling the errors instead of controller-runtime, while still giving them a way to retry or re-queue the object when problems arise during mapping. It would be more explicit, and whether or not the reason is an error would be their prerogative. Perhaps something where:

func (r *myReconciler) mapFruitsFromTree(tree client.Object) (recs []reconcile.Request) {
    var fruits customapi.FruitList
    if err := r.client.List(context.Background(), &fruits); err != nil {
        r.log.WithError(err).Error("error listing fruits")
        return
    }

    for _, fruit := range fruits.Items {
        if isRelated(tree, fruit) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: fruit.Namespace,
                    Name:      fruit.Name,
                },
            })
        }
    }

    return
}

becomes:

func (r *myReconciler) mapFruitsFromTree(tree client.Object) (requeue bool, recs []reconcile.Request) {
    var fruits customapi.FruitList
    if err := r.client.List(context.Background(), &fruits); err != nil {
        r.log.WithError(err).Error("error listing fruits")
        requeue = true
        return
    }

    for _, fruit := range fruits.Items {
        if isRelated(tree, fruit) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: fruit.Namespace,
                    Name:      fruit.Name,
                },
            })
        }
    }

    return
}

Curious as to your thoughts on that :thinking:

nathanperkins commented 1 year ago

I like a combination of adding the ctx and the requeue bool.

We are also running into this issue where we need to list objects in our mapping function. If we wait on context.Background() forever, that's going to block the controller (somewhere? not sure). If we add a timeout and log the error, our system is going to end up in an inconsistent state.

nathanperkins commented 1 year ago

@shaneutt

Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun.

If the client is returning errors consistently, then it's not like the controller would be able to make progress otherwise.

Can you explain the footgun more? If these mapper functions need to work correctly to keep the system consistent, then infinite retry seems necessary.

It's worth mentioning that the current implementation is a footgun right now because the common usage results in dropped events and there's not really a way to use it correctly.

shaneutt commented 1 year ago

Been a while since I wrote that but I think I was just a little cautious of weird ways people might use this. Ultimately however I agree with you that we should move forward with this.

nathanperkins commented 1 year ago

@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued? That would make the behavior similar to Reconcile. It's a bit more ergonomic than having to create a logger, log your error, and return a bool.
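
Roughly, a sketch of the hypothetical shape (not current controller-runtime API), mirroring how Reconcile already pairs its result with an error:

// hypothetical fallible variant, analogous to
// Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error)
type MapFunc func(ctx context.Context, obj client.Object) ([]reconcile.Request, error)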

shaneutt commented 1 year ago

That seems entirely reasonable :+1:

pmalek commented 1 year ago

@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued? That would make the behavior similar to Reconcile. It's a bit more ergonomic than having to create a logger, log your error, and return a bool.

I like this idea as well. One note though: I'd consider a separate type just for the sake of separation of concerns and not coupling this with reconciliation. WDYT?

nathanperkins commented 1 year ago

I put []reconcile.Request because that's the current type. I don't have a strong opinion on changing it other than we should probably keep it the same unless there is a good reason to change.

alvaroaleman commented 1 year ago

@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued?

That isn't possible. We don't know what to requeue when the map function failed, as its purpose is to tell us that :) The only thing we could do is have a handler implementation that internally is a controller and thus has its own workqueue as suggested in https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1241263216

nathanperkins commented 1 year ago

That isn't possible. We don't know what to requeue when the map function failed, as its purpose is to tell us that :)

Sorry for imprecise language. I mean requeuing to call the map function itself again. That sounds like what https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1241263216 is describing.

nathanperkins commented 1 year ago

One thing I noticed today. The newer function signature in v0.15.0 accepts a context.Context:

type MapFunc func(context.Context, client.Object) []reconcile.Request

Any function which accepts a context is inherently fallible or lossy. It doesn't make sense to have a context without returning an error, so an error should probably be added to the existing MapFunc definition, rather than creating a separate fallible MapFunc type.

If we want to have an infallible MapFunc which cannot error, it should not have context.Context.

seh commented 1 year ago

The caller can still check (Context).Err, though it's possible that the return value could change in between the called function returning and when the caller checks whether (Context).Err now returns non-nil.

I take it you're assuming that the called function needs a way to indicate that it had to give up due to the Context before it could complete all of its intended work.

nathanperkins commented 1 year ago

You could do that, but it goes against general Go practices. A function that can cause an error should handle it locally or return it up the stack so it can be handled or logged higher up. Expecting the calling function to check (Context).Err rather than an error is pretty unusual.

It's possible that the return value could change in between the called function returning and when the caller checks whether (Context).Err now returns non-nil.

I suppose even with this race condition, it would be valid for the caller to ignore whatever results came from MapFunc because the caller will fail due to the ctx error. You could get away with this from a technical perspective.

It's still odd. Pretty much any function which the MapFunc could pass that context into will return its own error, which should be passed up the stack. The common usage of that context is for client functions, which can fail.

seh commented 1 year ago

I'm familiar with Context's motivations and behavior, and I don't see any relationship between it and the fallibility of functions that use it, other than that it introduces another reason why such functions can "fail"—namely cooperative cancellation.

Functions accept a Context either as a means of them agreeing to be interrupted—because they do something that would block long enough to warrant respecting a request for interruption—or, less pertinently, because they need to read or bind values with dynamic extent.

vincepri commented 1 year ago

+1 from my side to add a return error; this would be a breaking change, though. Before doing so, we need to nail down what the behavior is going to look like and how configurable it should be.

nathanperkins commented 1 year ago

Functions accept a Context either as a means of them agreeing to be interrupted—because they do something that would block long enough to warrant respecting a request for interruption—or, less pertinently, because they need to read or bind values with dynamic extent.

Even when you need to read or bind values with dynamic extent, that is something that can fail and probably needs error handling (like with logger.FromContext(ctx)).

It may not be true in every case: you could handle the missing values by providing a default or it may be reasonable to panic if your code is structured in a way that it should never happen. But, this map func API can't make such assumptions about how it is going to be used; it needs to support standard error handling.
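
As a concrete illustration, here is a small sketch using github.com/go-logr/logr, whose own context helper returns an error when no logger is stored in the ctx (mapWithLoggerFromCtx is a hypothetical helper, not an existing API):

package example

import (
    "context"

    "github.com/go-logr/logr"
)

// mapWithLoggerFromCtx shows that even fetching the logger from the ctx is a
// fallible operation.
func mapWithLoggerFromCtx(ctx context.Context) {
    logger, err := logr.FromContext(ctx)
    if err != nil {
        // No logger is stored in the ctx; with the current MapFunc signature
        // this error could only be swallowed (or panicked on), never returned.
        logger = logr.Discard()
    }
    logger.Info("mapping related objects")
}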

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

pmalek commented 7 months ago

/remove-lifecycle rotten

camilamacedo86 commented 5 months ago

By reading this one, I understand that the change that is missing / conveyed here is just to allow the API to return an error: https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1655869605

Then, by looking at the problem:

Now, ideally you would use metav1.ObjectReference when dealing with relationships between objects; however, it can sometimes be the case that you need to use the API client or some other utility that may produce an error while mapping the related objects,

What it seems you are trying to do is (unless I misunderstand it): I want to watch and trigger the reconcile for a CRD B in the controller for my CRD A. That means my CRD A, in certain conditions, requires / depends on my CRD B.

I have an old sample project that does that. You can see here that, in a controller, I am watching my other Kind in the same way that I watch the Services etc. created and owned by it.

Then, it seems to me that it should be done as you do with any core API, using the refs, so this problem would not be faced. Alternatively, you can trigger any other reconciliation directly as suggested in: https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-1586257207 .

pmalek commented 5 months ago

The referenced controller relies on the owner relationship (https://github.com/dev4devs-com/postgresql-operator/blob/c71c742743270ded079188ada7ced2132b7c820f/pkg/service/watches.go#L13), which is not always available.

For instance: we implement Gateway API's Gateway controller and in that controller (as one example) we want to check for changes in GatewayClass so that the Gateways using that class can be enqueued. Something like this:

func (r *Reconciler) Setup() {

...
        Watches(
            &gatewayv1.GatewayClass{},
            handler.EnqueueRequestsFromMapFunc(r.listGatewaysForGatewayClass),
            builder.WithPredicates(predicate.NewPredicateFuncs(GatewayClassMatchesController))).

...

}

....

func GatewayClassMatchesController(obj client.Object) bool {
    gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
    if !ok {
        return false
    }

    return string(gatewayClass.Spec.ControllerName) == vars.ControllerName()
}

...

func (r *Reconciler) listGatewaysForGatewayClass(ctx context.Context, obj client.Object) (recs []reconcile.Request) {
    gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
    if !ok {
        return
    }

    gateways := new(gatewayv1.GatewayList)
    if err := r.Client.List(ctx, gateways); err != nil {
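        // NOTE: the List error is silently dropped here, which is exactly the
        // lossy behavior described in this issue: the GatewayClass event is
        // consumed but no Gateways get enqueued.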
        return
    }

    for _, gateway := range gateways.Items {
        if gateway.Spec.GatewayClassName == gatewayv1.ObjectName(gatewayClass.Name) {
            recs = append(recs, reconcile.Request{
                NamespacedName: types.NamespacedName{
                    Namespace: gateway.Namespace,
                    Name:      gateway.Name,
                },
            })
        }
    }

    return
}

There is no owner relationship between the two, so we cannot use handler.TypedEnqueueRequestForOwner or similar. Neither the predicate type nor the handler's enqueue map func allows handling requeues; as a matter of fact, https://github.com/kubernetes-sigs/controller-runtime/blob/f9d51ba63e205155a0d25efff911673ea201426c/pkg/handler/enqueue_owner.go also does not allow requeuing.

Hope that clears up what is (yet another) angle on this, from a user's PoV.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 days ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 days ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/controller-runtime/issues/1996#issuecomment-2453424993):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.