tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

The pipeline resource is fetched via client vs. lister in resolver #2740

Closed mattmoor closed 4 years ago

mattmoor commented 4 years ago

Expected Behavior

Generally best practice dictates that "gets" in the Reconcile loop are done from the informer cache so that when the reconciliation is a nop (e.g. a resync on a done task) there is zero traffic to the API server.

Actual Behavior

I found the following in the pipeline resolver:

// GetPipeline will resolve a Pipeline from the local cluster using a versioned Tekton client. It will
// return an error if it can't find an appropriate Pipeline for any reason.
func (l *LocalPipelineRefResolver) GetPipeline(name string) (v1beta1.PipelineInterface, error) {
    // If we are going to resolve this reference locally, we need a namespace scope.
    if l.Namespace == "" {
        return nil, fmt.Errorf("Must specify namespace to resolve reference to pipeline %s", name)
    }
    return l.Tektonclient.TektonV1beta1().Pipelines(l.Namespace).Get(name, metav1.GetOptions{})
}

This explicit read through the client happens on every pipeline run reconciliation, even for done pipeline runs, which should be as lightweight as possible.

This logic is in github.com/tektoncd/pipeline/pkg/reconciler/pipelinerun/resources/pipelineref.go
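For contrast, a cache-backed lookup would read from the informer lister instead of the client. This is only a sketch, assuming a generated PipelineLister were plumbed into the resolver (it is not today); the field name and the simplified return type are illustrative:

// Sketch: resolve from the informer cache rather than the API server.
// Assumes the resolver is handed a lister from the generated Tekton informers.
func (l *LocalPipelineRefResolver) GetPipeline(name string) (*v1beta1.Pipeline, error) {
    if l.Namespace == "" {
        return nil, fmt.Errorf("must specify namespace to resolve reference to pipeline %s", name)
    }
    // The lister reads from the shared informer's local cache, so a resync of a
    // completed PipelineRun generates no traffic to the API server.
    return l.PipelineLister.Pipelines(l.Namespace).Get(name)
}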

I found it by adding the following check when playing with a test for idempotency:

    // Above there is an initial Reconcile...  ☝️

    // We want to check that no client actions happen.
    clients.Pipeline.ClearActions()

    err = c.Reconciler.Reconcile(context.Background(), "foo/bar")
    if err != nil {
        t.Errorf("Did not expect to see error when reconciling completed PipelineRun but saw %s", err)
    }
    t.Log(clients.Pipeline.Actions())
    if got, want := len(clients.Pipeline.Actions()), 0; got != want {
        t.Errorf("len(pipeline actions) = %d, wanted %d", got, want)
    }

/kind bug

afrittoli commented 4 years ago

Oh, nice catch, thank you!

imjasonh commented 4 years ago

Is there some kind of unit test we could write to try to detect and prevent usage of the client to Get? We have it available to be able to Create and UpdateStatus. Maybe we could wrap it in an interface-compliant struct that fails loudly on Get?
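One possible shape for "fail loudly", sketched with the fake clientset's reactor chain rather than a full wrapper struct (assumes the standard generated fake at pkg/client/clientset/versioned/fake; the helper name is made up):

import (
    "fmt"
    "testing"

    "k8s.io/apimachinery/pkg/runtime"
    ktesting "k8s.io/client-go/testing"

    fakeclientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned/fake"
)

// failOnGet makes any "get" issued through the fake Tekton client fail the
// test loudly, so reads that should go through the lister are caught.
func failOnGet(t *testing.T, c *fakeclientset.Clientset) {
    c.PrependReactor("get", "*", func(a ktesting.Action) (bool, runtime.Object, error) {
        t.Errorf("unexpected client Get during reconcile: %v", a)
        return true, nil, fmt.Errorf("reads should come from the informer cache")
    })
}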

bobcatfish commented 4 years ago

I'm under the impression that we are deeply suspicious that relying on the informer instead of doing a get could result in errors due to stale data. I'd love to get to the bottom of this and learn that we actually can rely on the listers and informers.

In this PR that was closed you can see us debating and not really knowing: https://github.com/tektoncd/pipeline/pull/1825

Is there somewhere we're using a lister in the reconciler to retrieve a Task/Pipeline, where we should be using a direct client to get them, bypassing caching? We've seen listers being slow to notice new objects in the past, and have (mostly?) replaced them with regular k8s clients as a result.

I don't think we really ever got to the bottom of whether or not relying on listers could cause us problems, but it does seem like it conceivably could.

With PipelineRuns and TaskRuns, if the resources they depend on do not exist, we want them to fail, unlike other k8s types which can tolerate those types not existing and remain pending until they exist. If we rely on caches to retrieve resources needed by PipelineRuns and TaskRuns, then you could imagine that if we create a Pipeline and a PipelineRun that references it simultaneously, the Pipeline might not be in the cache and the PipelineRun could fail. Or at least conceptually this seems like it makes sense!! Extremely happy to learn this is not the case :D

afrittoli commented 4 years ago

@bobcatfish You make a very good point, thank you!

Since we have no mechanism to cache fetched resources, today we fetch them on every reconcile cycle - which is quite expensive (see https://github.com/tektoncd/pipeline/pull/2690).

I think there are a couple of possible solutions to this issue:

a. When we fetch resources, try the informer lister first. If that fails, try with the client.
b. Treat a failure to fetch a resource as a transient error. The key will be re-queued, and at the next reconcile cycle the informer cache should be updated. This has the advantage that we would never use the client directly, but we would need to make sure we track how many times we try to fetch a resource.

Option (b) feels like the more natural one to me, but it might degrade the start time when a Task and a TaskRun are applied together from the same file. For instance (@skaegi knows this setup better), at IBM Cloud we use ephemeral namespaces to run PipelineRuns, and all resources are provisioned in there right before they are executed.
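A rough sketch of what option (a) could look like in the resolver, assuming both a lister and the versioned client were available on it (names are illustrative; apierrors is k8s.io/apimachinery/pkg/api/errors, and the namespace check is omitted for brevity):

// Sketch of option (a): prefer the informer cache, fall back to one live read.
func (l *LocalPipelineRefResolver) getPipeline(name string) (*v1beta1.Pipeline, error) {
    p, err := l.PipelineLister.Pipelines(l.Namespace).Get(name)
    if err == nil {
        return p, nil
    }
    if !apierrors.IsNotFound(err) {
        return nil, err
    }
    // Cache miss: the Pipeline may exist but not have been observed yet, so do
    // a single direct read against the API server before giving up.
    return l.Tektonclient.TektonV1beta1().Pipelines(l.Namespace).Get(name, metav1.GetOptions{})
}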

vdemeester commented 4 years ago

I think we were also using this because we were using Update and that would not be reflected in the informer cache (whereas using Patch does) (see https://github.com/tektoncd/pipeline/issues/2729#issuecomment-637959273) - @mattmoor correct me if I am wrong.

I think there are a couple of possible solutions to this issue:

a. When we fetch resources, try the informer lister first. If that fails, try with the client.
b. Treat a failure to fetch a resource as a transient error. The key will be re-queued, and at the next reconcile cycle the informer cache should be updated. This has the advantage that we would never use the client directly, but we would need to make sure we track how many times we try to fetch a resource.

Yep, (b) also feels more natural to me.

mattmoor commented 4 years ago

Is there some kind of unit test we could write to try to detect and prevent usage of the client to Get?

Yes, you can look for GetAction on the fake clients. I'd take a hard look at List too.
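For example, building on the idempotency check above, something along these lines would catch both verbs (a sketch; assumes clients.Pipeline is the fake Tekton clientset and the completed run has already been reconciled once):

// After reconciling the completed PipelineRun a second time, assert that no
// reads went through the client at all.
for _, a := range clients.Pipeline.Actions() {
    if a.GetVerb() == "get" || a.GetVerb() == "list" {
        t.Errorf("unexpected %s on %s during resync", a.GetVerb(), a.GetResource().Resource)
    }
}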

relying on the informer instead of doing a get could result in errors due to stale data

Yes, using the informer cache can result in stale reads. Yes, it may not yet have observed a resource when reconciling another resource created in the same kubectl apply.

I think a worse case is when it does exist, but a mutation may or may not be picked up, e.g.

kind: Task
...  # change something
---
kind: TaskRun
...

With Get, the above works reliably (TaskRun must be processed after Task is written to etcd), but what about this:

kind: TaskRun
...
---
kind: Task
...  # change something

Even without the use of an informer cache, the above will sometimes get the new Task, and sometimes get the old Task (or no task). "But nobody would write that!" Well, as-is... maybe. However, the above is the moral equivalent of:

kaniko.yaml:

kind: Task
...

build-dockerfile.yaml:

kind: TaskRun
... # use kaniko

When used with kubectl apply -f dir/ because b comes before k :)


I'm of two minds on this:

  1. Tekton has a legit reason to do a Get bypassing the cache to avoid stale reads.

  2. Tekton has made it an anti-pattern for users to create Task/Pipeline at the same time as their Run resources (it is relatively easy to hold wrong, so I'd frankly advise against it). So while I see the logic in 1. if you hold things right, it is still possible for folks to hold wrong. I'd go so far as to say that it is likely that some subset of users will hold it wrong, and some subset of those will get burned (right away). So 1. is optimizing for an anti-pattern at the expense of additional load on the API Server 🤔


(brace for huge digression)

The way we solve this in Knative is the Configuration / Revision duality. Configurations float forward over time, and stamp out Revisions: immutable snapshots of that Configuration (I like the parallel to Git branches and commits). In our Route resource, we let users point to the Configuration (YOLO, :latest, I'm feeling lucky), or we let folks point to a specific Revision of that Configuration. I used a similar trick for ConfigMaps in this PoC: https://github.com/mattmoor/boo-maps

I think the analog in Tekton would be (different terminology used for clarity, not trying to paint a shed):

TaskDefinition --(stamps out)--> TaskRevision

Then in a TaskRun I could reference either a TaskDefinition (YOLO, :latest, I'm feeling lucky) or a TaskRevision (I know what I want!). This also gives a better place to resolve tags to digests and an immutable snapshot to reference and avoid all that inlining 😬, but I digress.

A nuance here is: How do I reference a revision when it hasn't been created? Without that I can't update a TaskDefinition and create a TaskRun in the same kubectl apply! I need to look up the name 🤮. We have a mediocre solution for that as well, but I've digressed enough 😉

mattmoor commented 4 years ago

correct me if I am wrong

Not quite! The point with Patch was that it isn't a read-modify-write, so you avoid the stale read. The Patch also doesn't include the metadata.resourceVersion field, so it avoids optimistic concurrency checks.
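As an illustration of that difference (a sketch only; the pre-context Patch signature matches the era of the code quoted above, types is k8s.io/apimachinery/pkg/types, and the patch body is made up):

// A JSON merge patch carries only the fields being changed and no
// metadata.resourceVersion, so it needs no prior read and cannot trip the
// optimistic-concurrency check the way a read-modify-write Update can.
patch := []byte(`{"metadata":{"labels":{"tekton.dev/example":"value"}}}`)
_, err := tektonClient.TektonV1beta1().PipelineRuns(namespace).Patch(name, types.MergePatchType, patch)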

mattmoor commented 4 years ago

Yep, (b) feels also more natural for me.

@afrittoli @vdemeester We raced a little bit (Oh the irony 🤣), but see my longer response above. My bigger concern here is NOT "wasn't created yet", but updates that aren't reflected yet. It's right before my huge digression 😅

bobcatfish commented 4 years ago

b. Treat a failure to fetch a resource as a transient error. The key will be re-queued, and at the next reconcile cycle the informer cache should be updated. This has the advantage that we would never use the client directly, but we would need to make sure we track how many times we try to fetch a resource.

This raises the question of how many times we retry before failing: in the current design, as soon as this fails, the PipelineRun fails. If you start retrying and users specify the name of a resource that doesn't exist, the Run will wait indefinitely.

That sounds okay since it's how a lot of other Kubernetes resources work BUT I would say there is one big downside:

If the resource springs into existence later, the Run will suddenly execute. Imagine I create a PipelineRun "foo-1" for Pipeline "foo" but do not create "foo". Two days later I create "foo" - do I want "foo-1" to run at that point? Probably not.

I think a worse case is when it does exist, but a mutation may or not be picked up, e.g.

That's a good point. I'm not sure we need to optimize for the case where Tasks are mutated, especially as we are moving toward versioned Tasks, so I would avoid adding an entire new object to handle this case.

That actually brings up a good point though: maybe this concern is less important given we are moving to a model where we will likely prefer to reference Task and Pipelines in an OCI registry? https://github.com/tektoncd/pipeline/issues/1839

today we fetch them on every reconcile cycle - which is quite expensive (see #2690).

Can we put some numbers to this? I want to avoid optimizing until we know what the current state is and whether it's worth the cost.

afrittoli commented 4 years ago

b. Treat a failure to fetch a resource as a transient error. The key will be re-queued, and at the next reconcile cycle the informer cache should be updated. This has the advantage that we would never use the client directly, but we would need to make sure we track how many times we try to fetch a resource.

This would beg the question of how many times we retry before failing: in the current design, as soon as this fails, the PipelineRun fails. If you start retrying, if users specify the name of a resource that doesn't exist, the Run will wait indefinitely.

I think it would be fair to retry a fixed number of times, one or two, whatever is needed to ensure the cache is not stale. I don't think we need to support the case of TaskRuns created at a later time (outside of the initial kubectl apply).
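Sketched out, option (b) with a bounded retry might look roughly like this inside the PipelineRun reconciler (retriesSoFar and maxResolveRetries are hypothetical helpers for the attempt tracking; controller is knative.dev/pkg/controller):

p, err := pipelineLister.Pipelines(pr.Namespace).Get(pr.Spec.PipelineRef.Name)
if apierrors.IsNotFound(err) {
    if retriesSoFar(pr) < maxResolveRetries {
        // Transient: returning a plain error re-queues the key, and by the next
        // reconcile the informer cache should have caught up.
        return err
    }
    // Out of retries: a permanent error stops further requeues; the reconciler
    // would mark the run failed here.
    return controller.NewPermanentError(err)
}
if err != nil {
    return err
}
_ = p // resolved Pipeline used below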


afrittoli commented 4 years ago


I think the analog in Tekton would be (different terminology used for clarity, not trying to paint a shed):

TaskDefinition --(stamps out)--> TaskRevision

Heh, nice. In fact, one issue we have today is avoiding reconciling against a mutating Task/Pipeline. Ideally the reconcile should work on the same version of a Task/Pipeline/Condition/PipelineResource throughout the lifecycle of the matching Run resource.


bobcatfish commented 4 years ago

Ideally the reconcile should work on the same version of a Task/Pipeline/Condition/PipelineResource throughout the lifecycle of the matching Run resource

I'm not sure how far along we've gotten with this but afaik we're now storing the Pipeline/Task definition in the status of the Runs, so it would make sense to operate from that in the future and only retrieve the Pipeline/Task on the first reconcile.

Which actually might do a lot to address the efficiency concern as well?
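If so, the reconciler could be structured roughly like this (a sketch assuming the v1beta1 PipelineRun status really does carry the PipelineSpec field described above; resolvePipelineSpec is a hypothetical stand-in for the existing resolution code):

// Prefer the spec snapshotted on the run's status; only resolve on the first
// reconcile, so later resyncs need neither the lister nor the client.
func pipelineSpecFor(ctx context.Context, pr *v1beta1.PipelineRun) (*v1beta1.PipelineSpec, error) {
    if pr.Status.PipelineSpec != nil {
        return pr.Status.PipelineSpec, nil
    }
    spec, err := resolvePipelineSpec(ctx, pr) // hypothetical: first-reconcile lookup
    if err != nil {
        return nil, err
    }
    // Snapshot the spec so later edits to the Pipeline do not affect this run.
    pr.Status.PipelineSpec = spec
    return spec, nil
}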

bobcatfish commented 4 years ago

I think it would be fair to retry a fixed number of times, one or two, whatever is needed to ensure the cache is not stale.

We'll have to:

  1. start keeping track of how many times we have queried for each, maybe even backoff
  2. decide what arbitrary fixed # or amount of time to use

I'd like to be certain that this extra complication is worth saving the expense of using Get vs lister, esp. if we can reduce the number of Get calls required by only Getting on the first reconcile.

afrittoli commented 4 years ago

Ideally the reconcile should work on the same version of a Task/Pipeline/Condition/PipelineResource throughout the lifecycle of the matching Run resource

I'm not sure how far along we've gotten with this but afaik we're now storing the Pipeline/Task definition in the status of the Runs, so it would make sense to operate from that in the future and only retrieve the Pipeline/Task on the first reconcile.

Which actually might do a lot to address the efficiency concern as well?

Yeah, that would also fix the issue of a Task/Pipeline changing during reconcile.

vdemeester commented 4 years ago

@afrittoli @bobcatfish @mattmoor so, what is the status of this issue/bug? :upside_down_face:

mattmoor commented 4 years ago

If it is a one-off, it can be fine. However, making any API calls on steady-state reconciliations is, in principle, a serious problem for scaling controllers, which is the main motivation for always using the lister.

Generally: if you can do a global resync without any API calls, then 👍; if not, then you have problems 😬