tektoncd / triggers

Event triggering with Tekton!
Apache License 2.0
552 stars 416 forks source link

House keeper for Pipeline related stuff #64

Closed vincent-pli closed 3 years ago

vincent-pli commented 5 years ago

Expected Behavior

When event received, deployment of EventListener will create at least one PipelineResource and one Pipelinerun, then those stuff will get out of hand of Trigger. That’s caused a problem, after period, there will be tens of hundreds pipeline related stuff leaved.

So I think we need setup garbage disposal/house keeper mechanism for the situation. I saw https://github.com/tektoncd/triggers/issues/48 , but it need delete EventListener to clean all stuff. I think EventListener is design for long term running.

Actual Behavior

Currently, nothing for this

vtereso commented 5 years ago

I agree that resource build up could/would become cumbersome since we are generating resources based on events. Triggers is enabling users to create these resources simply, which only exacerbates this Tekton issue. Maybe an option could be added to the PipelineRuns/TaskRuns, where PipelineResources can be deleted after runs? In terms of handling arbitrary resource deletion, I don't know how much better we can do than EventListener metav1.OwnerReference-based removal (or some of the sort) or having some time based deletion.

vincent-pli commented 5 years ago

Could we add metav1.OwnerReference of the resources (except Pipelinerun) to a Pipelinerun. I believe there is at least one PipelineRun, right?

So the problem is when to delete the Pipelinerun

vtereso commented 5 years ago

PipelineResources are created ahead of time compared to PipelineRuns (since they do not support being embedded currently?), where likely only some should have owner references. If a resource doesn't have a specific SHA/revisions/tags, it's potentially reusable so deleting it might be bad? That aside, I guess it is possible to add a metav1.OwnerReference to referenced PipelineResources?

vtereso commented 5 years ago

This has also been discussed here: https://github.com/tektoncd/pipeline/issues/544

vincent-pli commented 5 years ago

Proposal:

  1. Design new kind:TriggerExecutor

    apiVersion: tekton.dev/v1alpha1
    kind: TriggerExecutor
    metadata:
    name: some-executor
    spec:
    pipelineruns []*corev1.ObjectReference

    TriggerExecutor will be create in controller of EventListener. and set Pipelinerun ref in it for all Pipelinerun created in this event received.

  2. Enhance EventTemplate

    type TriggerResourceTemplate struct {
        DeleteAfterComplete bool `json:"deleteAfterComplete,omitempty"`
    json.RawMessage `json:",inline"`
    }

    When creation, all resources in EventTemplate should:

    • Set owner to EventListener
    • If DeleteAfterComplete is true, set owner to TriggerExecutor.

In controller of TriggerExecutor, it will watch both TriggerExector and Pipelinerun

vtereso commented 5 years ago

I believe the next step being taken to address the resource proliferation is to embed PipelineResources within PipelineRuns, which will implicitly set the owner reference. I see here that job controllers, have taken a ttl approach to deletion as well. @bobcatfish seems like it would be relatively simple?

vincent-pli commented 5 years ago

Yes, but beside the embed case, pipeline also support original way. and we also need to handle arbitrary resource deletion.

This is a very draft solution, but I think we need consider the issue now, always welcome suggestion.

vtereso commented 5 years ago

For generic garbage collection, the Kubernetes documentation only seems to mention owner references and cron job based deletion.

vincent-pli commented 5 years ago

Yes, the upper proposal is based on the owner references. For the cron job, hmm, to be honest, is a choice, but not attractive...

ncskier commented 4 years ago

I agree that we don't want to use the EventListener as the OwnerReference, because the EventListener is supposed to persist through multiple triggerings of resource creation.

So, I think that one cleanup mechanism could be to give every triggered resource a label using the $(uid) or an "event id." Then when a user wants to delete all of these resources, they can delete all resources with the $(uid) or "event id" label.

Another weirder option which might be a little heavy would be to create a new Event CustomResource. This Event resource would be created with all of the triggered resources, and would be the owner of all the triggered resources. So, if a user wanted to clean up all of the resources, they could delete the Event resource.

@vincent-pli do you think either of these cleanup options are adequate?

vincent-pli commented 4 years ago

@ncskier Thanks for your comments.

Please take a look at my previous comments: https://github.com/tektoncd/triggers/issues/64#issuecomment-524796225 It's similar to your second option. I want to summary all the solutions here for further discussion:

  1. Add label to resources with the content of $(uid) or event id, then let user delete resource by label. It's simple for implements and use, but either $(uid) or event id is not good choice for label:

    • $(uid) it totally random, can not use regex for batch delete.
    • event id depends on event generator, maybe repeating.
  2. Set owner reference to EventListener The use case should be: user want to recreate the EventListener and want clear everything related to it. Its disadvantage is EventListener is design for long term running and support multiple triggerings of resource creation, cannot use it to runtime garbage collection.

  3. Create new crd, maybe named TriggerExecutor and make it as owner of all resource created in this event received. and deleted somewhen.

    • Manually
    • Monitor the status of Pipelinerun and delete TriggerExecutor if done.

I think the three options are not conflict, actually I think we can implement option 1 and 2 then document it. For option 3, to be honest, I like this one, but as @ncskier said, a little heavy, but could really resolve the house keeping problem. BTW, I think we can make it as option to let user to make decision.

vtereso commented 4 years ago

@vincent-pli Good point. One thing that may help however is that all labels cascade from the EventListener and as well as the additional label on the reconciled resources: labels["app"] = el.Name. This way, any of these labels could be used for deletion.

ncskier commented 4 years ago

Good point @vtereso, you could just use the label app=eventlistenername to delete all of the resources associated with that EventListener 👍

vincent-pli commented 4 years ago

@vtereso I think it's a good idea, I will update the #96 For the option 3, I think we need more further discussion.

imjasonh commented 4 years ago

I think we want to consider some solution that persists the state/history of runs off-cluster, at which point it's safe to delete it from the cluster (or mark it for eventual later deletion).

The system for persistent off-cluster storage/indexing isn't designed yet (https://github.com/tektoncd/pipeline/issues/453), but I believe it would solve this and many other problems in a CI/CD system, like tracking the success of a continuous build over time, searching for provenance of an image or deployment, etc.

This doesn't seem like a triggering-specific problem; PipelineRuns shouldn't be able to continuously accrete on a cluster indefinitely, regardless of whether they were triggered or manually run. I'd like to move this to the Pipelines repo since this is an issue with PipelineRuns/TaskRuns.

(However, that being said, if triggers are going to be capable of creating arbitrary K8s resources on-demand -- e.g., Jobs, ConfigMaps -- then triggering will have come up with some way of reaping those when it can determine they're unneeded. I don't think we can solve that the same way we can for PipelineRuns, since we know a lot more about the semantics of PipelineRuns.)

tekton-robot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.