tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Tekton Queue. Concurrency #5835

Open marniks7 opened 1 year ago

marniks7 commented 1 year ago

Feature request

We would like the ability to create many PipelineRuns at once, but execute them one by one (mostly), and sometimes concurrently.

Use case

Tekton can be used as a general-purpose pipeline / workflow engine for all sorts of activities.

Use case #1 - Chaos Engineering

Chaos Engineering - create PipelineRuns that each introduce some chaos, e.g. a deployment restart for each deployment present in the environment, and execute the PipelineRuns one after another. Sometimes concurrent execution may be desirable.

Use case #2 - Load Testing by a single person

Create the load-test PipelineRuns at the same time, but run them one after another.

Use case #3 - Load Testing by a few people

There could be a single environment for load testing but multiple people working on it. To control the load runs done by multiple people, each person can just create a PipelineRun, and it will be executed when the previous load run has finished. The alternative: ask in chat whether the server is free / available.

Solution (what we have)

What we have right now: all PipelineRuns are created in the PipelineRunPending state; this is done manually in the PipelineRun YAML. We considered using Kyverno for an automatic approach, but it does not support changing the spec.status field as of September 2022. We have implemented a custom Kubernetes controller that handles PipelineRuns with the label queue.tekton.dev/name and removes the PipelineRunPending state when the previous PipelineRun has finished, for example:

metadata:
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
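
For illustration (the pipeline name and queue value are made up), a queued run in this setup looks roughly like the following before the controller releases it:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: chaos-run-            # illustrative name
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
spec:
  pipelineRef:
    name: chaos-pipeline              # illustrative pipeline
  # created in Pending; the queue controller clears this field
  # once the previous run in the queue has finished
  status: PipelineRunPending
```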

It is also possible to search for PipelineRuns in all namespaces, e.g. in the case of a namespace per person:

metadata:
  annotations:
    queue.tekton.dev/scope: cluster
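
Combined with the queue label, a run created in a per-person namespace would then carry something like the following (namespace and values are illustrative):

```yaml
metadata:
  namespace: maria                    # illustrative per-person namespace
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
  annotations:
    queue.tekton.dev/scope: cluster   # queue is matched across namespaces
```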

We didn't implement the concurrency ability (e.g. executing 2 PipelineRuns from the same queue at once), simply because the top use cases don't need it, but we may implement it in the future.
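
If we do add it, one possible shape (purely hypothetical, not implemented) would be an extra annotation on the queued runs, e.g.:

```yaml
metadata:
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
  annotations:
    # hypothetical setting: allow up to 2 runs from this queue at once
    queue.tekton.dev/concurrency: "2"
```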

Other Notes

lbernick commented 1 year ago

Related: https://github.com/tektoncd/pipeline/issues/4903

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

vdemeester commented 1 year ago

/remove-lifecycle stale

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot commented 11 months ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot commented 11 months ago

@tekton-robot: Closing this issue.

In response to [this](https://github.com/tektoncd/pipeline/issues/5835#issuecomment-1773729581):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen` with a justification.
> Mark the issue as fresh with `/remove-lifecycle rotten` with a justification.
> If this issue should be exempted, mark the issue as frozen with `/lifecycle frozen` with a justification.
>
> /close
>
> Send feedback to [tektoncd/plumbing](https://github.com/tektoncd/plumbing).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

khrm commented 11 months ago

/remove-lifecycle rotten

khrm commented 11 months ago

/reopen

This is part of the Roadmap and quite important.

/lifecycle frozen

tekton-robot commented 11 months ago

@khrm: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/tektoncd/pipeline/issues/5835#issuecomment-1773784234):

> /reopen
>
> This is part of the Roadmap and quite important.
> /lifecycle frozen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
khrm commented 11 months ago

@vdemeester Please reopen this.

vdemeester commented 11 months ago

/lifecycle frozen

sibelius commented 5 months ago

queueing would be awesome

benoitschipper commented 5 months ago

This would be a great addition, so we could pool pipeline resources from multiple customers and just use queueing. The Pending state is not desirable, as timeouts might cause the pipelines to fail on Kubernetes clusters.

sibelius commented 5 months ago

Can we do this using resource requests and limits?

benoitschipper commented 5 months ago

> Can we do this using resource requests and limits?

Yeah, you can use requests and limits for this. The problem is that if there are no resources left to run a pipeline, it goes into "Pending", which eventually times out, meaning people will get pipeline failures.
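
For example (task name, image, and amounts are illustrative), per-step requests and limits in a v1beta1 Task look like this; in the v1 API the field is called computeResources:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: load-test                     # illustrative
spec:
  steps:
    - name: run-load
      image: alpine                   # illustrative
      script: |
        echo "running load test"
      resources:                      # per-step requests/limits
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```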

If there were something like a queueing system, it would instead never time out and just wait until resources become available. This obviously only matters during busy periods on the cluster. Hence the request for something like a queueing system.

I also found that it is possible to do something like a lease, but that needed additional self-made resources. I would like to turn this into a solution for our DevOps teams.

We currently give each DevOps team a certain amount of resources to perform pipeline-related tasks within a namespace. But that means a lot of resources are potentially wasted when some DevOps teams are not running their pipelines on certain days. We want to instead pool all the resources for our DevOps teams, so we can reserve some nodes for pipeline-related runtimes, share the resources, and utilize some sort of queueing. It's all about efficiency and effectiveness :)
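
For context (names and amounts are made up), that per-team budget is usually expressed as a Kubernetes ResourceQuota in the team's namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pipeline-quota                # illustrative
  namespace: team-a                   # illustrative team namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```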

Hope that makes sense :)

sibelius commented 5 months ago

Why does PENDING time out?

benoitschipper commented 5 months ago

> Why does PENDING time out?

Due to a lack of compute capacity within a set quota, or of overall compute capacity on the cluster, the Kubernetes scheduler is unable to schedule the pod on any node with the requested CPU/memory/storage, which makes it go into Pending.

https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/#my-pod-stays-pending

I think it might also have something to do with the default timeout. From searching the web and the Tekton docs, the relevant topics are the reasons for the Pending state, Tekton's own timeouts, how to customize them, how to find the specific timeout values, and some important considerations.
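
For reference (values are illustrative), the run-level timeout that eventually fails a long-waiting run can be raised per PipelineRun, or cluster-wide via the config-defaults ConfigMap:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: load-test-run                 # illustrative
spec:
  pipelineRef:
    name: load-test                   # illustrative
  timeouts:
    pipeline: "4h0m0s"                # overall run timeout; the default is 1h
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-timeout-minutes: "240"      # cluster-wide default for new runs
```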

sibelius commented 3 months ago

Can we increase the PENDING timeout?

Or put these tasks in a "queue"?

Like the CouldNotGetTask error.

benoitschipper commented 3 months ago

> Can we increase the PENDING timeout?
>
> Or put these tasks in a "queue"?
>
> Like the CouldNotGetTask error.

A queueing mechanism in Tekton would be great, but this is the thread for that feature request, so it is not a possibility as of yet.

I am not sure whether you can increase the "PENDING" duration of all of your pods on the cluster, but that would be a workaround; I am not sure there is a setting for it. Maybe setting terminationGracePeriodSeconds is an option, or a command that extends the life of a pod with a sleep 30, but this is not a great solution.

The best option would be a queueing mechanism for Tekton, with some settings that allow you to manage the queue 🙂