tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Tekton Queue. Concurrency #5835

Open marniks7 opened 1 year ago

marniks7 commented 1 year ago

Feature request

We would like the ability to create many PipelineRuns at once, but execute them one by one (mostly), and sometimes concurrently.

Use case

Tekton can be used as a general-purpose pipeline / workflow engine for all sorts of activities.

Use case #1 - Chaos Engineering

Chaos Engineering - create PipelineRuns that each introduce some chaos, e.g. a deployment restart for each deployment present in the environment, and execute the PipelineRuns one after another. Sometimes concurrent execution may be desirable.

Use case #2 - Load Testing by a single person

Create the load-test PipelineRuns at the same time, but run them one after another.

Use case #3 - Load Testing by a few people

There could be a single environment for load testing but multiple people working on it. To control the load runs done by multiple people, each person can just create a PipelineRun, and it will be executed when the previous load run has finished. The alternative: ask in chat whether the server is free / available.

Solution (what we have)

What we have right now: all PipelineRuns are created in the PipelineRunPending state; this is done manually in the PipelineRun YAML. We considered using Kyverno for an automatic approach, but it does not support changing the spec.status field as of September 2022. We have implemented a custom Kubernetes controller that handles PipelineRuns with the label queue.tekton.dev/name and removes the PipelineRunPending state when the previous PipelineRun has finished, for example:

metadata:
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
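
For illustration (the pipeline name and queue value are made up), a queued run in this setup looks roughly like the following before the controller releases it:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: chaos-run-            # illustrative name
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
spec:
  pipelineRef:
    name: chaos-pipeline              # illustrative pipeline
  # created in Pending; the queue controller clears this field
  # once the previous run in the queue has finished
  status: PipelineRunPending
```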

It is also possible to search for PipelineRuns in all namespaces, e.g. in the case of a namespace per person:

metadata:
  annotations:
    queue.tekton.dev/scope: cluster
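
Combined with the queue label, a run created in a per-person namespace would then carry something like the following (namespace and values are illustrative):

```yaml
metadata:
  namespace: maria                    # illustrative per-person namespace
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
  annotations:
    queue.tekton.dev/scope: cluster   # queue is matched across namespaces
```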

We didn't implement the concurrency ability (e.g. executing 2 PipelineRuns from the same queue at once), simply because the top use cases don't need it, but we may implement it in the future.
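
If we do add it, one possible shape (purely hypothetical, not implemented) would be an extra annotation on the queued runs, e.g.:

```yaml
metadata:
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
  annotations:
    # hypothetical setting: allow up to 2 runs from this queue at once
    queue.tekton.dev/concurrency: "2"
```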

Other Notes

lbernick commented 1 year ago

Related: https://github.com/tektoncd/pipeline/issues/4903

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

vdemeester commented 1 year ago

/remove-lifecycle stale

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot commented 11 months ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot commented 11 months ago

@tekton-robot: Closing this issue.

In response to [this](https://github.com/tektoncd/pipeline/issues/5835#issuecomment-1773729581):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen` with a justification.
> Mark the issue as fresh with `/remove-lifecycle rotten` with a justification.
> If this issue should be exempted, mark the issue as frozen with `/lifecycle frozen` with a justification.
>
> /close
>
> Send feedback to [tektoncd/plumbing](https://github.com/tektoncd/plumbing).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

khrm commented 11 months ago

/remove-lifecycle rotten

khrm commented 11 months ago

/reopen

This is part of the Roadmap and quite important.

/lifecycle frozen

tekton-robot commented 11 months ago

@khrm: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/tektoncd/pipeline/issues/5835#issuecomment-1773784234):

> /reopen
>
> This is part of the Roadmap and quite important.
> /lifecycle frozen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
khrm commented 11 months ago

@vdemeester Please reopen this.

vdemeester commented 11 months ago

/lifecycle frozen

sibelius commented 5 months ago

queueing would be awesome

benoitschipper commented 5 months ago

This would be a great addition, so we could pool pipeline resources from multiple customers and just use queueing. The Pending state is not desirable, as timeouts might cause the pipelines to fail on Kubernetes clusters.

sibelius commented 5 months ago

Can we do this using resource requests and limits?

benoitschipper commented 5 months ago

> Can we do this using resource requests and limits?

Yeah, you can use requests and limits for this. The problem is that if there are no resources left to run a pipeline, it goes into "Pending", which eventually times out, meaning people will get pipeline failures.
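
For example (task name, image, and amounts are illustrative), per-step requests and limits in a v1beta1 Task look like this; in the v1 API the field is called computeResources:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: load-test                     # illustrative
spec:
  steps:
    - name: run-load
      image: alpine                   # illustrative
      script: |
        echo "running load test"
      resources:                      # per-step requests/limits
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```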

If there were something like a queueing system, it would instead never time out and just wait until resources become available. This obviously only matters during busy periods on the cluster. Hence the request for something like a queueing system.

I also found that it is possible to do something like a lease, but that needed additional self-made resources. I would like to turn this into a solution for our DevOps teams.

We currently give each DevOps team a certain amount of resources to perform pipeline-related tasks within a namespace. But that means a lot of resources are potentially wasted when some DevOps teams are not running their pipelines on certain days. We want to instead pool all the resources for our DevOps teams, so we can reserve some nodes for pipeline-related runtimes, share the resources, and utilize some sort of queueing. It's all about efficiency and effectiveness :)
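
For context (names and amounts are made up), that per-team budget is usually expressed as a Kubernetes ResourceQuota in the team's namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pipeline-quota                # illustrative
  namespace: team-a                   # illustrative team namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```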

Hope that makes sense :)

sibelius commented 5 months ago

Why does PENDING time out?

benoitschipper commented 5 months ago

> Why does PENDING time out?

Due to a lack of compute capacity within a set quota, or of overall compute capacity on the cluster, the Kubernetes scheduler is unable to schedule the pod on any node with the requested CPU/memory/storage, which makes it go into Pending.

https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/#my-pod-stays-pending

I think it might also have something to do with the default timeout. From searching the web and the Tekton docs, the relevant topics are the reasons for the Pending state, Tekton's own timeouts, how to customize them, how to find the specific timeout values, and some important considerations.
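
For reference (values are illustrative), the run-level timeout that eventually fails a long-waiting run can be raised per PipelineRun, or cluster-wide via the config-defaults ConfigMap:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: load-test-run                 # illustrative
spec:
  pipelineRef:
    name: load-test                   # illustrative
  timeouts:
    pipeline: "4h0m0s"                # overall run timeout; the default is 1h
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-timeout-minutes: "240"      # cluster-wide default for new runs
```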

sibelius commented 3 months ago

Can we increase the PENDING timeout?

Or put these tasks in a "queue"?

Like the CouldNotGetTask error.

benoitschipper commented 3 months ago

> Can we increase the PENDING timeout?
>
> Or put these tasks in a "queue"?
>
> Like the CouldNotGetTask error.

A queueing mechanism in Tekton would be great, but this is the thread for that feature request, so it is not a possibility as of yet.

I am not sure whether you can increase the "PENDING" duration of all of your pods on the cluster, but that would be a workaround; I am not sure there is a setting for it. Maybe setting terminationGracePeriodSeconds is an option, or a command that extends the life of a pod with a sleep 30, but this is not a great solution.

The best option would be a queueing mechanism for Tekton, with some settings that allow you to manage the queue 🙂