ahg-g opened this issue 2 years ago
FYI @terrytangyuan
Also, extracted from a comment in https://bit.ly/kueue-apis (can't find the person's github)
A compromise might be a way of submitting a job, but have it "paused" so that the workflow manager can unpause it after its deps have been met, but the job still can wait in line in the queue so it doesn't add a lot of wall clock time. The scheduler would ignore any paused jobs until they are unpaused?
The idea is to allow for a dependent job to jump to the head of the queue when the dependencies are met.
Yes, but it essentially only jumps to the head of the line if it already was at the head of the line.
I guess I'll have to read through the design doc for queue APIs in order to understand the use case better here. Any thoughts on what the integration looks like and how the two interoperate with each other?
Consider there to be two components. a queue, and a scheduler. The queue is where jobs wait in line. A scheduler picks entries to work on at the head of the line.
Sometimes in the real world, it's a family waiting in line. One member goes off to use the bathroom. If they are not back by the time it's their turn, they usually say, "let the next folks go, we're not ready yet". The scheduler in this case just ignores that entry and goes to the next entry in the queue. The option to allow jobs to be "not ready yet, don't schedule me, but still queue me" could be interesting to various workflow managers.
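For illustration only, a job submitted in this "queued but paused" state could be sketched with the batch/v1 suspend field; the queue label is the one current Kueue uses, and the other names are hypothetical. The workflow manager would flip suspend to false once the dependencies are met.

apiVersion: batch/v1
kind: Job
metadata:
  name: dependent-step                      # hypothetical
  labels:
    kueue.x-k8s.io/queue-name: team-queue   # hypothetical LocalQueue
spec:
  suspend: true   # "not ready yet, don't schedule me, but still queue me"
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "echo run dependent step"]
        resources:
          requests:
            cpu: "1"
            memory: 1Gi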
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Would a similar integration like Argo and Volcano work in this case?
https://github.com/volcano-sh/volcano/blob/master/example/integrations/argo/20-job-DAG.yaml
Not really. That seems to be creating a different job for each step of the workflow. Then, each job enters the queues only after the previous step has finished. This can already be accomplished with Kueue and batch/v1.Job.
We would like to enhance the experience roughly as described here: https://github.com/kubernetes-sigs/kueue/issues/74#issuecomment-1051285404
Hi, I am trying to figure out if I could use Kueue for queueing Tekton PipelineRuns (more info on tekton at tekton.dev/docs). From reading bit.ly/kueue-apis, it seems like Kueue is going to have separate controllers that create Workload objects for different types of workloads (although I'm not sure if that's the case yet).
Would it be reasonable to write a separate controller that creates Workload objects for pending PipelineRuns, and starts the PipelineRuns when the workload is admitted by the queue? I'm not sure if this is possible because it seems like kueue somehow mutates the workloads' node affinity directly, and the relationship between PipelineRuns and pod specs doesn't work in quite the same way as between Jobs and pod specs.
I'm also curious if it's possible to create a queue that is just based on count of running objects rather than their compute resource requirements.
More details on what I'm trying to do: https://github.com/tektoncd/community/blob/main/teps/0132-queueing-concurrent-runs.md
it seems like Kueue is going to have separate controllers that create Workload objects for different types of workloads (although I'm not sure if that's the case yet).
These controllers can live in the Kueue repo, the tekton repo, or a new repo altogether. We currently have a controller for kubeflow MPIJob in the kueue repo. If the Tekton community is open to having this integration, we can discuss where the best place to put it is.
Would it be reasonable to write a separate controller that creates Workload objects for pending PipelineRuns, and starts the PipelineRuns when the workload is admitted by the queue?
Depends on what you want. When talking about workflows, there are two possibilities: (a) queue the entire workflow or (b) queue the steps.
I'm not sure if this is possible because it seems like kueue somehow mutates the workloads' node affinity directly, and the relationship between PipelineRuns and pod specs doesn't work in quite the same way as between Jobs and pod specs.
Injecting node affinities is the mechanism to support fungibility (example: this job can run on ARM or x86, let kueue decide to run it where there is still quota). If this is not something that matters to you, you can simply not create flavors.
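For example, a minimal sketch of two flavors (using the current kueue.x-k8s.io/v1beta1 shapes, which may differ from the API at the time of this thread; names and quotas are made up). A workload is admitted under whichever flavor still has quota, and that flavor's nodeLabels are injected as node affinity.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: arm
spec:
  nodeLabels:
    kubernetes.io/arch: arm64
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: x86
spec:
  nodeLabels:
    kubernetes.io/arch: amd64
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: arm
      resources:
      - name: cpu
        nominalQuota: 100
      - name: memory
        nominalQuota: 400Gi
    - name: x86
      resources:
      - name: cpu
        nominalQuota: 100
      - name: memory
        nominalQuota: 400Gi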
I'm also curious if it's possible to create a queue that is just based on count of running objects rather than their compute resource requirements.
Kueue is a quota-based system. Currently it uses pod resource requests and we plan to add number of pods #485. What kind of object would make sense to count in Tekton? I would expect that there should be resource requests somewhere.
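To make that concrete, the object such a controller would create is a Workload describing the pod sets and their requests. A rough sketch (field names per the current v1beta1 API; the TaskRun details and names are hypothetical):

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: taskrun-build-image   # hypothetical, e.g. one Workload per queued TaskRun
  namespace: ci
spec:
  queueName: ci-queue         # hypothetical LocalQueue
  podSets:
  - name: main
    count: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: step
          image: example.com/builder:latest   # hypothetical image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi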
I'll comment more when I finish reading the doc above. Thanks for sharing :)
cc @kerthcet
Thanks for your response!
These controllers can live in the Kueue repo, the tekton repo, or a new repo altogether. We currently have a controller for kubeflow MPIJob in the kueue repo. If the Tekton community is open to having this integration, we can discuss where the best place to put it is.
Still in the early exploration phase, but looking forward to discussing more what would work!
Kueue is a quota-based system. Currently it uses pod resource requests and we plan to add number of pods #485. What kind of object would make sense to count in Tekton? I would expect that there should be resource requests somewhere.
Tekton uses PipelineRuns, which are DAGs of TaskRuns, and each TaskRun corresponds to a pod. One of our use cases is basically just to avoid overwhelming a kube cluster, in which case queueing based on resource requirements would be useful. However, there are some wrinkles with how we handle resource requirements, since we have containers running sequentially in a pod rather than in parallel, so the default k8s assumption that pod resource requirements are the sum of container resource requirements doesn't apply. For this reason, queueing based on TaskRun or PipelineRun count may be simpler for us. Since TaskRuns correspond to pods, queueing based on pod count would solve the TaskRun use case at least.
We also have some use cases that would probably need to be met in Tekton with a wrapper API (e.g. "I want to have only 5 PipelineRuns at a time of X Pipeline that communicates with a rate-limited service"; "I want to have only one deployment PipelineRun running at a time", etc). If we could use Kueue to create a queue of at most X TaskRuns, we'd be in good shape to design something in Tekton meeting these needs.
Since TaskRuns correspond to pods, queueing based on pod count would solve the TaskRun use case at least.
Yes, the pod count would help. But I would encourage users to also add pod requests. This is particularly important for HPC workflows. You might want dedicated CPUs and accelerators.
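For reference, a sketch of a ClusterQueue that caps both pod count and requests, assuming the pod-count quota from #485 as implemented in later Kueue releases via the special "pods" resource name (the flavor name and numbers are made up):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tekton-queue           # hypothetical
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["pods", "cpu", "memory"]
    flavors:
    - name: default-flavor     # assumes a ResourceFlavor with this name exists
      resources:
      - name: pods
        nominalQuota: 50       # at most 50 admitted pods (TaskRuns) at a time
      - name: cpu
        nominalQuota: 200
      - name: memory
        nominalQuota: 800Gi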
I agree that it wouldn't make sense to queue at a lower level than TaskRuns.
You are welcome to add a topic to our WG Batch meetings if you want to show your design proposals for queuing workflows.
https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit
Some feedback on this: we use Tekton + Argo CD for our CI/CD pipelines. For cost effectiveness, we deploy Tekton together with other (non-production) application services, so we run into insufficient resources when there are a lot of CI runs and have to isolate them. Queueing is important for Tekton as well, I think.
We have waitForPodsReady, which waits until the previous job has enough pods running. I think we could extend this to something like pendingForTargetQuantity: for a Job it would still be the pod count, but for Tekton it would wait for a target number of PipelineRuns/TaskRuns. We would also need to implement suspend in PipelineRun/TaskRun.
I think resource management would be great for Tekton, but if not, we could also get by with watching the PipelineRun/TaskRun count. That would need a refactor of Kueue, though, since resources are currently required. Just brainstorming.
Another concern is preemption; I think it would be dangerous for Tekton in some cases, like deploying applications.
@alculquicondor @ahg-g I added https://github.com/argoproj/argo-workflows/issues/12363 to track and hopefully would attract more contributors to work on this.
@terrytangyuan FYI: we're working on https://github.com/kubernetes/kubernetes/issues/121681 for workflow support.
It is possible to use pod-level integration using the Plain Pods approach.
We use this config snippet (from kueue-manager-config) to integrate Argo Workflows into Kueue:
integrations:
  frameworks:
  - "pod"
  podOptions:
    # You can change namespaceSelector to define in which
    # namespaces kueue will manage the pods.
    namespaceSelector:
      matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: NotIn
        values: [ kube-system, kueue-system ]
    # Kueue uses podSelector to manage pods with particular
    # labels. The default podSelector will match all the pods.
    podSelector:
      matchExpressions:
      - key: workflows.argoproj.io/completed
        operator: In
        values: [ "false", "False", "no" ]
This configuration adds a scheduling gate to each Argo Workflows pod and will only release it once there is quota available.
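For reference, a matching pod ends up looking roughly like this before admission (gate and label names as used by recent Kueue releases; the queue name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  labels:
    kueue.x-k8s.io/managed: "true"
    kueue.x-k8s.io/queue-name: user-queue    # hypothetical LocalQueue
spec:
  schedulingGates:
  - name: kueue.x-k8s.io/admission           # removed once the Workload is admitted
  containers:
  - name: main
    image: argoproj/argosay:v2
    resources:
      requests:
        cpu: "1"
        memory: 1Gi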
Thanks for putting an example here :)
Yes, that's right. The plain Pod integration could potentially support Argo Workflows. However, the plain Pod integration doesn't support all Kueue features, such as partial admission. So native Argo Workflows support would be worthwhile.
Regarding the features not supported in the plain pod integration, please see for more details: https://github.com/kubernetes-sigs/kueue/tree/main/keps/976-plain-pods#non-goals
Oh that's cool. How do you set up the queue-name in the Pods?
I'm not familiar with Argo. Does it have support for pods working in parallel or pods that all need to start together?
Another thing to note is that, with the behavior you are getting, Pods are created only when their dependencies complete. Meaning that, in a busy cluster, a workflow might spend too much time waiting in the queue for each step. Is this acceptable?
It's probably acceptable for some users. Would you be willing to write a tutorial for the kueue website?
Oh that's cool. How do you set up the queue-name in the Pods?
You can use either spec.templates[].metadata or spec.podMetadata to define the queue.
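For example, a minimal sketch using podMetadata (the queue name and image are hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: queued-wf-
spec:
  entrypoint: main
  podMetadata:
    labels:
      kueue.x-k8s.io/queue-name: user-queue   # applied to every pod of the workflow
  templates:
  - name: main
    container:
      image: argoproj/argosay:v2
      resources:
        requests:
          cpu: "1"
          memory: 1Gi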
I'm not familiar with Argo. Does it have support for pods working in parallel or pods that all need to start together?
Argo supports parallel execution of pods, and those pods are only created when each "node" of the workflow is ready to run. This type of integration simply prevents each pod from executing until they pass Kueue's admission checks.
Another thing to note is that, with the behavior you are getting, Pods are created only when their dependencies complete. Meaning that, in a busy cluster, a workflow might spend too much time waiting in the queue for each step. Is this acceptable?
I'm still waiting to see how well it works. I don't expect the wait time between nodes to be a problem, but a backlog of partially complete workflows may become problematic.
Most of the use cases revolve around ETL nodes followed by process nodes and vice-versa. Depending on how the queues are configured, I could end up with too many partially complete workflows that take up ephemeral resources.
It's probably acceptable for some users. Would you be willing to write a tutorial for the kueue website?
Sure.
Is there any progress on supporting Argo/Tekton workflows?
I don't think anyone has followed through with it. Would you like to propose something? I think we might require changes in both projects, but at least the Argo community is in favor of doing something: https://github.com/argoproj/argo-workflows/issues/12363
@alculquicondor I'm confused. Isn't it possible to support argo-workflows indirectly through pod integration?
It is indeed possible. But a tighter integration, with atomic admission, would be beneficial.
If the user wants to run a step that contains multiple pods only when all of those pods can run, we need some way to know which pods belong in the same workload. So the pod integration alone may not be enough.
cc @Zhuzhenghao Discussion about integrating Kueue with tekton.
https://github.com/argoproj/argo-workflows/issues/12363 has 22 upvotes. We just need someone to drive this.
@terrytangyuan Hi, is there any conclusion about what exactly to suspend (the entire workflow or the layer)?
We developed two different ways to support queueing of workflows in our environment:
- Users can define the max resources for the entire workflow. During execution, the workflow's total resources cannot exceed the admitted amount.
- Integrate a controller that rewrites the workflow to insert a suspend template before each layer, and creates a Workload for each suspend layer (a rough sketch follows below).
The plain Pods suspend is also an available method. I can write a simple KEP laying out the advantages and disadvantages of each method and use it to track the discussion. @alculquicondor @tenzen-y Hi, is anyone working on this?
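For illustration, the second approach might look roughly like this after injection into an Argo Workflow (template and workflow names are hypothetical; the controller would resume the suspend node once the layer's Workload is admitted):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: queued-dag-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: wait-for-layer-1          # injected suspend node
        template: kueue-admission-gate
    - - name: layer-1-step-a
        template: work
      - name: layer-1-step-b
        template: work
  - name: kueue-admission-gate          # injected suspend template
    suspend: {}
  - name: work
    container:
      image: argoproj/argosay:v2
      resources:
        requests:
          cpu: "1"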
@KunWuLuan is the mentioned controller opensourced? Thanks :)
@KunWuLuan Thank you for tackling this issue.
For the first approach: does this indicate a new API object or field, or reusing existing API objects or fields?
For the second approach: does this indicate part of a Job integration controller implementing the GenericJob interface, similar to batch/v1 Job and other Jobs?
As a first step, it would be a great improvement if you could provide documents and examples for Plain Pod Integration + ArgoWorkflows.
What does "layer" mean here? One step?
If so, I think it may be possible to create workloads for all the steps (parallel steps as one workload) and suspend them all. Once a workload finishes, allow the next one; the controller knows the dependencies.
However, how can we distinguish the injected suspend from a user-configured suspend?
I think approach 1) can be a simple start. Anyway, glad to see the KEP.
Note that someone started a PR to document how to use the plain pods integration with argo https://github.com/kubernetes-sigs/kueue/pull/1545, but they abandoned it.
Regardless, I would be interested in a more robust support at the layer level. See this comment for my high level proposal https://github.com/argoproj/argo-workflows/issues/12363#issuecomment-1870421459
Plain Pods... could that work with GitLab runner jobs too? The lack of scheduling there has been a pain.
@KunWuLuan is the mentioned controller opensourced? Thanks :)
The controller is not opensourced yet.
Does this indicate a new API object or field, or reusing existing API objects or fields?
Yes, we introduced a specific key in the workflow's annotations, like:
annotations:
  min-resources: |
    cpu: 5
    memory: 5G
Does this indicate part of a Job integration controller implementing the GenericJob interface, similar to batch/v1 Job and other Jobs?
Yes, we deployed a Job integration controller that contains a controller to create a CR like Workload and a controller to inject a suspend template into the original workflow.
As a first step, it would be a great improvement if you could provide documents and examples for Plain Pod Integration + ArgoWorkflows.
No problem, working on it. :)
Yes, we deployed a Job integration controller that contains a controller to create a CR like Workload and a controller to inject a suspend template into the original workflow.
That seems useful, but annotations are not a sustainable API. Argo folks were in favor of doing a proper integration, so we can probably change their API to accommodate the needs of the integration.
But again, something at the layer level is probably better.
I think that we want to support the creation of Workloads at the layer level as well, and we want to push all Workloads sequentially. This layer-level approach allows us to avoid wasting resources on the entire workflow.
But, I think that we can evaluate the layer-level approach during the KEP (https://github.com/kubernetes-sigs/kueue/pull/2976).
@alculquicondor @tenzen-y I introduced a KEP to discuss the advantages and constraints of three different granularity levels for supporting workflows; three approaches for supporting workflows at the layer level are also proposed.
@terrytangyuan If you have time, please also have a look, thanks very much.
Awesome! I'll share the proposal around the Argo Workflows community as well.
This is lower priority than https://github.com/kubernetes-sigs/kueue/issues/65, but it would be good to have an integration with a workflow framework.
Argo supports the suspend flag; the tricky part is that suspend applies to the whole workflow, meaning a QueuedWorkload would need to represent the resources of the whole workflow all at once.
Ideally Argo should create jobs per sequential step, and then resource reservation happens one step at a time.
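For illustration, queueing the entire workflow at once could be sketched with Argo's whole-workflow suspend flag plus a queue label (names are hypothetical); the workflow would only be resumed once a Workload representing its total resources is admitted:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: whole-wf-
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # hypothetical LocalQueue
spec:
  suspend: true        # the whole-workflow suspend mentioned above
  entrypoint: main
  templates:
  - name: main
    container:
      image: argoproj/argosay:v2
      resources:
        requests:
          cpu: "4"
          memory: 8Gi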