ahg-g opened this issue 1 year ago
The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

```yaml
key: job-name
operator: NotIn
values:
- my-job
```
It might be better to keep the "exclusive" semantics:
```yaml
topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
```
> The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

Correct, it is implicit.
> It might be better to keep the "exclusive" semantics:

So my thinking is that the "exclusive" semantics are supported using the `AntiAffinityPods` and `AntiAffinityNamespaces` selectors, since they offer greater flexibility to tune and define exclusiveness (e.g., exclusive against all pods, or only against other jobs, etc.).
> The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):
>
> Correct, it is implicit.

This is my worry then. There is a hidden addition to the anti-affinity rules that is not obvious.
> This is my worry then. There is a hidden addition to the anti-affinity rules that is not obvious.

We have to have that implicitly added in all cases, though.
Yes, but IMO it's easier to explain if you say: "exclusive: true" of sorts, compared to mutating matchExpressions that the user provides.
Right, but I feel we anyway need to provide a knob (a pod selector) so users can specify what the job is exclusive against, and by definition that shouldn't include the job itself. So having another parameter to say "exclusive" would be redundant in that sense, right?
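For concreteness, the rule that would have to be injected for UC1 looks roughly like the following (a sketch only; the `job-name` label key and the `my-job` value are illustrative):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: Exists   # repel pods that belong to any job...
        - key: job-name
          operator: NotIn    # ...except pods of this job itself (the implicit term)
          values:
          - my-job
```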
In that case, I don't think the name `AntiAffinityPods` is providing enough context. Could it be misunderstood as a means to implement 1:1 pod per node?
It is a selector on the pods, hence the name; any other suggestions?
I would be tempted to call it `jobAntiAffinity` to detach it from the pods themselves. The fact that the pods get the pod anti-affinity rule is an implementation detail. But I could picture arguments against.
I'll stop here. Maybe someone else has a better idea :)
Aldo's proposal is quite appealing in its simplicity:
```yaml
topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
```
When there are needs for more advanced anti-affinity rules, maybe that can be left for classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27; see the sketch below)
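For example, once the KEP-3633 label-key fields are available, something along these lines on the pod template could express job-level exclusiveness without the controller mutating user-provided expressions (a sketch only; it relies on `mismatchLabelKeys`, the sibling of `matchLabelKeys`, and assumes pods carry a `job-name` label):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: Exists   # only consider pods that belong to some job
      mismatchLabelKeys:
      - job-name             # expanded at admission into: job-name NotIn (<this pod's job-name>)
```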
Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.
@alculquicondor quick clarifying question about your suggestion:
```yaml
topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
```
Is the below interpretation correct, or am I misunderstanding?

- `ExclusiveDomain` = assign 1 job per topology domain (e.g., 1 job per node pool)
- `SameDomain` = assign all jobs in one topology domain (e.g., all jobs on one node pool)
> Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.
> SameDomain = assign all jobs in one topology domain (e.g. all jobs on one node pool)

No, it means the Job itself is colocated in a single domain (node pool), but it can coexist with others.

If we are to go with an explicit enum, then it would probably be: `Colocated|ColocatedAndExclusive`
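In other words, something like this (a sketch only):

```yaml
topologyColocation:
  key: rack
  mode: Colocated                # the Job lands in a single rack, but may share it (UC2)
  # mode: ColocatedAndExclusive  # the Job lands in a single rack and keeps other jobs out (UC1)
  enforcement: Strict
```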
> When there are needs for more advanced anti-affinity rules, maybe that can be left for classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in https://github.com/kubernetes-sigs/jobset/issues/27)

I don't think it is doable in a reliable and non-surprising way, because we need to have a selector term that prevents the job from excluding itself. The API in the main post makes that selector term explicit, attached to the specified topology and the namespace where this applies.
> Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.

But it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict while the others are preferred:
```yaml
colocation:
- topologyKey: rack
  mode: BestEffort
- topologyKey: zone
  mode: Strict
```
If we want to support more levels, then the API should allow setting pod-affinity weights (with higher weights for the innermost ones, as you pointed out offline). At this point the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly :)
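To illustrate that translation, the list above would roughly lower to the following pod affinity (a sketch only; the `jobset-name` label and the weight are placeholders):

```yaml
affinity:
  podAffinity:
    # zone: Strict -> a hard requirement
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: zone
      labelSelector:
        matchLabels:
          jobset-name: my-jobset
    # rack: BestEffort -> a preference (innermost levels would get the highest weights)
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: rack
        labelSelector:
          matchLabels:
            jobset-name: my-jobset
```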
> so its value becomes less obvious compared to just using pod-affinity directly :)

Yes, but at that point it's better to enhance the pod spec to support what we need. In the meantime, JobSet can support the minimal requirement: probably just Exclusive and Strict.
> it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict while the others are preferred:

Exactly. Btw. since BestEffort/Strict is a choice only for the outermost level (the inner ones are always BestEffort), mode could live outside the list:
```yaml
colocation:
  topologyKey: [rack, zone]
  mode: Strict
```
> UC1: Exclusive 1:1

I'm not sure how the list should support exclusive, though. I guess exclusive makes the most sense when the list has a single entry.
Btw. why would jobs want exclusive actually? Because of some noisy neighbors? I suspect what we really want is an all-or-nothing mechanism to avoid deadlocks and exclusive is just a workaround for that? Like "I would be fine to share the rack with someone if we both fit, but I want to avoid deadlocks when two of us fit partially -- so I will keep the rack exclusive as a workaround".
> At this point, the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly

I didn't follow this. The value would be a much simpler API for the user. The implementation would set pod affinities and weights, but the user would only deal with the 'colocation' API, not with pod affinities, right?
That said, it also makes a lot of sense if you decide to focus on something simpler in the beginning.
> Btw. why would jobs want exclusive actually? Because of some noisy neighbors?
Does this mean one pod per machine? Or exclusive use of some machine? There are many uses for our workflows for which we would absolutely not want to share a machine as it would impact performance.
This was addressed in #309 and #342 and included in release v0.3.0 🥳
Actually, we didn't change the API (it's still an annotation), just its implementation, to improve scheduling throughput for large-scale training. Feel free to re-open this if you want to explore this further.
This issue is about having a proper API, not the optimizations.
The existing `spec.replicatedJobs[*].exclusive` API allows users to express that the jobs have a 1:1 assignment to a domain. For example, one and only one job per rack. This is currently expressed as:
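(A sketch of the current shape; exact field names may differ from the real API types.)

```yaml
spec:
  replicatedJobs:
  - name: workers
    replicas: 4
    exclusive:
      topologyKey: cloud.google.com/gke-nodepool   # one and only one job per node pool
    template:
      # ... Job template ...
```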
This API is simple compared to the complex pod affinity and anti-affinity rules that a user would have needed to add to achieve the same semantics (see https://github.com/kubernetes-sigs/jobset/issues/27).
However, one issue with this API is that it is limited to one use case (exclusive 1:1 placement); I would like to discuss making it a bit more generic to support the following use cases without losing simplicity:
Use Cases
- UC1: Exclusive 1:1 job-to-domain assignment (what the current API offers)
- UC2: Each job is colocated in one domain, but not necessarily exclusive to it (i.e., more than one job could land on the same domain)
- UC3: Allow users to express either preferred or required assignment?
What other use cases?
API options
UC3 can be supported by extending the existing API; however, the same can't be said for UC2, since the type name "Exclusive" doesn't lend itself to that use case, even if we do https://github.com/kubernetes-sigs/jobset/issues/40
Option 1
UC1
UC2
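A sketch of the shape for this option, based on the `AntiAffinityPods` / `AntiAffinityNamespaces` selectors mentioned in the discussion (field names are illustrative, not a final spelling):

```yaml
colocation:
  topologyKey: rack
  # UC1: exclusive 1:1 -- keep every other job's pods out of the domain
  antiAffinityPods:
    matchExpressions:
    - key: job-name
      operator: Exists
  antiAffinityNamespaces:
    matchLabels:
      kubernetes.io/metadata.name: my-namespace
  # UC2: colocated but not exclusive -- omit the anti-affinity selectors
```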
What other options?