ahg-g opened this issue 1 year ago
The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

```yaml
key: job-name
operator: NotIn
values:
- my-job
```
It might be better to keep the "exclusive" semantics:
```yaml
topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
```
> The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

Correct, it is implicit.
> It might be better to keep the "exclusive" semantics:

So my thinking is that the "exclusive" semantics are supported using the `AntiAffinityPods` and `AntiAffinityNamespaces` selectors, since they offer greater flexibility to tune and define exclusiveness (e.g., exclusive against all pods, or only against other jobs, etc.).
> The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):
>
> Correct, it is implicit.

This is my worry then. There is a hidden addition to the anti-affinity rules that is not obvious.
> This is my worry then. There is a hidden addition to the anti-affinity rules that is not obvious.

We have to have that implicitly added in all cases, though.
Yes, but IMO it's easier to explain if you say: "exclusive: true" of sorts, compared to mutating matchExpressions that the user provides.
Right, but I feel we anyway need to provide a knob (a pod selector) so users can specify what the job is exclusive against, and by definition that shouldn't include the job itself. So having another parameter to say "exclusive" would be redundant in that sense, right?
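For concreteness, the rule that would have to be injected for UC1 looks roughly like the following (a sketch only; the `job-name` label key and the `my-job` value are illustrative):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: Exists   # repel pods that belong to any job...
        - key: job-name
          operator: NotIn    # ...except pods of this job itself (the implicit term)
          values:
          - my-job
```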
In that case, I don't think the name `AntiAffinityPods` is providing enough context. Could it be misunderstood as a means to implement 1:1 pod per node?
It is a selector on the pods, hence the name; any other suggestions?
I would be tempted to call it `jobAntiAffinity` to detach it from the pods themselves. The fact that the pods get the pod anti-affinity rule is an implementation detail. But I could picture arguments against.
I'll stop here. Maybe someone else has a better idea :)
Aldo's proposal is quite appealing in its simplicity:
```yaml
topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
```
When there are needs for more advanced anti-affinity rules, maybe that can be left for classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27; see the sketch below)
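For example, once the KEP-3633 label-key fields are available, something along these lines on the pod template could express job-level exclusiveness without the controller mutating user-provided expressions (a sketch only; it relies on `mismatchLabelKeys`, the sibling of `matchLabelKeys`, and assumes pods carry a `job-name` label):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: Exists   # only consider pods that belong to some job
      mismatchLabelKeys:
      - job-name             # expanded at admission into: job-name NotIn (<this pod's job-name>)
```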
Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.
@alculquicondor quick clarifying question about your suggestion:
```yaml
topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
```
Is the below interpretation correct, or am I misunderstanding?

- `ExclusiveDomain` = assign 1 job per topology domain (e.g., 1 job per node pool)
- `SameDomain` = assign all jobs in one topology domain (e.g., all jobs on one node pool)
> Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.
> SameDomain = assign all jobs in one topology domain (e.g. all jobs on one node pool)

No, it means the Job itself is colocated in a single domain (node pool), but it can coexist with others.

If we are to go with an explicit enum, then it would probably be: `Colocated|ColocatedAndExclusive`
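In other words, something like this (a sketch only):

```yaml
topologyColocation:
  key: rack
  mode: Colocated                # the Job lands in a single rack, but may share it (UC2)
  # mode: ColocatedAndExclusive  # the Job lands in a single rack and keeps other jobs out (UC1)
  enforcement: Strict
```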
> When there are needs for more advanced anti-affinity rules, maybe that can be left for classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in https://github.com/kubernetes-sigs/jobset/issues/27)

I don't think it is doable in a reliable and non-surprising way, because we need to have a selector term that prevents the job from excluding itself. The API in the main post makes that selector term explicit, attached to the specified topology and the namespace where this applies.
> Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.

But it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict while the others are preferred:
```yaml
colocation:
- topologyKey: rack
  mode: BestEffort
- topologyKey: zone
  mode: Strict
```
If we want to support more levels, then the API should allow setting pod-affinity weights (with higher weights for the innermost ones, as you pointed out offline). At this point the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly :)
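To illustrate that translation, the list above would roughly lower to the following pod affinity (a sketch only; the `jobset-name` label and the weight are placeholders):

```yaml
affinity:
  podAffinity:
    # zone: Strict -> a hard requirement
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: zone
      labelSelector:
        matchLabels:
          jobset-name: my-jobset
    # rack: BestEffort -> a preference (innermost levels would get the highest weights)
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: rack
        labelSelector:
          matchLabels:
            jobset-name: my-jobset
```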
> so its value becomes less obvious compared to just using pod-affinity directly :)

Yes, but at that point it's better to enhance the pod spec to support what we need. In the meantime, JobSet can support the minimal requirement: probably just Exclusive and Strict.
> it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict while the others are preferred:

Exactly. Btw. since BestEffort/Strict is a choice only for the outermost level (the inner ones are always BestEffort), mode could live outside the list:
```yaml
colocation:
  topologyKey: [rack, zone]
  mode: Strict
```
> UC1: Exclusive 1:1

I'm not sure how the list should support exclusive, though. I guess exclusive makes the most sense when the list has a single entry.
Btw. why would jobs want exclusive actually? Because of some noisy neighbors? I suspect what we really want is an all-or-nothing mechanism to avoid deadlocks and exclusive is just a workaround for that? Like "I would be fine to share the rack with someone if we both fit, but I want to avoid deadlocks when two of us fit partially -- so I will keep the rack exclusive as a workaround".
> At this point, the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly

I didn't follow this. The value would be a much simpler API for the user. The implementation would set pod affinities and weights, but the user would only deal with the 'colocation' API, not with pod affinities, right?
That said, it also makes a lot of sense if you decide to focus on something simpler in the beginning.
> Btw. why would jobs want exclusive actually? Because of some noisy neighbors?
Does this mean one pod per machine? Or exclusive use of some machine? There are many uses for our workflows for which we would absolutely not want to share a machine as it would impact performance.
This was addressed in #309 and #342 and included in release v0.3.0 🥳
Actually, we didn't change the API (it's still an annotation), just its implementation, to improve scheduling throughput for large-scale training. Feel free to re-open this if you want to explore this further.
This issue is about having a proper API, not the optimizations.
The existing `spec.replicatedJobs[*].exclusive` API allows users to express that the jobs have a 1:1 assignment to a domain. For example, one and only one job per rack. This is currently expressed as:
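(A sketch of the current shape; exact field names may differ from the real API types.)

```yaml
spec:
  replicatedJobs:
  - name: workers
    replicas: 4
    exclusive:
      topologyKey: cloud.google.com/gke-nodepool   # one and only one job per node pool
    template:
      # ... Job template ...
```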
This API is simple compared to the complex pod affinity and anti-affinity rules that a user would have needed to add to achieve the same semantics (see https://github.com/kubernetes-sigs/jobset/issues/27).
However, one issue with this API is that it is limited to one use case (exclusive 1:1 placement); I would like to discuss making it a bit more generic to support the following use cases without losing simplicity:
Use Cases
- UC1: Exclusive 1:1 job-to-domain assignment (what the current API offers)
- UC2: Each job is colocated in one domain, but not necessarily exclusive to it (i.e., more than one job could land on the same domain)
- UC3: Allow users to express either preferred or required assignment?
What other use cases?
API options
UC3 can be supported by extending the existing API; however, the same can't be said for UC2, since the type name "Exclusive" doesn't lend itself to that use case, even if we do https://github.com/kubernetes-sigs/jobset/issues/40
Option 1
UC1
UC2
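A sketch of the shape for this option, based on the `AntiAffinityPods` / `AntiAffinityNamespaces` selectors mentioned in the discussion (field names are illustrative, not a final spelling):

```yaml
colocation:
  topologyKey: rack
  # UC1: exclusive 1:1 -- keep every other job's pods out of the domain
  antiAffinityPods:
    matchExpressions:
    - key: job-name
      operator: Exists
  antiAffinityNamespaces:
    matchLabels:
      kubernetes.io/metadata.name: my-namespace
  # UC2: colocated but not exclusive -- omit the anti-affinity selectors
```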
What other options?