kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Respect SchedulingPolicy #518

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

/kind feature

Currently, the mpi-operator does not support RunPolicy.SchedulingPolicy. For example, it does not respect SchedulingPolicy.MinAvailable when it creates the PodGroup here:

https://github.com/kubeflow/mpi-operator/blob/5f1914bfb29b9b94db209e620a3b2bf7e5ca7d9f/pkg/controller/mpi_job_controller.go#L1291-L1316

Also, by supporting SchedulingPolicy, we could set the various PodGroup parameters for the coscheduling plugin:

// PodGroupSpec represents the template of a pod group.
type PodGroupSpec struct {
    // MinMember defines the minimal number of members/tasks to run the pod group;
    // if there's not enough resources to start all tasks, the scheduler
    // will not start anyone.
    MinMember int32 `json:"minMember,omitempty"`

    // MinResources defines the minimal resource of members/tasks to run the pod group;
    // if there's not enough resources to start all tasks, the scheduler
    // will not start anyone.
    MinResources v1.ResourceList `json:"minResources,omitempty"`

    // ScheduleTimeoutSeconds defines the maximal time of members/tasks to wait before run the pod group;
    ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}

https://github.com/kubernetes-sigs/scheduler-plugins/blob/f996e5caf6c77d521d574186dca793e351c45413/apis/scheduling/v1alpha1/types.go#L139-L153

tenzen-y commented 1 year ago

/assign

Blocking #500.