kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Fix a bug where the PodGroupCtrl cannot list priorityClasses #561

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

Currently, we don't pass a priorityClass lister to the podGroupCtrl for scheduler-plugins. As a result, a nil pointer dereference occurs when that podGroupCtrl lists priorityClasses, as the following log shows:

E0607 07:29:50.611829       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 437 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19cf5a0?, 0x2d22520})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.7/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00081c620?})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.7/pkg/util/runtime/runtime.go:49 +0x75
panic({0x19cf5a0, 0x2d22520})
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/kubeflow/mpi-operator/pkg/controller.(*SchedulerPluginsCtrl).calculatePGMinResources(0xc0008051d0, 0xc0014d971c, 0x1c7368d?)
    /go/src/github.com/kubeflow/mpi-operator/pkg/controller/podgroup.go:314 +0x1cc
github.com/kubeflow/mpi-operator/pkg/controller.(*SchedulerPluginsCtrl).newPodGroup(0xc001d86750?, 0xc000a19d40)
    /go/src/github.com/kubeflow/mpi-operator/pkg/controller/podgroup.go:249 +0x18a
github.com/kubeflow/mpi-operator/pkg/controller.(*MPIJobController).getOrCreatePodGroups(0xc00074cc60, 0xc000a19d40?)

To fix this, I now pass a priorityClass lister to the podGroupCtrl for scheduler-plugins.
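The failure mode and the fix can be sketched in isolation. The following is a minimal, self-contained illustration and not the operator's actual code: `PriorityClassLister`, `mapLister`, `podGroupCtrl`, `newPodGroupCtrl`, and `priorityOf` are hypothetical names invented for this sketch, and the lister is reduced to a single `Get` method. It only shows that calling a method on a lister field that was never set panics with a nil pointer dereference, and that injecting the lister through the constructor avoids it.

```go
package main

import (
	"errors"
	"fmt"
)

// PriorityClassLister is a hypothetical, simplified stand-in for a lister.
// It is an assumption for this sketch, not the real client-go interface.
type PriorityClassLister interface {
	Get(name string) (int32, error)
}

// mapLister is a toy in-memory lister used only for illustration.
type mapLister map[string]int32

func (m mapLister) Get(name string) (int32, error) {
	v, ok := m[name]
	if !ok {
		return 0, fmt.Errorf("priorityclass %q not found", name)
	}
	return v, nil
}

// podGroupCtrl is a hypothetical controller that depends on the lister.
type podGroupCtrl struct {
	priorityClassLister PriorityClassLister
}

// newPodGroupCtrl injects the lister and rejects a nil one up front,
// so a missing dependency surfaces at construction time, not at first use.
func newPodGroupCtrl(l PriorityClassLister) (*podGroupCtrl, error) {
	if l == nil {
		return nil, errors.New("priorityClass lister must not be nil")
	}
	return &podGroupCtrl{priorityClassLister: l}, nil
}

// priorityOf looks up the priority value of a named priority class.
func (c *podGroupCtrl) priorityOf(name string) (int32, error) {
	return c.priorityClassLister.Get(name)
}

// lookupWithoutLister reproduces the bug pattern: the lister field is left
// unset, so the method call on the nil interface panics. The panic is
// recovered here only so the sketch can report it.
func lookupWithoutLister() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	c := &podGroupCtrl{} // lister never passed in
	_, _ = c.priorityOf("high")
	return false
}

func main() {
	fmt.Println("panicked without lister:", lookupWithoutLister())

	c, err := newPodGroupCtrl(mapLister{"high": 1000})
	if err != nil {
		panic(err)
	}
	p, err := c.priorityOf("high")
	fmt.Println("priority with lister:", p, err)
}
```

Validating the dependency in the constructor is one possible design choice; it turns a runtime panic deep in reconciliation into an explicit error at start-up.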

tenzen-y commented 1 year ago

@alculquicondor I addressed your suggestions. PTAL.

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor


Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [alculquicondor]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.