When `enable-gang-scheduler=true`, tf-operator creates a `podgroup` custom resource so that the gang scheduler Volcano can allocate the pods. However, when the podgroup is created in func `SyncPodGroup` (https://github.com/kubeflow/common/blob/3fbe0ce982691279357e33e573a93d8ce4254584/pkg/controller.v1/common/job_controller.go#L211), only the pod information and `minMember` are set. This leaves some scheduler functions unusable and can cause unpredictable bugs during allocation.

For example, because the `minResources` field is not filled in, Volcano cannot distinguish tfjobs from BestEffort jobs: both kinds of job have `nil minResources`. As a result, every tfjob can become `inqueue`, and the `enqueue` and `reserve` actions lose their effect.

So in my opinion, we need to put more information about the tfjob into the podgroup, such as `minMember`, `queue`, and other fields, to make sure the gang scheduler works correctly.
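To make the proposal concrete, here is a minimal sketch of how the extra fields could be derived from a tfjob's replica specs. It is only an illustration: `ResourceList`, `ReplicaSpec`, `PodGroupSpec`, and `buildPodGroupSpec` are simplified, hypothetical stand-ins and not the real kubeflow/common or Volcano types, and summing per-pod requests over all replicas is my assumption about what `minResources` should contain.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for a Kubernetes resource list:
// resource name -> quantity (CPU in millicores, memory in bytes).
type ResourceList map[string]int64

// ReplicaSpec is a hypothetical, trimmed-down view of one tfjob replica type.
type ReplicaSpec struct {
	Replicas int64
	Requests ResourceList // per-pod resource requests
}

// PodGroupSpec holds only the fields discussed in this issue (hypothetical type).
type PodGroupSpec struct {
	MinMember    int64
	Queue        string
	MinResources ResourceList // stays nil when no replica requests anything
}

// buildPodGroupSpec sums the replica counts and the per-pod requests
// across all replica types of a job.
func buildPodGroupSpec(replicas map[string]ReplicaSpec, queue string) PodGroupSpec {
	spec := PodGroupSpec{Queue: queue}
	for _, r := range replicas {
		spec.MinMember += r.Replicas
		for name, qty := range r.Requests {
			if spec.MinResources == nil {
				spec.MinResources = ResourceList{}
			}
			spec.MinResources[name] += qty * r.Replicas
		}
	}
	return spec
}

func main() {
	tfjob := map[string]ReplicaSpec{
		"PS":     {Replicas: 1, Requests: ResourceList{"cpu": 1000, "memory": 2 << 30}},
		"Worker": {Replicas: 2, Requests: ResourceList{"cpu": 2000, "memory": 4 << 30}},
	}
	spec := buildPodGroupSpec(tfjob, "default")
	fmt.Println(spec.MinMember, spec.Queue, spec.MinResources["cpu"], spec.MinResources["memory"])
}
```

With something like this, a tfjob that requests resources would carry a non-nil `minResources`, while a job whose pods request nothing would keep it `nil`, so the scheduler would have a way to tell the two apart.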