Thanks for your report. It seems to be the same issue as #2034; we are dealing with it.
@talcoh2x Would you like to provide the job yaml?
```yaml
# High priority job
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:33:13Z"
    generation: 1
    name: high-priority-mpijob
    namespace: app
    resourceVersion: "5940716"
    uid: 67724b4f-84d0-473d-bc62-317bf686fa90
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              # command shortened
              command:
              - mpirun
              - ...
              image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              - mountPath: /etc/mpi
                name: mpi-job-config
              - mountPath: /root/logs
                name: logs
              workingDir: /root
            restartPolicy: Never
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: high-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: high-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
              # some mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            priorityClassName: high
            schedulerName: volcano
            volumes:
            # some volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:33:13Z"
      lastUpdateTime: "2022-07-07T10:33:13Z"
      message: MPIJob xxxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker: {}
    startTime: "2022-07-07T10:33:13Z"
# No priority jobs
- apiVersion: kubeflow.org/v1
  kind: MPIJob
  metadata:
    creationTimestamp: "2022-07-07T10:32:17Z"
    generation: 1
    name: no-priority-mpijob
    namespace: app
    resourceVersion: "5940444"
    uid: 9359f6ef-1bde-427d-b4bf-74a86fe3467a
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - args:
              - sleep 1d
              command:
              - mpirun
              - --allow-run-as-root
              # ....
              image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
              volumeMounts:
              - mountPath: /software
                name: software
            initContainers:
            - args:
              - mkdir -p /root/logs/launcher && ./dnswaiter
              command:
              - /bin/bash
              - -c
              image: goodimage
              imagePullPolicy: Always
              name: wait-dns
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: 100m
                  memory: 500Mi
              volumeMounts:
              # mounts
              workingDir: /root
            restartPolicy: Never
            volumes:
            # some volumes
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
            name: no-priority-mpijob
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: is-vmi
                      operator: In
                      values:
                      - "true"
                  topologyKey: kubernetes.io/hostname
            containers:
            - image: goodimage
              name: no-priority-mpijob
              resources:
                limits:
                  cpu: "50"
                  habana.ai/gaudi: "4" # gpu
                  hugepages-2Mi: 100000Mi
              securityContext:
                privileged: true
              volumeMounts:
              # mounts
              workingDir: /root
            initContainers:
            - args:
              - mkdir -p /root/logs/$HOSTNAME
              command:
              - /bin/bash
              - -c
              env:
              - name: DRIVER_WITH_NETWORK
                value: "false"
              image: goodimage
              imagePullPolicy: IfNotPresent
              name: prepare-node
              resources:
                limits:
                  cpu: "5"
                  memory: 5Gi
                requests:
                  cpu: "5"
                  memory: 5Gi
              securityContext:
                privileged: false
              volumeMounts:
              # mounts
              workingDir: /root
            schedulerName: volcano
            volumes:
            # volumes
    runPolicy:
      backoffLimit: 0
      cleanPodPolicy: All
      ttlSecondsAfterFinished: 300
    slotsPerWorker: 8
  status:
    conditions:
    - lastTransitionTime: "2022-07-07T10:32:17Z"
      lastUpdateTime: "2022-07-07T10:32:17Z"
      message: MPIJob xxx is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker:
        active: 1
    startTime: "2022-07-07T10:32:17Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
/assign @waiterQ
@william-wang: GitHub didn't allow me to assign the following users: waiterQ.
Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
@talcoh2x @snirkop89 I'm also running a test on this bug. Can you provide your scheduler configuration?
Sure. I tried a few variations (found in this issue); both yielded the same result: preemption doesn't occur.
```yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f
```

```yaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  creationTimestamp: "2022-06-29T11:24:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "5116689"
  uid: a60fff7a-6da2-4f7b-922c-da3447fae82f
```
@snirkop89 Hi, Snir. I've taken a look at the bug, and preemption was indeed broken. There are several reasons for that. Firstly, the podgroup for the job with high priority cannot move from `pending` to `inqueue`, so the job has no chance to get resources. You can configure the scheduler as follows to disable the `jobEnqueued` functions.
actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
- name: priority
- name: gang
- name: conformance
- plugins:
- name: overcommit
enableJobEnqueued: false ## disable jobEnqueued function for overcommit plugin
- name: drf
- name: predicates
- name: proportion
enableJobEnqueued: false ## disable jobEnqueued function for proportion plugin
- name: nodeorder
- name: binpack
From what I tested locally, this lets the podgroup with high priority enter the `inqueue` status, but preemption was still not working. I'll give more feedback as soon as the root cause is found.
@Thor-wl Hi, I'm also studying the preemption behavior of Volcano and found the same problem. It seems that the `JobStarvingFn` of the gang plugin forbids preemption for a job where `ji.CheckTaskMinAvailablePipelined()` is false. I found this in the scheduler's log (in my test, the job default/priority-job has a higher priority but is waiting):
I0721 03:40:52.353591 1 job_info.go:773] Job default/priority-job Task default-nginx occupied 0 less than task min avaliable
Then I disabled the `JobStarvingFn` of the gang plugin by setting `enableJobStarving: false`, and preemption worked. So is this a by-design feature or a bug? Why does a false return value of `CheckTaskMinAvailablePipelined` prohibit preemption?
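For reference, the setting mentioned above slots into the scheduler ConfigMap posted earlier in this thread. A minimal sketch, assuming the same plugin layout as the configs above; combining it with the earlier `enableJobEnqueued` workaround is optional:

```yaml
actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: gang
    enableJobStarving: false   # skip the gang plugin's JobStarvingFn during preempt
  - name: conformance
- plugins:
  - name: overcommit
    enableJobEnqueued: false   # see the earlier workaround for the pending -> inqueue issue
  - name: drf
  - name: predicates
  - name: proportion
    enableJobEnqueued: false
  - name: nodeorder
  - name: binpack
```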
Thanks for the feedback. That's what I also found yesterday. IMO, it's not the expected behavior. I'm tracking down which commit introduced this behavior and when.
https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/actions/preempt/preempt.go#L124-L126
Pods with `Preemptable = false` will not be preempted, but it seems that `task.Preemptable` is false by default if we don't set the annotation or label:
https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/api/pod_info.go#L101
@Thor-wl I don't know if this could be the problem. The similar reclaim action may have the problem as well, #2340:
https://github.com/volcano-sh/volcano/blob/1b2630605b3f6669b5db9a77abd12d08f923e6d3/pkg/scheduler/actions/reclaim/reclaim.go#L135-L137
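For what it's worth, the flag read by the linked `pod_info.go` can also be set explicitly on the pods that should be evictable. A minimal sketch, assuming the `volcano.sh/preemptable` annotation that the linked code parses; the job shape follows the YAML posted above:

```yaml
# Sketch: explicitly mark the low-priority workers as preemptable,
# overriding the default discussed above.
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: no-priority-mpijob
spec:
  mpiReplicaSpecs:
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            volcano.sh/preemptable: "true"
        spec:
          # ... worker pod spec as in the job posted above ...
```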
In order to stay compatible with earlier versions, `task.Preemptable` should be `true` by default. I've tracked the commits, and the default value `false` was introduced here:
https://github.com/volcano-sh/volcano/blob/2bb5ac7a7c593da6475e51118cf7a69e117ceafa/pkg/scheduler/api/pod_info.go#L76
@wpeng102 It seems that the TDM plugin introduced this behavior. Let's take a look. Thanks!
That's great to hear. Thank you for the fast response and feedback.
No worries. The fix is under discussion.
Hi, is there an update on this?
@william-wang @Thor-wl
Hi guys, do you have anything new to share? We are really stuck and need help.
Hi, is there anything new to report?
@zhypku Hi, can you share the Volcano configuration that worked for you? I mean for the preemption flow.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
We run a gang-scheduled job with high priority, but we don't see the default-priority jobs being released once there are not enough resources.
Expected: in such cases, the lower-priority jobs get deleted (preempted).
Volcano version 1.6.0, K8s version 1.22/1.21.
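For completeness, the `priorityClassName: high` in the job above refers to a PriorityClass like the following. A minimal sketch: the `value` is an assumed example, and only the name must match the job spec:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high          # must match priorityClassName in the MPIJob pod spec
value: 1000000        # higher value = higher scheduling priority
globalDefault: false  # jobs without a priorityClassName keep the cluster default
description: "High priority class used to preempt the default-priority MPIJobs"
```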