volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

[Proposal] Need to refactor the reclaim action #3738

Open JesseStutler opened 2 months ago

JesseStutler commented 2 months ago

Please describe your problem in detail

I'm ready to add preemptionPolicy-related logic to the reclaim action, but one point confuses me: when a task's preemptionPolicy is Never, do I need to push the job and queue back so that other tasks or jobs can continue to reclaim resources?

In the reclaim action, https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/actions/reclaim/reclaim.go#L86-L220, reclaim first pops a queue and a job. But look at the following checks:
https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/actions/reclaim/reclaim.go#L110-L112
https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/actions/reclaim/reclaim.go#L116-L119
https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/actions/reclaim/reclaim.go#L121-L124
https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/actions/reclaim/reclaim.go#L126-L129

If the task fails the allocatable, Preemptive, or PrePredicateFn filter, the queue is never pushed back. It is unclear to me whether other tasks in the same job, or other jobs in the same queue, can still reclaim resources after that. I think that when I implement preemptionPolicy, the job and queue need to be pushed back so that the others can continue reclaiming.

Compare this with the logic in allocate: https://github.com/volcano-sh/volcano/blob/0843c0d33fccdb85439dc7086dd7cea061070901/pkg/scheduler/actions/allocate/allocate.go#L192-L199. At line 192 the task handling is wrapped in a !tasks.Empty() loop, so if a task fails the allocatable filter it is reasonable to continue there and let the other tasks go on to allocate.
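
To make the concern concrete, here is a minimal toy model of the difference between dropping a queue when one task is filtered out and pushing it back. It is only a sketch, not Volcano's code: the Task, Job, and Queue types, the CanReclaim flag, and reclaimRound are all hypothetical names, and a plain FIFO slice stands in for the scheduler's queue and job ordering.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for a pending task (not Volcano's TaskInfo).
type Task struct {
	Name       string
	CanReclaim bool // false models a task rejected by a filter, e.g. preemptionPolicy Never
}

// Job and Queue are equally simplified, hypothetical containers.
type Job struct {
	Name  string
	Tasks []Task
}

type Queue struct {
	Name string
	Jobs []*Job
}

// reclaimRound pops one queue, one job, and one task per iteration.
// With pushBack == false, a filtered-out task drops the whole queue.
// With pushBack == true, the queue is re-queued so that the remaining
// tasks and jobs still get a turn.
func reclaimRound(queues []*Queue, pushBack bool) []string {
	var reclaimed []string
	work := append([]*Queue(nil), queues...) // FIFO work list (assumption)
	for len(work) > 0 {
		q := work[0]
		work = work[1:]
		if len(q.Jobs) == 0 {
			continue // nothing left in this queue
		}
		job := q.Jobs[0]
		if len(job.Tasks) == 0 {
			q.Jobs = q.Jobs[1:] // job is exhausted, move on to the next job
			work = append(work, q)
			continue
		}
		task := job.Tasks[0]
		job.Tasks = job.Tasks[1:]
		if !task.CanReclaim {
			if pushBack {
				work = append(work, q) // let other tasks and jobs continue
			}
			continue
		}
		reclaimed = append(reclaimed, task.Name)
		work = append(work, q)
	}
	return reclaimed
}

// sampleQueues builds fresh data, because reclaimRound consumes its input.
func sampleQueues() []*Queue {
	return []*Queue{{
		Name: "q1",
		Jobs: []*Job{
			{Name: "j1", Tasks: []Task{{"t1", false}, {"t2", true}}},
			{Name: "j2", Tasks: []Task{{"t3", true}}},
		},
	}}
}

func main() {
	fmt.Println("without push-back:", reclaimRound(sampleQueues(), false)) // prints: without push-back: []
	fmt.Println("with push-back:", reclaimRound(sampleQueues(), true))     // prints: with push-back: [t2 t3]
}
```

In this model, a single filtered task (t1) starves t2 and t3 unless the queue is pushed back, which is the behavior questioned above.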


Update (9.20): After discussing with @Monokaix, @hwdef, and @lowang-bh, we think the reclaim action has some problems and needs to be refactored.

Any other relevant information

No response

googs1025 commented 2 months ago

/cc

Monokaix commented 2 months ago

It's a good catch!