yalbaba opened 1 week ago
I think the check here for exceeding the queue's capability uses an incomplete condition: `deserved > capability` should only hold when every resource dimension exceeds the capability.
have you tried v1.9.0?
Not yet; upgrading Volcano may have an impact on our project.
> have you tried v1.9.0?

I wonder why this code was designed this way.
Could someone clarify this?
This block is the original DRF implementation logic; you can refer to the paper Dominant Resource Fairness: Fair Allocation of Multiple Resource Types.
The calculation method is similar to the one described at https://koordinator.sh/zh-Hans/docs/designs/multi-hierarchy-elastic-quota-management
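The core DRF idea referenced above can be sketched as follows; this is a minimal illustration of the dominant-share computation from the paper, not Volcano's actual code, and all names here are made up for the example:

```go
package main

import "fmt"

// dominantShare returns a queue's DRF dominant share: the maximum, over all
// resource dimensions, of (allocated / cluster total). DRF favors the queue
// with the smallest dominant share when deciding who gets resources next.
func dominantShare(allocated, total map[string]float64) float64 {
	share := 0.0
	for name, alloc := range allocated {
		if t := total[name]; t > 0 && alloc/t > share {
			share = alloc / t
		}
	}
	return share
}

func main() {
	total := map[string]float64{"cpu": 2, "memory": 2, "nvidia/gpu": 8}
	q2 := map[string]float64{"cpu": 1, "memory": 1, "nvidia/gpu": 1}
	// cpu and memory are the dominant dimensions here (1/2 each vs 1/8 for gpu).
	fmt.Printf("q2 dominant share: %.2f\n", dominantShare(q2, total)) // prints 0.50
}
```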
Current problem: when a pod submits a resource request to a queue, there are cases where the queue has enough resources but the pod still cannot be scheduled.

Case:

Cluster resources: cpu: 2, memory: 2, ScalarResources "nvidia/gpu": 8

- q1 capability: cpu: 0, memory: 0, "nvidia/gpu": 0, weight: 1
- q2 capability: cpu: 1, memory: 1, "nvidia/gpu": 1, weight: 1
- q3 capability: cpu: 2, memory: 2, "nvidia/gpu": 7, weight: 1

Suppose three pods each request resources from one of the three queues:

- pod1 => q1 requests cpu: 0, memory: 0, "nvidia/gpu": 0
- pod2 => q2 requests cpu: 1, memory: 1, "nvidia/gpu": 1
- pod3 => q3 requests cpu: 2, memory: 2, "nvidia/gpu": 7
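For reference, the first-round deserved numbers in this case follow from splitting each cluster total across the three equal-weight queues. A rough sketch of that arithmetic (`deservedShare` is a hypothetical helper for illustration, not Volcano's code):

```go
package main

import "fmt"

// deservedShare shows the arithmetic behind the first-round numbers: with
// equal-weight queues, each queue's initial deserved share of a resource is
// total * weight / weightSum. This is an illustrative helper, not Volcano code.
func deservedShare(total, weight, weightSum float64) float64 {
	return total * weight / weightSum
}

func main() {
	// Cluster totals from the case above, split across three weight-1 queues.
	fmt.Printf("cpu: %.2f\n", deservedShare(2, 1, 3))        // roughly two thirds of the cluster CPU
	fmt.Printf("nvidia/gpu: %.2f\n", deservedShare(8, 1, 3)) // roughly 2.7 GPUs
}
```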
During the first round of allocation, q1's deserved is cpu: 0.66, memory: 0.66, ScalarResources "nvidia/gpu": 2.66. The function fragment where the problem occurs is:
```go
// problematic code
if attr.capability != nil && !attr.deserved.LessEqualStrict(attr.capability) {
	attr.deserved = helpers.Min(attr.deserved, attr.capability)
	attr.deserved = helpers.Min(attr.deserved, attr.request)
	meet[attr.queueID] = struct{}{}
	klog.V(4).Infof("queue <%s> is meet cause of the capability", attr.name)
}
```
Inside `LessEqualStrict`:

```go
// problematic code
func (r Resource) LessEqualStrict(rr Resource) bool {
	lessFunc := func(l, r float64) bool {
		return l <= r
	}
	// ... (rest of the function elided in the original snippet)
}
```

Since only the ScalarResources dimension exceeded the capability, q1's allocation exited here and it no longer took part in the next round, so q1's actual allocation stayed at cpu: 0.66, memory: 0.66, gpu: 2.66. Then, in the subsequent allocate phase, pod2 cannot start because of insufficient resources.
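As a minimal, self-contained illustration of the reported behavior (simplified types, not Volcano's actual `Resource`), here is the current any-dimension semantics next to the all-dimensions check this issue is asking for; `greaterInAllDimensions` is a hypothetical helper, not existing Volcano code:

```go
package main

import "fmt"

// Resource is a simplified stand-in for Volcano's Resource type.
type Resource struct {
	MilliCPU float64
	Memory   float64
	Scalar   map[string]float64 // stand-in for ScalarResources
}

// lessEqualStrict mirrors the current semantics: it returns true only when
// EVERY dimension of r is <= rr, so !lessEqualStrict(deserved, capability)
// fires as soon as ANY single dimension exceeds the capability.
func lessEqualStrict(r, rr Resource) bool {
	le := func(l, r float64) bool { return l <= r }
	if !le(r.MilliCPU, rr.MilliCPU) || !le(r.Memory, rr.Memory) {
		return false
	}
	for name, v := range r.Scalar {
		if !le(v, rr.Scalar[name]) {
			return false
		}
	}
	return true
}

// greaterInAllDimensions is what the issue proposes (hypothetical): treat the
// queue as "meet" only when EVERY dimension of deserved exceeds capability.
func greaterInAllDimensions(r, rr Resource) bool {
	if r.MilliCPU <= rr.MilliCPU || r.Memory <= rr.Memory {
		return false
	}
	for name, v := range r.Scalar {
		if v <= rr.Scalar[name] {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative numbers: only the gpu dimension exceeds the capability.
	deserved := Resource{MilliCPU: 0.66, Memory: 0.66, Scalar: map[string]float64{"nvidia/gpu": 2.66}}
	capability := Resource{MilliCPU: 1, Memory: 1, Scalar: map[string]float64{"nvidia/gpu": 1}}
	fmt.Println(!lessEqualStrict(deserved, capability))       // true: today the queue exits the loop
	fmt.Println(greaterInAllDimensions(deserved, capability)) // false: under the proposal it keeps participating
}
```

Under the current check the queue is removed from subsequent rounds as soon as one dimension crosses its cap, which is exactly the early exit described above.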