Closed · oldthreefeng closed this issue 10 months ago
Would it be convenient for you to provide the following information so that we can locate the problem? We would be very grateful.
$ k get ds volcano-device-plugin -o yaml | neat
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "6"
  name: volcano-device-plugin
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: volcano-device-plugin
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: volcano-device-plugin
    spec:
      containers:
      - args:
        - --gpu-strategy=number
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: volcanosh/volcano-device-plugin:latest
        imagePullPolicy: IfNotPresent
        name: volcano-device-plugin
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - SYS_ADMIN
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /usr/local/vgpu
          name: lib
        - mountPath: /tmp
          name: hosttmp
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia-device-enable: enable
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: volcano-device-plugin
      serviceAccountName: volcano-device-plugin
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: volcano.sh/gpu-memory
        operator: Exists
      - key: volcano.sh/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - hostPath:
          path: /usr/local/vgpu
          type: ""
        name: lib
      - hostPath:
          path: /tmp
          type: ""
        name: hosttmp
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
@wangyang0616
The workload YAML is a vcjob, as edited.
When I change `#volcano.sh/gpu-number: 1` to `nvidia.com/gpu: 1`, the error goes away, but the pod is not scheduled; it stays Pending, even though resources are sufficient.
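For context, the resource-request change described above would look roughly like this in a vcjob spec. This is a hedged sketch: the job name, container name, and image are illustrative assumptions, not taken from the original report.

```yaml
# Hypothetical excerpt of a Volcano Job (vcjob); names and image are illustrative.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-test                 # illustrative name
spec:
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        containers:
        - name: cuda-container   # illustrative name
          image: nvidia/cuda:11.8.0-base-ubuntu22.04
          resources:
            limits:
              # volcano.sh/gpu-number: 1   # original request (volcano-device-plugin)
              nvidia.com/gpu: 1            # switched request (NVIDIA device plugin)
```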
This happens because volcano-device-plugin runs as a DaemonSet, and on some nodes the pod may fail due to a lack of GPU resources. As a temporary workaround, use taints or node affinity to bypass these abnormal nodes.
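One way to apply the suggested workaround is a `nodeAffinity` rule in the workload's pod template that steers scheduling away from a known-bad node. This is only a sketch; the excluded hostname value below is an assumption for illustration.

```yaml
# Hypothetical pod-template excerpt: exclude a problematic node by hostname label.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - 10.122.2.14        # example: the node to bypass
```

Alternatively, taint the abnormal node (`kubectl taint nodes <node> key=value:NoSchedule`) so that only pods with a matching toleration land on it.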
log
There are 3 nodes: 10.122.2.26 and 10.122.2.37 are GPU machines; 10.122.2.14 is a CPU machine. After switching to the nvidia.com/gpu resource, scheduling fails outright. The cause is currently unknown.
It was deployed in July, with the latest image.