pokerfaceSad / GPUMounter

A Kubernetes plugin that enables dynamically adding or removing GPU resources for a running Pod
Apache License 2.0
118 stars 26 forks

Insufficient GPU on Node: xxx #23

Open lyon-v opened 10 months ago

lyon-v commented 10 months ago

The first mount succeeded, but after unmounting and deploying again I get "Insufficient GPU on Node: yigou-dev-102-46", even though the GPUs are actually idle.

pokerfaceSad commented 10 months ago

Run kubectl describe node and check whether the GPU resources are actually free.

lyon-v commented 10 months ago

1. Master log:

[root@yigou-dev-102-45 examples]# kubectl logs gpu-mounter-master-bc547448d-t5nkl -n kube-system
2023-11-07T14:32:53.960Z INFO GPUMounter-master/main.go:239 Start gpu mounter master on :8080
2023-11-08T01:04:59.791Z INFO GPUMounter-master/main.go:25 access add gpu service
2023-11-08T01:04:59.791Z INFO GPUMounter-master/main.go:30 Pod: gpu-pod-1 Namespace: default GPU Num: 1 Is entire mount: false
2023-11-08T01:04:59.812Z INFO GPUMounter-master/main.go:66 Found Pod: gpu-pod-1 in Namespace: default on Node: yigou-dev-102-46
2023-11-08T01:04:59.822Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-s8j2f Node: yigou-dev-102-46
2023-11-08T01:04:59.925Z ERROR GPUMounter-master/main.go:109 Insufficient GPU on Node: yigou-dev-102-46

2. Worker log:

[root@yigou-dev-102-45 elastic-jupyter]# kubectl logs gpu-mounter-workers-s8j2f -n kube-system
2023-11-07T14:30:26.648Z INFO GPUMounter-worker/main.go:15 Service Starting...
2023-11-07T14:30:26.648Z INFO gpu-mount/server.go:22 Creating gpu mounter
2023-11-07T14:30:26.648Z INFO allocator/allocator.go:28 Creating gpu allocator
2023-11-07T14:30:26.648Z INFO collector/collector.go:24 Creating gpu collector
2023-11-07T14:30:26.648Z INFO collector/collector.go:42 Start get gpu info
2023-11-07T14:30:26.652Z INFO collector/collector.go:53 GPU Num: 2
2023-11-07T14:30:26.664Z INFO collector/collector.go:91 Updating GPU status
2023-11-07T14:30:26.667Z INFO collector/collector.go:136 GPU status update successfully
2023-11-07T14:30:26.667Z INFO collector/collector.go:36 Successfully update gpu status
2023-11-07T14:30:26.667Z INFO allocator/allocator.go:35 Successfully created gpu collector
2023-11-07T14:30:26.667Z INFO gpu-mount/server.go:29 Successfully created gpu allocator
2023-11-07T14:30:26.667Z INFO GPUMounter-worker/main.go:22 Successfully created gpu mounter
2023-11-08T01:04:59.825Z INFO gpu-mount/server.go:35 AddGPU Service Called
2023-11-08T01:04:59.825Z INFO gpu-mount/server.go:36 request: pod_name:"gpu-pod-1" namespace:"default" gpu_num:1
2023-11-08T01:04:59.848Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2023-11-08T01:04:59.848Z INFO allocator/allocator.go:159 Get pod default/gpu-pod-1 mount type
2023-11-08T01:04:59.848Z INFO collector/collector.go:91 Updating GPU status
2023-11-08T01:04:59.851Z INFO collector/collector.go:136 GPU status update successfully
2023-11-08T01:04:59.880Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: gpu-pod-1-slave-pod-1d3148 for Owner Pod: gpu-pod-1
2023-11-08T01:04:59.880Z INFO allocator/allocator.go:238 Checking Pods: gpu-pod-1-slave-pod-1d3148 state
2023-11-08T01:04:59.882Z INFO allocator/allocator.go:264 Pod: gpu-pod-1-slave-pod-1d3148 creating
2023-11-08T01:04:59.886Z INFO allocator/allocator.go:264 Pod: gpu-pod-1-slave-pod-1d3148 creating
2023-11-08T01:04:59.917Z INFO allocator/allocator.go:268 No enough gpu for Pod: gpu-pod-1-slave-pod-1d3148
2023-11-08T01:04:59.925Z ERROR gpu-mount/server.go:70 Insufficient gpu for Pod: gpu-pod-1 Namespace: default

3. Node resources:

[root@yigou-dev-102-45 yamls]# kubectl describe node yigou-dev-102-46
Name:               yigou-dev-102-46
Roles:              cpu,training
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    fluid.io/dataset-num=1
                    fluid.io/f-fluid-fluiddataset=true
                    fluid.io/s-alluxio-fluid-fluiddataset=true
                    fluid.io/s-fluid-fluiddataset=true
                    fluid.io/s-h-alluxio-d-fluid-fluiddataset=5GiB
                    fluid.io/s-h-alluxio-t-fluid-fluiddataset=5GiB
                    gpu-mounter-enable=enable
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=yigou-dev-102-46
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/cpu=true
                    node-role.kubernetes.io/training=true
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"yigou-dev-102-46","fuse.csi.fluid.io":"yigou-dev-102-46"}
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/cri-dockerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.0.102.46/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 28 Aug 2023 17:11:40 +0800
Taints:
Unschedulable:      false
Lease:
  HolderIdentity:  yigou-dev-102-46
  AcquireTime:
  RenewTime:       Wed, 08 Nov 2023 10:58:29 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 07 Nov 2023 18:13:30 +0800   Tue, 07 Nov 2023 18:13:30 +0800   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 08 Nov 2023 10:58:23 +0800   Fri, 03 Nov 2023 10:39:46 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 08 Nov 2023 10:58:23 +0800   Fri, 03 Nov 2023 10:39:46 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 08 Nov 2023 10:58:23 +0800   Fri, 03 Nov 2023 10:39:46 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 08 Nov 2023 10:58:23 +0800   Tue, 07 Nov 2023 18:13:01 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.0.102.46
  Hostname:    yigou-dev-102-46
Capacity:
  cpu:                         32
  ephemeral-storage:           1018975Mi
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      65801852Ki
  nvidia.com/gpu:              0
  nvidia.com/nvidia-rtx-3090:  2
  pods:                        110
Allocatable:
  cpu:                         32
  ephemeral-storage:           961625455048
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      65699452Ki
  nvidia.com/gpu:              0
  nvidia.com/nvidia-rtx-3090:  2
  pods:                        110
System Info:
  Machine ID:                 2a5da2a3fe9d480f97ca66b4d8f4287b
  System UUID:                32B53042-47EB-9349-E452-0B470FA25211
  Boot ID:                    87d25649-faa9-45da-a0ce-5b4b102fd56e
  Kernel Version:             3.10.0-1160.99.1.el7.x86_64
  OS Image:                   CentOS Linux 7 (Core)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://24.0.6
  Kubelet Version:            v0.0.0-master+5244794d27b4cc68290bc496b00e248857ac8b47
  Kube-Proxy Version:         v0.0.0-master+5244794d27b4cc68290bc496b00e248857ac8b47
PodCIDR:                      100.64.1.0/24
PodCIDRs:                     100.64.1.0/24
Non-terminated Pods:          (30 in total)
  Namespace                        Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                        ----                                                          ------------  ----------  ---------------  -------------  ---
  calico-system                    calico-node-gfzjw                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         28d
  calico-system                    calico-typha-856c6c9c4c-bzzbn                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  calico-system                    csi-node-driver-vlxt9                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  default                          gpu-pod-1                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         42s
  elastic-jupyter-operator-system  elastic-jupyter-operator-controller-manager-5d559bbbb8-77frj  100m (0%)     100m (0%)   20Mi (0%)        30Mi (0%)      23h
  fluid-system                     alluxioruntime-controller-5b4fd8d788-56c4k                    100m (0%)     100m (0%)   200Mi (0%)       1536Mi (2%)    49d
  fluid-system                     csi-nodeplugin-fluid-4dw44                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         15d
  fluid-system                     dataset-controller-665ff849b7-cnrh9                           100m (0%)     100m (0%)   200Mi (0%)       1536Mi (2%)    49d
  fluid-system                     dataset-controller-665ff849b7-xrvm7                           100m (0%)     100m (0%)   200Mi (0%)       1536Mi (2%)    29d
  fluid-system                     fluid-webhook-8689694b95-jsn49                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         49d
  fluid-system                     fluidapp-controller-698b685d4f-z84r6                          100m (0%)     100m (0%)   200Mi (0%)       1536Mi (2%)    49d
  fluid-system                     thinruntime-controller-674bb4784b-qvcks                       100m (0%)     100m (0%)   200Mi (0%)       1536Mi (2%)    49d
  fluid                            fluiddataset-fuse-264x4                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         16d
  fluid                            fluiddataset-master-0                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d
  fluid                            fluiddataset-worker-0                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d
  heros-controllers-system         hero-controllers-controller-manager-cbdb77cf6-l7bjl           15m (0%)      1 (3%)      128Mi (0%)       256Mi (0%)     38h
  heros-system                     file-proxy-5b5f76cf8d-xnj72                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
  kube-system                      gpu-mounter-master-bc547448d-x77th                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         42m
  kube-system                      gpu-mounter-workers-7lf7k                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         42m
  kube-system                      kube-proxy-rh2vt                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         28d
  kube-system                      kube-sealos-lvscare-yigou-dev-102-46                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  kube-system                      tigera-operator-66fd59dc66-tn24j                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         71d
  monitoring                       alertmanager-main-0                                           104m (0%)     200m (0%)   150Mi (0%)       150Mi (0%)     5d
  monitoring                       cadvisor-8cwmf                                                400m (1%)     800m (2%)   400Mi (0%)       2000Mi (3%)    12d
  monitoring                       dcgm-exporter-7rxkt                                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
  monitoring                       loki-stack-0                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d
  monitoring                       loki-stack-promtail-x9zwg                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         16d
  monitoring                       node-exporter-ppdvb                                           112m (0%)     270m (0%)   200Mi (0%)       220Mi (0%)     63d
  monitoring                       prometheus-k8s-0                                              100m (0%)     100m (0%)   450Mi (0%)       50Mi (0%)      5d
  nvidia-device-plugin             nvdp-nvidia-device-plugin-r4zjr                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         1331m (4%)   2970m (9%)
  memory                      2348Mi (3%)  10386Mi (16%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  nvidia.com/gpu              0            0
  nvidia.com/nvidia-rtx-3090  0            0
Events:
  Type    Reason          Age   From             Message
  ----    ------          ----  ----             -------
  Normal  RegisteredNode  32m   node-controller  Node yigou-dev-102-46 event: Registered Node yigou-dev-102-46 in Controller
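
The Allocatable section of the node description above lists two NVIDIA resource names with different counts. A minimal sketch for pulling those entries out of saved `kubectl describe node` text is shown below, assuming the usual indented layout of that command's output. The function name and the embedded excerpt are illustrative only, and the sketch reports what the node advertises without saying which resource name the slave Pod requests, since the thread does not state that.

```python
import re

# Excerpt of the Capacity/Allocatable sections quoted in this comment,
# written in the usual indented `kubectl describe node` layout (assumed).
DESCRIBE_EXCERPT = """\
Capacity:
  cpu:                         32
  nvidia.com/gpu:              0
  nvidia.com/nvidia-rtx-3090:  2
  pods:                        110
Allocatable:
  cpu:                         32
  nvidia.com/gpu:              0
  nvidia.com/nvidia-rtx-3090:  2
  pods:                        110
System Info:
  Machine ID:                  2a5da2a3fe9d480f97ca66b4d8f4287b
"""


def allocatable_nvidia_resources(describe_text):
    """Return {resource_name: count} for nvidia.com/* entries under 'Allocatable:'.

    Hypothetical helper for illustration; it is not part of GPUMounter.
    """
    resources = {}
    in_section = False
    for line in describe_text.splitlines():
        if not line.startswith(" "):
            # A non-indented line starts a new top-level section.
            in_section = line.strip() == "Allocatable:"
            continue
        if in_section:
            match = re.match(r"\s+(nvidia\.com/[\w.-]+):\s+(\d+)\s*$", line)
            if match:
                resources[match.group(1)] = int(match.group(2))
    return resources


if __name__ == "__main__":
    for name, count in allocatable_nvidia_resources(DESCRIBE_EXCERPT).items():
        print(f"{name}: {count}")
```

The same function can be pointed at the full output, for example by reading it from a file saved with `kubectl describe node yigou-dev-102-46 > node.txt`.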

lyon-v commented 10 months ago

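I may have misspoken: the first attempt did not mount a GPU either, although the GPUs are idle. The output above is the master and worker logs plus the details for node 46. There is no slave pod under gpu-pool. The Kubernetes version is:

[root@yigou-dev-102-45 ~]# kubelet --version
Kubernetes v1.25.13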

pokerfaceSad commented 10 months ago

There is a known issue on Kubernetes 1.20+: an ownerReference is not allowed to cross namespaces, so creating the slave Pod fails.

See https://github.com/pokerfaceSad/GPUMounter/issues/19#issuecomment-1034134013
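
The scoping rule behind this can be sketched as a small check. Kubernetes resolves a namespaced owner in the dependent's own namespace, so an owner in a different namespace is treated as absent. The function below is a hypothetical illustration, not GPUMounter or Kubernetes code, and the example values assume the slave Pod is created in the gpu-pool namespace mentioned earlier in this thread while the owner Pod gpu-pod-1 lives in default.

```python
from typing import Optional


def owner_reference_resolvable(owner_namespace: Optional[str],
                               dependent_namespace: Optional[str]) -> bool:
    """Apply the Kubernetes scoping rule for ownerReferences.

    Pass None for a cluster-scoped object. Hypothetical helper for illustration.
    """
    if owner_namespace is None:
        # Cluster-scoped owners may own cluster-scoped or namespaced dependents.
        return True
    if dependent_namespace is None:
        # A cluster-scoped dependent cannot have a namespaced owner.
        return False
    # A namespaced owner must live in the same namespace as its dependent.
    return owner_namespace == dependent_namespace


if __name__ == "__main__":
    # Owner Pod in "default", slave Pod assumed to be in "gpu-pool".
    print(owner_reference_resolvable("default", "gpu-pool"))
    # Owner and dependent in the same namespace.
    print(owner_reference_resolvable("default", "default"))
```

Under that assumption the first call reports the reference as not resolvable, which matches the failure mode described in the comment above.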

lyon-v commented 10 months ago

[root@yigou-dev-102-45 ~]# cat /etc/docker/daemon.json
{
  "data-root": "/var/lib/docker",
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "insecure-registries": [
    "registry.bitahub.com:5000",
    "registry.hub.com:5000",
    "docker-user.cambricon.com:30080",
    "10.10.8.100:5000",
    "10.11.3.8:5000",
    "112.31.12.176:5000",
    "10.12.4.35:5000",
    "10.0.0.12:5000"
  ],
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-file": "3",
    "max-size": "10m"
  },
  "max-concurrent-downloads": 20,
  "registry-mirrors": [
    "https://reg-mirror.qiniu.com/",
    "https://pqs5j944.mirror.aliyuncs.com",
    "https://7bezldxe.mirror.aliyuncs.com/",
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn/"
  ],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}

Above is my /etc/docker/daemon.json. I will try changing an environment variable.
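
As a sanity check on a file like the one above, the sketch below confirms that the configured default runtime is also registered under `runtimes`. The function name and the trimmed excerpt are illustrative assumptions. It inspects the JSON only; it does not verify that Docker was restarted or that `nvidia-container-runtime` is installed.

```python
import json

# Trimmed excerpt of the daemon.json posted above (runtime-related keys only).
DAEMON_JSON_EXCERPT = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
"""


def default_runtime_is_registered(config_text, expected="nvidia"):
    """Return True if `default-runtime` equals `expected` and is defined in `runtimes`.

    Hypothetical helper for illustration; it is not part of Docker or GPUMounter.
    """
    config = json.loads(config_text)
    default = config.get("default-runtime")
    return default == expected and expected in config.get("runtimes", {})


if __name__ == "__main__":
    print(default_runtime_is_registered(DAEMON_JSON_EXCERPT))
```

To check the real file, read it first, for example `default_runtime_is_registered(open("/etc/docker/daemon.json").read())`.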