tkestack / gpu-manager


empty pids goroutine 1 [running] #55

Open xuguangzhao opened 3 years ago

xuguangzhao commented 3 years ago

(screenshot of the error: "empty pids", goroutine 1 [running])

mYmNeo commented 3 years ago

What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.

xuguangzhao commented 3 years ago

I use v1.0.4. Which version should I use to get the fix?

What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.

mYmNeo commented 3 years ago

Upgrade to v1.1.2

phoenixwu0229 commented 3 years ago

Upgrade to v1.1.2

I use this version, but the problem still exists.

mYmNeo commented 3 years ago

Upgrade to v1.1.2

I use this version, but the problem still exists.

Is there any log showing "Read from"?

mqyang56 commented 3 years ago

Docker Server Version: 19.03.8

cgroupfs: /sys/fs/cgroup/memory/kubepods/burstable/pod3ac4a444-6254-4b32-bc26-bd08c9c72fbb/2b8ed585766f39bca9120b9725e7d47d607218993ab8209d7086c5064e81986d

systemd: /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0dffdd18_155c_4f16_a5cf_3e615a07c264.slice/docker-3a9b4e354a7e35b9c7a25dcb222c19dfed5fb9e00d97c7d11bd21f9ee753f865.scope

Need:

    attempts := []string{
        filepath.Join(cgroupRoot, cgroupThis, id, "tasks"),
        // With more recent lxc versions use, cgroup will be in lxc/
        filepath.Join(cgroupRoot, cgroupThis, "lxc", id, "tasks"),
        // With more recent docker, cgroup will be in docker/
        filepath.Join(cgroupRoot, cgroupThis, "docker", id, "tasks"),
        // Even more recent docker versions under systemd use docker-<id>.scope/
        filepath.Join(cgroupRoot, "system.slice", "docker-"+id+".scope", "tasks"),
        // Even more recent docker versions under cgroup/systemd/docker/<id>/
        filepath.Join(cgroupRoot, "..", "systemd", "docker", id, "tasks"),
        // Kubernetes with docker and CNI is even more different
        filepath.Join(cgroupRoot, "..", "systemd", "kubepods", "*", "pod*", id, "tasks"),
        // Another flavor of containers location in recent kubernetes 1.11+
        filepath.Join(cgroupRoot, cgroupThis, "kubepods.slice", "kubepods-besteffort.slice", "*", "docker-"+id+".scope", "tasks"),
        // When runs inside of a container with recent kubernetes 1.11+
        filepath.Join(cgroupRoot, "kubepods.slice", "kubepods-besteffort.slice", "*", "docker-"+id+".scope", "tasks"),
    }
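
For illustration only, here is a minimal, self-contained sketch that rebuilds the two sample paths above. The helper names are hypothetical and this is not gpu-manager's actual code; it just shows where the cgroupfs and systemd layouts diverge (the extra kubepods-<qos>-pod<uid>.slice level, the underscores in the pod UID, and the docker-<id>.scope leaf):

    package main

    import (
        "fmt"
        "path/filepath"
        "strings"
    )

    // cgroupfsMemoryPath builds the memory cgroup path as the cgroupfs driver lays it out
    // (hypothetical helper, for illustration only).
    func cgroupfsMemoryPath(qosClass, podUID, containerID string) string {
        return filepath.Join("/sys/fs/cgroup/memory/kubepods", qosClass, "pod"+podUID, containerID)
    }

    // systemdMemoryPath builds the layout the systemd driver uses: an extra
    // kubepods-<qos>-pod<uid>.slice level, hyphens in the pod UID replaced by
    // underscores, and a docker-<id>.scope leaf (hypothetical helper, for illustration only).
    func systemdMemoryPath(qosClass, podUID, containerID string) string {
        uid := strings.ReplaceAll(podUID, "-", "_")
        return filepath.Join(
            "/sys/fs/cgroup/memory/kubepods.slice",
            fmt.Sprintf("kubepods-%s.slice", qosClass),
            fmt.Sprintf("kubepods-%s-pod%s.slice", qosClass, uid),
            fmt.Sprintf("docker-%s.scope", containerID),
        )
    }

    func main() {
        // The burstable pod from the cgroupfs example above.
        fmt.Println(cgroupfsMemoryPath("burstable",
            "3ac4a444-6254-4b32-bc26-bd08c9c72fbb",
            "2b8ed585766f39bca9120b9725e7d47d607218993ab8209d7086c5064e81986d"))
        // The besteffort pod from the systemd example above.
        fmt.Println(systemdMemoryPath("besteffort",
            "0dffdd18-155c-4f16-a5cf-3e615a07c264",
            "3a9b4e354a7e35b9c7a25dcb222c19dfed5fb9e00d97c7d11bd21f9ee753f865"))
    }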

mYmNeo commented 3 years ago

If your cgroup driver is systemd, you need to add a flag to gpu-manager.
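
For reference, a minimal sketch of what that could look like in the gpu-manager DaemonSet manifest. The flag name --cgroup-driver=systemd below is an assumption, not confirmed by this thread; check the README and the flags supported by your gpu-manager version before using it:

    # Hypothetical excerpt from gpu-manager.yaml; only the container args are shown.
    # The --cgroup-driver flag name is an assumption -- verify it for your version.
    containers:
      - name: gpu-manager
        image: tkestack/gpu-manager:v1.1.2
        args:
          # ...existing gpu-manager flags...
          - --cgroup-driver=systemd   # match the cgroup driver the kubelet uses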

phoenixwu0229 commented 3 years ago

If your cgroup driver is systemd, you need to add a flag to gpu-manager.

Thanks, it works.

But I have another question:

In the Alibaba GPU-share solution, nvidia-smi shows the GPU memory requested in the pod's resource request, but with gpu-manager I see the GPU's full memory inside the pod. Is it working correctly?

pod.yaml

      resources:
        limits:
          tencent.com/vcuda-core: "10"
          tencent.com/vcuda-memory: "10"
          memory: "40G"
          cpu: "12"
        requests:
          tencent.com/vcuda-core: "10"
          tencent.com/vcuda-memory: "10"
          memory: "40G"
          cpu: "12"

    ➜ gpu-manager git:(master) ✗ kubectl -n hpc-dlc exec -it container-tf-wutong6-7fd85bb484-9m8c4 bash
    root@host10307846:/notebooks# nvidia-smi
    Thu Jan 21 19:27:48 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    |   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
    | N/A   38C    P8    11W /  70W |      0MiB / 15079MiB |      0%     Default  |

mYmNeo commented 3 years ago

If your cgroup driver is systemd, you need to add a flag to gpu-manager.

Thanks, it works.

But I have another question:

In the Alibaba GPU-share solution, nvidia-smi shows the GPU memory requested in the pod's resource request, but with gpu-manager I see the GPU's full memory inside the pod. Is it working correctly?


Alibaba's solution modified the kernel, which means you have to use their kernel rather than the official one.

zxt620 commented 3 years ago

If your cgroup driver is systemd, you need to add a flag to gpu-manager.

How do I add that flag?

ZeoSophia commented 3 years ago

You can see how to do it in the README: add some parameters in gpu-manager.yaml 😊


yu7508 commented 2 years ago

Upgrade to v1.1.2

I use this version, but the problem still exists.

I am using v1.1.2 and have the same problem. Have you solved it?