chatter92 opened this issue 1 week ago
@chatter92 Thanks for the report!
https://github.com/leptonai/gpud/pull/104 should fix the nil pointer panic (will do the release shortly), but gpud requires systemd to monitor the fabric manager. It looks like you are running inside a container without access to the host's systemd, so this won't work.
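A quick way to check, from inside the container, whether the host's systemd is visible (a hedged sketch for debugging, not something gpud itself provides):

```shell
# Hypothetical check: /run/systemd/system only exists inside the
# container if it was mounted in from the host, so its presence is a
# rough proxy for "host systemd is reachable".
if [ -d /run/systemd/system ]; then
  echo "host systemd appears reachable"
else
  echo "no systemd access -- fabric manager monitoring will fail"
fi
```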
OK, I was able to get it running as a pod by mounting the host's /run/systemd/system and /var/run/dbus/system_bus_socket.
I get the following results when I run a scan from within a container:
# gpud scan
⌛ scanning the host
⌛ scanning nvidia accelerators
{"level":"warn","ts":"2024-10-09T07:28:06Z","caller":"nvml/nvml.go:326","msg":"gpm metrics not supported"}
✔ successfully checked nvidia-smi
✔ product name: NVIDIA A100-SXM4-40GB (nvidia-smi)
✔ scanned nvidia-smi -- found no error
✔ scanned nvidia-smi -- found no hardware slowdown error
✔ successfully checked fabric manager
✘ lsmod peermem check failed with 1 error(s)
command not found: "sudo"
✔ successfully checked nvml
✔ name: NVIDIA A100-SXM4-40GB (NVML)
##################
NVML scan results for GPU-18ce9bfc-5ee5-b777-6d4c-5c999445b9ec
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-48ca08e2-a532-3abf-f123-441600bcb0da
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-9bc33c1d-9c69-60e3-2f3c-794df2341623
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-b0410e8c-5fbb-01e8-6924-e89e1b60ad99
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-b0841e0d-210d-fb8e-c8bc-8631631ebd06
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-b85f0b9a-9408-4142-20f3-f94cef29e8e1
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-c39c3031-cf63-68bf-e421-67e8063e3e59
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
##################
NVML scan results for GPU-e3ba36c4-bd30-d55d-eb10-38462e1a7434
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
⌛ scanning dmesg for 5000 lines
✔ scanned dmesg file -- found no issue
✔ scan complete
@gyuho and team, do you know if I may run into any other issues if I try to run it as a daemonset on all my gpu instances? Are there any other host paths/processes gpud needs access to?
> I get the following results when I run a scan from within a container:
Looks good!
> run it as a daemonset on all my gpu instances
So far, systemd is the only hard dependency (required to check the fabric manager, among other components).
> lsmod peermem check failed with 1 error(s)
This check is only relevant when the peermem module is enabled. If you don't use InfiniBand in your infrastructure, you can ignore it for now.
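For reference, the module check can be reproduced by hand; `lsmod` just formats `/proc/modules`, which is world-readable, so a rough sudo-free equivalent (a sketch, not gpud's exact command) is:

```shell
# Read-only check for the peermem module; no sudo needed since
# /proc/modules is world-readable. Prints a fallback message if the
# module is not loaded.
grep -i peermem /proc/modules || echo "peermem module not loaded"
```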
@chatter92 Let us know how it goes! Running on k8s as a daemonset is an important use case we want to support.
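For the DaemonSet question above, a minimal sketch of the manifest shape might look like the following. This is a hypothetical example, not an official gpud manifest: the image name and labels are placeholders, and the two hostPath mounts are the ones reported to work earlier in this thread.

```yaml
# Hypothetical sketch only -- image and labels are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpud
spec:
  selector:
    matchLabels:
      app: gpud
  template:
    metadata:
      labels:
        app: gpud
    spec:
      containers:
      - name: gpud
        image: your-registry/gpud:latest   # placeholder image
        securityContext:
          privileged: true
        volumeMounts:
        - name: systemd
          mountPath: /run/systemd/system
        - name: dbus
          mountPath: /var/run/dbus/system_bus_socket
      volumes:
      - name: systemd
        hostPath:
          path: /run/systemd/system
      - name: dbus
        hostPath:
          path: /var/run/dbus/system_bus_socket
```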
Hi,
I am trying to run gpud as a privileged pod in an EKS cluster by building a Docker image for it. Here is my Docker image:
The pod is created successfully, and I can see gpud running when I exec into it. However, it starts erroring out after some time and keeps restarting because of a panic in the poller.
Here are the error logs I retrieved from a failed pod: