leptonai / gpud


Running gpud as a daemonset/pod in an EKS cluster #103

Open · chatter92 opened 1 week ago

chatter92 commented 1 week ago

Hi,

I am trying to run gpud as a privileged pod in an EKS cluster by building a Docker image for it. Here is my Dockerfile:

FROM nvidia/cuda:12.6.1-cudnn-devel-ubuntu24.04

# For GPU visibility
ENV NVIDIA_VISIBLE_DEVICES=all

RUN apt-get update && apt-get install -y curl

RUN mkdir /tmp/test-gpud
RUN curl -L https://github.com/leptonai/gpud/releases/download/v0.0.4/gpud_v0.0.4_linux_amd64_ubuntu24.04.tgz > /tmp/gpud_v0.0.4_linux_amd64_ubuntu24.04.tgz
RUN tar xzf /tmp/gpud_v0.0.4_linux_amd64_ubuntu24.04.tgz -C /tmp/test-gpud
RUN cp -f /tmp/test-gpud/gpud /usr/sbin

EXPOSE 15132

CMD /usr/sbin/gpud run

The pod gets created successfully and I can see gpud running in the pod when I exec into it. However, it starts erroring out after some time and keeps restarting because of a panic in the poller.

Here are the error logs I retrieved from a failed pod:

{"level":"info","ts":"2024-10-08T12:50:03Z","caller":"config/default.go:186","msg":"auto-detected clock events not supported -- skipping","error":null}
{"level":"info","ts":"2024-10-08T12:50:03Z","caller":"config/default.go:212","msg":"auto-detected gpm not supported -- skipping","error":null}
{"level":"info","ts":"2024-10-08T12:50:03Z","caller":"command/run.go:84","msg":"starting gpud v0.0.4"}
{"level":"info","ts":"2024-10-08T12:50:03Z","caller":"server/server.go:131","msg":"api version","version":"v1"}
{"level":"warn","ts":"2024-10-08T12:50:03Z","caller":"nvml/nvml.go:326","msg":"gpm metrics not supported"}
2024/10/08 12:50:03 Waiting for /var/log/fabricmanager.log to appear...
{"level":"info","ts":"2024-10-08T12:50:55Z","caller":"server/server.go:1008","msg":"serving 0.0.0.0:15132"}
{"level":"info","ts":"2024-10-08T12:50:55Z","caller":"command/run.go:105","msg":"successfully booted","tookSeconds":52.100623932}

✔ serving https://0.0.0.0:15132

{"level":"info","ts":"2024-10-08T12:51:03Z","caller":"server/server.go:712","msg":"components status","inflight_components":23,"evaluated_healthy_states":0,"evaluated_unhealthy_states":0,"data_collect_success":6,"data_collect_failed":0}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb6fdea]

goroutine 141 [running]:
github.com/leptonai/gpud/pkg/systemd.(*DbusConn).IsActive(0x19c9748?, {0x19c97b8?, 0xc00052a5b0?}, {0x1764d8d?, 0xc001125a58?})
  /home/runner/work/gpud/gpud/pkg/systemd/dbus.go:34 +0x2a
github.com/leptonai/gpud/components/accelerator/nvidia/query.CheckFabricManagerActive(...)
  /home/runner/work/gpud/gpud/components/accelerator/nvidia/query/nv_fabricmanager.go:33
github.com/leptonai/gpud/components/accelerator/nvidia/query.Get({0x19c9748, 0xc000699220})
  /home/runner/work/gpud/gpud/components/accelerator/nvidia/query/query.go:96 +0x646
github.com/leptonai/gpud/components/query.pollLoops({0x19c9748, 0xc000699220}, {0x1764fbd, 0x14}, 0xc0003347e0, 0xdf8475800, 0x183d5c0)
  /home/runner/work/gpud/gpud/components/query/poller.go:124 +0x1f4
created by github.com/leptonai/gpud/components/query.startPoll in goroutine 1
  /home/runner/work/gpud/gpud/components/query/poller.go:97 +0xd7
gyuho commented 1 week ago

@chatter92 Thanks for the report!

https://github.com/leptonai/gpud/pull/104 should fix the nil pointer panic (we will cut a release shortly), but gpud requires systemd in order to monitor the fabric manager. It looks like you are running inside a container without access to the host's systemd, which won't work.

chatter92 commented 1 week ago

Ok. I was able to get it running as a pod by mounting the host's /run/systemd/system and /var/run/dbus/system_bus_socket into the container.
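
For reference, the relevant fragment of my pod spec looks roughly like this (the image and volume names are placeholders I picked; the two hostPath mounts are the part that matters):

spec:
  containers:
    - name: gpud
      image: <gpud-image-built-from-the-dockerfile-above>   # placeholder
      securityContext:
        privileged: true
      volumeMounts:
        - name: systemd-run
          mountPath: /run/systemd/system
        - name: dbus-socket
          mountPath: /var/run/dbus/system_bus_socket
  volumes:
    - name: systemd-run
      hostPath:
        path: /run/systemd/system
    - name: dbus-socket
      hostPath:
        path: /var/run/dbus/system_bus_socket
        type: Socket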

I get the following results when I run a scan from within a container:

# gpud scan

⌛ scanning the host

⌛ scanning nvidia accelerators
{"level":"warn","ts":"2024-10-09T07:28:06Z","caller":"nvml/nvml.go:326","msg":"gpm metrics not supported"}
✔ successfully checked nvidia-smi
✔ product name: NVIDIA A100-SXM4-40GB (nvidia-smi)
✔ scanned nvidia-smi -- found no error
✔ scanned nvidia-smi -- found no hardware slowdown error
✔ successfully checked fabric manager
✘ lsmod peermem check failed with 1 error(s)
command not found: "sudo"
✔ successfully checked nvml
✔ name: NVIDIA A100-SXM4-40GB (NVML)

##################
NVML scan results for GPU-18ce9bfc-5ee5-b777-6d4c-5c999445b9ec

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-48ca08e2-a532-3abf-f123-441600bcb0da

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-9bc33c1d-9c69-60e3-2f3c-794df2341623

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-b0410e8c-5fbb-01e8-6924-e89e1b60ad99

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-b0841e0d-210d-fb8e-c8bc-8631631ebd06

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-b85f0b9a-9408-4142-20f3-f94cef29e8e1

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-c39c3031-cf63-68bf-e421-67e8063e3e59

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

##################
NVML scan results for GPU-e3ba36c4-bd30-d55d-eb10-38462e1a7434

✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process

⌛ scanning dmesg for 5000 lines
✔ scanned dmesg file -- found no issue

✔ scan complete
chatter92 commented 1 week ago

@gyuho and team, do you know if I might run into any other issues if I try to run it as a DaemonSet on all my GPU instances? Are there any other host paths/processes gpud needs access to?

gyuho commented 1 week ago

> I get the following results when I run a scan from within a container:

Looks good!

> run it as a DaemonSet on all my GPU instances

So far, systemd is the only hard dependency (it is required to check the fabric manager, among other things).

> lsmod peermem check failed with 1 error(s)

This check is only required if you have enabled the peermem module. If you don't use InfiniBand in your infra, you can ignore it for now.
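
In case it's useful for the DaemonSet setup, here is a rough sketch of what it could look like (the image and node selector label are placeholders; the volume mounts are the same two hostPath mounts from your pod spec above):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpud
spec:
  selector:
    matchLabels:
      app: gpud
  template:
    metadata:
      labels:
        app: gpud
    spec:
      nodeSelector:
        node-type: gpu               # placeholder -- use whatever label marks your GPU nodes
      containers:
        - name: gpud
          image: <your-gpud-image>   # placeholder
          securityContext:
            privileged: true
          ports:
            - containerPort: 15132
          # volumeMounts: the /run/systemd/system and /var/run/dbus/system_bus_socket
          # hostPath mounts, same as in your pod spec above
      # volumes: the matching hostPath entries for those two paths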

xiang90 commented 2 days ago

@chatter92 Let us know how it goes! Running on Kubernetes as a DaemonSet is an important use case we want to support.