Closed DAB0mB closed 1 year ago

Hello! Kudos for this great image. I've been trying to run a deployment on my Kubernetes cluster with nvidia-glx-desktop. Sometimes it works, but sometimes it doesn't (could it be how things are cached by the cloud provider?). According to the logs, it looks like the entrypoint scripts are waiting for X11 to start and are stuck in a perpetual loop. Why do you think X11 isn't starting, and how can I start it manually?
Hi! First check the https://github.com/ehfd/docker-nvidia-glx-desktop#troubleshooting section. If that doesn't work for you, please post all three logs located in /tmp.
Here are the log files in case of failure, including Xorg.0.log:
entrypoint-fail-stdout.log
pulseaudio-fail-stdout.log
selkies-gstreamer-fail-stdout.log
Xorg.0.fail.log
And here are the log files in case of success:
entrypoint-stdout.log
pulseaudio-stdout.log
selkies-gstreamer-stdout.log
Xorg.0.log
Here's the xorg.conf:
According to `Xorg.0.fail.log`:

Screen(s) found, but none have a usable configuration.
To me it seems like everything is set up correctly, but there may be something wrong with the nodes. I have no access to any of them whatsoever, but I can submit a ticket to the cloud provider if necessary.
@DAB0mB Thanks for your cooperation. It seems that `NVIDIA_DRIVER_CAPABILITIES` is insufficient. Please make sure that it is set to `all`, or includes `utility,graphics,video,display`, by asking your provider. It won't work just by changing it inside the container.
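For reference, here is a minimal sketch of how the variable might be set at the pod level (the pod and container names are placeholders); it has to be set where the NVIDIA container runtime on the host can read it at container creation, not exported inside a running container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: glx-desktop   # hypothetical pod name
spec:
  containers:
    - name: glx-desktop
      image: ghcr.io/ehfd/nvidia-glx-desktop:latest
      env:
        # Must be "all" or include utility,graphics,video,display
        # for GUI workloads to function.
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
```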
If this works, please don't close the issue; I have to update the docs.
If that is not possible, docker-nvidia-egl-desktop is much more likely to work.
Thank you! Will update shortly once they get back to me
Any updates? @DAB0mB Also, I explained in https://github.com/ehfd/docker-nvidia-glx-desktop#the-container-doesnt-work what each `NVIDIA_DRIVER_CAPABILITIES` entry does for the container.
Hey @ehfd, they're trying to look at it. I've been told that `NVIDIA_DRIVER_CAPABILITIES` is set to `all` by default and that it doesn't seem to be the issue. I will update if I have anything.
Note that the NVIDIA A100 in MIG mode does not support any GUI operations: no OpenGL or Vulkan, no X.Org.
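To illustrate (a hedged sketch; the exact resource names depend on your cluster's NVIDIA device plugin and its MIG strategy), a workload that needs a desktop should request a whole GPU rather than a MIG slice:

```yaml
resources:
  limits:
    # A whole GPU exposes graphics capabilities.
    # A MIG slice resource (e.g. nvidia.com/mig-1g.5gb) is
    # compute-only and cannot run X.Org, OpenGL, or Vulkan.
    nvidia.com/gpu: 1
```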
@DAB0mB One more possible cause, is there an X server on the host? In other words, is there a GUI on the host? Or is there more than one X server trying to run for one GPU?
I also updated the container.
@DAB0mB We have identified that the issue on CoreWeave is likely due to the Kubernetes container's shared memory limits, which the shm feature in X11 depends on (the default /dev/shm inside a container is small). Mounting a memory-backed emptyDir at /dev/shm, as in the following, should resolve the issue:
```yaml
volumeMounts:
  - mountPath: /dev/shm
    name: dshm
  - mountPath: /cache
    name: cudaglx-cache-vol
  - mountPath: /home/user
    name: cudaglx-root-vol
imagePullSecrets:
  - name: render-images
volumes:
  # Memory-backed emptyDir replaces the undersized default /dev/shm.
  - name: dshm
    emptyDir:
      medium: Memory
  # The original snippet listed these volumes without a source;
  # emptyDir is assumed here (substitute a PVC for persistence).
  - name: cudaglx-cache-vol
    emptyDir: {}
  - name: cudaglx-root-vol
    emptyDir: {}
```
This is in the YAML file: https://github.com/ehfd/docker-nvidia-glx-desktop/blob/main/xgl.yml
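As a side note, if the node's memory budget is a concern, emptyDir accepts an optional `sizeLimit`, which on recent Kubernetes versions caps the size of a memory-backed volume; the value below is only an example to tune per workload:

```yaml
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi  # example cap; tune to your workload
```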
Please reopen if any issues arise again.
I need to add this to the docs too.
Docs added.