selkies-project / docker-nvidia-glx-desktop

KDE Plasma Desktop container designed for Kubernetes, supporting OpenGL EGL and GLX, Vulkan, and Wine/Proton for NVIDIA GPUs through WebRTC and HTML5, providing an open-source remote cloud/HPC graphics or game streaming platform.
https://github.com/selkies-project/docker-nvidia-glx-desktop/pkgs/container/nvidia-glx-desktop
Mozilla Public License 2.0
324 stars 67 forks

X11 not starting #29

Closed DAB0mB closed 1 year ago

DAB0mB commented 2 years ago

Hello! Kudos for this great image. I've been trying to run a deployment on my Kubernetes cluster with nvidia-glx-desktop. Sometimes it works, but sometimes it doesn't (could be how things are cached by the cloud provider?). According to the logs, it looks like the entrypoint scripts are waiting for X11 to start and are stuck in a perpetual loop. Why do you think X11 isn't starting and how can I manually start it?

ehfd commented 2 years ago

Hi! First, check the troubleshooting section: https://github.com/ehfd/docker-nvidia-glx-desktop#troubleshooting. If that doesn't work for you, please post all three logs located in /tmp.

DAB0mB commented 2 years ago

Here are the log files in case of failure, including Xorg.0.log:

entrypoint-fail-stdout.log pulseaudio-fail-stdout.log selkies-gstreamer-fail-stdout.log Xorg.0.fail.log

And here are the log files in case of success:

entrypoint-stdout.log pulseaudio-stdout.log selkies-gstreamer-stdout.log Xorg.0.log

Here's the xorg.conf:

xorg.conf.txt

According to Xorg.0.fail.log:

Screen(s) found, but none have a usable configuration.

To me it seems like everything is set up correctly, but something appears to be wrong with the nodes. I have no access to any of them, but I can submit a ticket to the cloud provider if necessary.
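
For anyone comparing a failing and a working Xorg log like the ones above, here is a minimal sketch (not part of the image) that keeps only the lines Xorg marks as errors `(EE)` or warnings `(WW)`. The function name `xorg_problems` is hypothetical, and the optional `[ seconds ]` timestamp prefix is an assumption about the log layout.

```python
import re

# Match an (EE) or (WW) marker at the start of a line, optionally preceded by
# a "[   12.345]" timestamp. Anchoring at the line start skips the legend in
# the log header, whose continuation lines are indented.
MARKER = re.compile(r"^(?:\[\s*\d+\.\d+\]\s+)?\((?:EE|WW)\)")


def xorg_problems(log_text: str) -> list[str]:
    """Return the error and warning lines of an Xorg log, in original order."""
    return [line.strip() for line in log_text.splitlines() if MARKER.match(line)]
```

It could be called as `xorg_problems(open(path).read())`, where `path` points at wherever the log was saved. Diffing the output for the failing and working runs narrows down what changed between them.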

ehfd commented 2 years ago

@DAB0mB Thanks for your cooperation. It seems that NVIDIA_DRIVER_CAPABILITIES is insufficient. Please ask your provider to make sure that it is set to all, or that it includes utility,graphics,video,display. It won't work just by changing it inside the container.
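
A minimal sketch of that check, assuming the value is read from the environment inside the running container. It only reports what the container was started with and does not change anything. The helper name `missing_capabilities` is hypothetical, and treating an empty value as granting nothing is a simplification.

```python
import os

# Capabilities named in the comment above as needed by this container.
REQUIRED = {"utility", "graphics", "video", "display"}


def missing_capabilities(value: str) -> set[str]:
    """Return the required capabilities that a comma-separated value lacks."""
    caps = {c.strip() for c in value.split(",") if c.strip()}
    if "all" in caps:
        return set()
    return REQUIRED - caps


# Simplification: an unset variable is treated here as an empty list.
current = os.environ.get("NVIDIA_DRIVER_CAPABILITIES", "")
print("missing:", sorted(missing_capabilities(current)))
```

If anything is listed as missing, the fix has to happen where the container is created (the provider's runtime or pod configuration), not inside the container.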

ehfd commented 2 years ago

If this works, please don't close. I have to update it in the docs.

ehfd commented 2 years ago

If that is not possible, docker-nvidia-egl-desktop is much more likely to work.

DAB0mB commented 2 years ago

Thank you! Will update shortly once they get back to me

ehfd commented 2 years ago

@DAB0mB Any updates? Also, I explained in https://github.com/ehfd/docker-nvidia-glx-desktop#the-container-doesnt-work what each NVIDIA_DRIVER_CAPABILITIES entry does for the container.

DAB0mB commented 2 years ago

Hey @ehfd, they're looking into it. I've been told that NVIDIA_DRIVER_CAPABILITIES is set to all by default and that it doesn't seem to be the issue. I will update when I have anything.

ehfd commented 2 years ago

Note that the NVIDIA A100 in MIG mode does not support any GUI operations: no OpenGL or Vulkan, and no X.Org.

ehfd commented 2 years ago

@DAB0mB One more possible cause: is there an X server on the host? In other words, is there a GUI on the host? Or is there more than one X server trying to run for one GPU?

ehfd commented 2 years ago

I also updated the container.

jfeldman325 commented 2 years ago

@DAB0mB We have identified that the issue on CoreWeave is likely due to the k8s container memory requirements for the shm (shared memory) feature in X11. Adding something like the following should resolve the issue:

        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /cache
          name: cudaglx-cache-vol
        - mountPath: /home/user
          name: cudaglx-root-vol
      imagePullSecrets:
      - name: render-images
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
      - name: cudaglx-cache-vol
      - name: cudaglx-root-vol
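
To confirm from inside the container that the memory-backed mount took effect, a minimal sketch along these lines could report the size of /dev/shm. The helper name `shm_report` and the 1 GiB threshold are assumptions for illustration, not values taken from the image or from CoreWeave.

```python
import shutil


def shm_report(path: str = "/dev/shm", min_bytes: int = 1 << 30) -> tuple[int, bool]:
    """Return (total bytes of the filesystem at path, whether it meets min_bytes)."""
    total = shutil.disk_usage(path).total
    # min_bytes is a hypothetical threshold; choose one that suits the workload.
    return total, total >= min_bytes
```

For example, `total, ok = shm_report()` run in the pod gives the mounted size in bytes. A very small value suggests the emptyDir volume with medium: Memory was not mounted at /dev/shm.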

ehfd commented 2 years ago

https://github.com/ehfd/docker-nvidia-glx-desktop/blob/main/xgl.yml

This is in the yaml file.

Please reopen if any issues arise again.

ehfd commented 2 years ago

I need to add this to the docs too.

ehfd commented 1 year ago

Docs added.