selkies-project / docker-nvidia-glx-desktop

KDE Plasma Desktop container designed for Kubernetes, supporting OpenGL EGL and GLX, Vulkan, and Wine/Proton for NVIDIA GPUs through WebRTC and HTML5, providing an open-source remote cloud/HPC graphics or game streaming platform.
https://github.com/selkies-project/docker-nvidia-glx-desktop/pkgs/container/nvidia-glx-desktop
Mozilla Public License 2.0

Docker Compose #14

Closed johncadengo closed 3 years ago

johncadengo commented 3 years ago

I'm trying to convert your example docker run command into a docker compose file. The command is:

docker run --gpus 1 -it -e TZ=UTC -e SIZEW=1920 -e SIZEH=1080 -e SHARED=TRUE -e PASSWD=mypasswd -e VIDEO_PORT=DFP -p 8080:8080 ehfd/nvidia-glx-desktop:latest

Here is my file:

version: '3.8'
services:
  nvidia-glx-desktop:
    image: 'ehfd/nvidia-glx-desktop:latest'
    environment:
      - TZ=UTC
      - SIZEW=1920
      - SIZEH=1080
      - SHARED=TRUE
      - PASSWD=mypasswd
      - VIDEO_PORT=DFP
    ports:
      - '8080:8080'
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu, utility]

When I run it with your command, xrandr returns the correct value, mimicking a screen. However, when I run it with docker compose, xrandr reports a virtual screen with a size of something like 32000 x 32000. Is there some subtle difference I'm not understanding?

ehfd commented 3 years ago

Please upload the /tmp/bootstrap--stdout.log and /var/log/Xorg.0.log files.

johncadengo commented 3 years ago

/tmp/bootstrap--stdout.log https://pastebin.com/irk8YWBA

/var/log/Xorg.0.log https://pastebin.com/AFhWZzgQ

johncadengo commented 3 years ago

By the way, I have 2 GPUs on this system. So I am just sharing one with the docker container in testing.

ehfd commented 3 years ago

The bootstrap-stdout.log link is invalid, so I cannot diagnose the whole issue, but try using DP-0 for VIDEO_PORT. I see that it is a Quadro GPU. If that doesn't work, I think it might be a driver problem, either in the compose settings or in the container toolkit.

johncadengo commented 3 years ago

My apologies, a few characters were cut off in the copy and paste. Here's the link (and updated the original comment): https://pastebin.com/irk8YWBA

johncadengo commented 3 years ago

Changing the environment variable to VIDEO_PORT=DP-0 did not change the xrandr output. It's still the same. Maybe the bootstrap log will help.

ehfd commented 3 years ago

I might need to replicate stuff. I'll try using docker-compose myself. Please be a bit patient.

ehfd commented 3 years ago

version: '3.8'
services:
  glx:
    image: 'ghcr.io/ehfd/nvidia-glx-desktop:latest'
    environment:
      - TZ=UTC
      - SIZEW=1920
      - SIZEH=1080
      - SHARED=TRUE
      - PASSWD=mypasswd
      - VIDEO_PORT=DFP
    ports:
      - '8080:8080'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu, utility]
    stdin_open: true
    tty: true

I was unable to reproduce any issues with this configuration. Try using the latest docker-compose version rather than the one installed with the package manager. Also, you have to take down a container that you have started up with docker-compose before starting a new one.

johncadengo commented 3 years ago

I just tested it with your docker-compose configuration and am still encountering the same error. I updated docker-compose to the latest version for my kernel, which is:

$ docker-compose --version
docker-compose version 1.29.2, build 5becea4c

Still experiencing the issue. I'll try other things to debug it.

ehfd commented 3 years ago

Strongly suspect an issue in your driver installation or NVIDIA container runtime.

johncadengo commented 3 years ago

Well, what's weird is that it works when I use docker run. It just doesn't work when I'm using docker compose, so I'm not sure what about the driver installation is different between those two commands. Seems like there might be an implicit configuration being set differently between the two commands? Or a different level of privilege or access set by default?

ehfd commented 3 years ago

The code is in the midst of an overhaul and will be made public around next week. While the overhaul seems unrelated to this issue for now, it could change things.

johncadengo commented 3 years ago

@ehfd looking forward to the updates.

Just wanted to let you know, for some reason, the most recent version of this repo that works on my computer is the commit from March 12: https://github.com/ehfd/docker-nvidia-glx-desktop/commit/cec9907cf2ad826aac53946e40bb9226fc4ea5b1

The commits after that get stuck at the bootstrap phase for some reason. I could dig further and grab you the logs later, but just thought to let you know.

johncadengo commented 3 years ago

Ok, here are two interesting things I ran into:

  1. For some reason, I had a version of the Nvidia drivers installed that was not on the site you use to bootstrap the drivers: https://download.nvidia.com/XFree86/Linux-x86_64/

     My version was 460.106.00, which must have come from a PPA or something. I'm not sure. However, because it wasn't on that website, it wasn't working. I ended up finding a version that matched one of the available versions and that worked for me.

  2. The URL itself changed across your commits: originally it was https://us.download.nvidia.com/XFree86/Linux-x86_64/ and the us subdomain was later dropped. That's important to note when using older versions of this repo. It basically renders any of the older commits unusable unless the user updates that URL manually.

ehfd commented 3 years ago

Ok, I understand what the issue is. It is the same issue as #16, then.

ehfd commented 3 years ago

New release with commit 952ff0c8ca3161c7f38146d775b3b9826b4dc06a

johncadengo commented 3 years ago

@ehfd Thanks for the update. Looks like a lot of great changes! 💯

In the readme, you mention that you should only start up one xserver per GPU. You're referring to the guest xservers, I'm assuming, so only 1 guest container per GPU? Is there any plan to support multiple containers per GPU, either in this or the EGL repo?

Also, the gstreamer interface looks really promising. What are the advantages over novnc? Is it for audio support or is it also for better performance?

Great job, and thanks again!

ehfd commented 3 years ago

In the readme, you mention that you should only start up one xserver per GPU. You're referring to the guest xservers, I'm assuming, so only 1 guest container per GPU? Is there any plan to support multiple containers per GPU, either in this or the EGL repo?

Yes, with the GLX container there are only guest X servers and no host X servers, and it supports 1 guest container per GPU out of the box. However, it's possible to create multiple screens by allocating each screen to a different physical video port, which involves changing the entrypoint.sh script and starting a noVNC or WebRTC instance on a different port for each screen. The EGL container (to be updated to support WebRTC) supports multiple containers per GPU out of the box and can also fall back to software acceleration, because it does not use an Xorg server with NVIDIA drivers. It still has restrictions, such as having no Vulkan. Things are expected to become more flexible once NVIDIA's Wayland compatibility matures and some new browser capabilities are implemented.

Also, the gstreamer interface looks really promising. What are the advantages over novnc? Is it for audio support or is it also for better performance?

It uses the same underlying protocols as common "game streaming" services such as Parsec, Rainway, GeForce NOW, Google Stadia, and others (all supporting Windows hosts only, if they support user-provided hosts at all). It works well in conditions that require bleeding-edge graphics capabilities because it uses H.264 AVC instead of libjpeg-turbo (RFB/VNC) or libpng (Guacamole). Where frequent screen refreshes are required, performance seems to be WebRTC >>> noVNC > Guacamole, and noVNC also does not support audio. But WebRTC is more complicated to set up if it requires a TURN server, and it is a compromise made to achieve latency that is not attainable with WebSockets (but this will change over the years).

https://cloud.google.com/architecture/gpu-accelerated-streaming-using-webrtc Selkies-gstreamer was developed by the person who wrote this.

https://dx.doi.org/10.13140/RG.2.2.29960.96005 And I wrote this (to be updated to explain the new release).

seanrmurphy commented 3 years ago

I'm not sure I fully understand the point about single X session per GPU, but I can report that we have been able to get 20 running X instances on a single T4 using nvidia-docker as the runtime in our kubernetes cluster (This was just a simple test to get some understanding of possibilities - in this case, we were actually memory bound on the VM and for these vanilla X sessions doing nothing, we could prob have put more on the T4 - we will do some slightly more demanding dimensioning experiments and can put a few notes on this thread).

johncadengo commented 3 years ago

@seanrmurphy thanks for sharing your experience. I'd really appreciate if you could put a few notes on this thread, and I'll share what I can after trying to replicate your results. I'm personally not using Kubernetes, just docker, but it would be great to see how you're able to get 20 X instances up at once. That sounds great.

johncadengo commented 3 years ago

@ehfd great work. Thanks for sharing your research. Very interesting stuff. (I'm a UCSD alum, so it's great to see my university affiliated with this work). I'm excited to see how the webRTC performs in my use cases and I'm glad to see so much progress in the development of this idea.

ehfd commented 3 years ago

I'm not sure I fully understand the point about single X session per GPU, but I can report that we have been able to get 20 running X instances on a single T4 using nvidia-docker as the runtime in our kubernetes cluster (This was just a simple test to get some understanding of possibilities - in this case, we were actually memory bound on the VM and for these vanilla X sessions doing nothing, we could prob have put more on the T4 - we will do some slightly more demanding dimensioning experiments and can put a few notes on this thread).

This might be a difference in behavior between datacenter GPUs and consumer GPUs. I'd very much welcome hearing more about it, and this could be a great bonus. Please continue this in #11.

ehfd commented 3 years ago

Well, what's weird is that it works when I use docker run. It just doesn't work when I'm using docker compose, so I'm not sure what about the driver installation is different between those two commands. Seems like there might be an implicit configuration being set differently between the two commands? Or a different level of privilege or access set by default?

@johncadengo So, was this issue resolved?

johncadengo commented 3 years ago

@ehfd yes, this issue had to do with the nvidia driver. Thanks for helping me along with it. I might be needing some more help, but I will create another issue for it after I've troubleshooted.

ehfd commented 3 years ago

@ehfd yes, this issue had to do with the nvidia driver. Thanks for helping me along with it. I might be needing some more help, but I will create another issue for it after I've troubleshooted.

Closing for now then.

ehfd commented 2 years ago

A further note on this issue: capabilities: [gpu, utility] is not enough. Either capabilities: all should be set, or graphics and display must be included.
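
As a rough sketch of what that note implies for the compose files posted earlier in this thread, the device reservation might look like the fragment below. It reuses the service name and image from the configuration above and only extends the capabilities list with graphics and display. That these names are passed through to the NVIDIA container runtime as driver capabilities is an assumption here, not something confirmed in this thread, and the "all" variant is left out because its exact syntax is not shown above.

```yaml
version: '3.8'
services:
  glx:
    image: 'ghcr.io/ehfd/nvidia-glx-desktop:latest'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              # The original list was [gpu, utility]; graphics and display
              # are the additions suggested in the note above.
              # Assumption: the runtime treats these as driver capabilities.
              capabilities: [gpu, utility, graphics, display]
```

The environment, ports, stdin_open and tty keys from the earlier file are omitted for brevity and would stay unchanged.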

ehfd commented 2 years ago

Added to the documentation.