selkies-project / docker-nvidia-glx-desktop

KDE Plasma Desktop container designed for Kubernetes, supporting OpenGL EGL and GLX, Vulkan, and Wine/Proton for NVIDIA GPUs through WebRTC and HTML5, providing an open-source remote cloud/HPC graphics or game streaming platform.
https://github.com/selkies-project/docker-nvidia-glx-desktop/pkgs/container/nvidia-glx-desktop
Mozilla Public License 2.0

Manually install the compat version of the CUDA toolkit in the container #44

Closed remram44 closed 1 year ago

remram44 commented 1 year ago

Currently there are only 4 tags: 18.04, 20.04, 22.04, and latest. Whenever you release a new version, you update every tag and the previous images are gone.

It would be nice to be able to keep referencing a specific version or to grab an old release. For example you could keep 22.04-20230801 (never updated) beside 22.04 (updated).

Case in point: it is impossible to get any Xfce images, because every tag has been overwritten since the switch to KDE.

ehfd commented 1 year ago

There were issues with the Xfce images that affected usage (a pretty peculiar bug, seemingly a simple mistake by the developers that was never patched, which blacked out everything at startup), and that made me discourage maintaining them. Before that, the MATE images had critical DE-related issues as well (a problem with NVIDIA drivers > 490 that was specific to the DE and not patched in Focal). Therefore, I cannot accept any maintenance requests for Xfce, and this is why it was removed.

KDE isn't perfect, but at least it's maintainable. (This container is for VR as well, so I had to choose between GNOME and KDE, the two that support XRDesktop, and GNOME doesn't work without systemd.)

And this project has always been a rolling release; every earlier container has issues that are fixed in the newer ones.

ehfd commented 1 year ago

(The issue with NVIDIA's 535 driver right now is with Xorg, which is DE-independent.)

remram44 commented 1 year ago

Oh I'm not saying you should maintain them, just offer a way to find them, like "normal" release artifacts would be findable.

remram44 commented 1 year ago

Here is Microsoft themselves recommending it: https://learn.microsoft.com/en-us/azure/container-registry/container-registry-image-tag-version

If you don't want to, that's fine; I will just have to copy your images somewhere else for deployment. Having containers break or suddenly change from Xfce to KDE just because they restarted on a different node and pulled something new is not something I want my users to deal with.

remram44 commented 1 year ago

I am automatically copying images to remram44/selkies-docker-nvidia-glx-desktop for everyone wanting stable tags.

ehfd commented 1 year ago

Alright, apparently at least two prolific external users are already clearly dissatisfied with this decision. And I'm not stubborn enough to keep it that way.

I don't want a split in the userbase, and I am willing to find a solution that is maintainable AND keeps users happy. I know that what I did wasn't exactly best practice; it was a compromise to reduce cognitive load and keep mistakes to a minimum.

But my time to maintain this will be limited for at least another 12 months. What's keeping me from working on it is academic matters, so funding will not change things for me (and as this project was never for the income, I will also not accept substantial responsibilities arising from substantial funding in the meantime).

I propose the following as the solution:

  1. The CI process can be improved to fetch a range of CUDA image versions, build them concurrently, and tag them accordingly. I have already added the ARG needed to do that.

  2. I can make tags the way Arch Linux does for its Docker images, such as base-20230709.0.163418. This is what people want, okay.

  3. The KDE update involved splitting the desktop environment from the rest of the components in the Dockerfile. Xfce images could be maintained slightly more easily in the same way. Thus, after building the base images for the separate CUDA image versions and OS versions, the CI step that installs the desktop environment can technically be separated.

However, for Number 3, new bugs arise as time passes and from changes that seem irrelevant. I cannot actively use the Xfce images because I am biased towards KDE. I also use a specific driver and CUDA version, and I cannot fix things that happen on different ones. Thus, this is too much work for one person and adds bugs I cannot fix.

ehfd commented 1 year ago

tl;dr: If you have a configuration (desktop environment, a wide range of additional packages) that you use frequently and you want it to be supported, I will do it if you continuously provide bug reports.

I will only implement the "multiple CUDA versions" section in this proposal and will not implement tagged builds if there is no such volunteer. Tagged builds will multiply the number of bugs to manage exponentially while all tags do technically the same thing.

Else, our development direction is probably your best shot, and you should keep your own forked image builds to choose when to accept our breaking changes.

I still do not recommend staying on our old builds (at least rebuild frequently if you use an old Dockerfile), because who knows what could happen with old ca-bundle and ca-certificate versions, old Firefox versions, another OpenSSL CVE, some random security package, and whatever else?

ehfd commented 1 year ago

Key issues I faced with Xfce that you should probably read:

  1. When ~/.cache or ~/.config/xfce4 is not cleared before the container starts, the desktop environment starts with a black screen for some reason. This wasn't fixed in Focal.
  2. Xfce actually takes more disk space and RAM than KDE, primarily because parts of its stack are programs borrowed from GNOME, MATE, and other desktop environments. As a result, dependencies cannot be shared across programs.
  3. Its compositor is intensive and impacts performance. It's also hard to disable in the container, because that requires tinkering with the home directory. KDE with the compositor disabled (which it should be for remote desktops or VR/gaming applications anyway) gives better results.
  4. Even though it's pretty intensive, it really doesn't offer much. The default theme (which is hard to customize within containers) has few capabilities.
  5. The number of developers is limited and the backend GTK+ stack itself is falling behind with time. While Xfce developers are making heroic efforts to keep maintaining it, its dependencies are falling behind compared to GNOME and other GTK+ desktop environments. KDE has several times as many developers working on the project, and you can go to their IRC channels to ask about capabilities instead of relying on me.

remram44 commented 1 year ago

The plasma shell doesn't show at all for me, so the container broke by surprise. I can debug the new version, but I like my upgrades to be planned.

I'm running the containers from persistent volumes so I can apt-get upgrade.

ehfd commented 1 year ago

So, the settlement:

  1. Intel's Clear Linux manages container tags the same way I do, so I'm not the only one following this practice. But I acknowledge that I make breaking changes, including going from one desktop environment to another. This breaks things for some users.

  2. I cannot do tagged releases (Debian/Ubuntu model). That would substantially slow down my development, and every tag would always be deprecated by the next one.

  3. So, my compromise is the Arch Linux model, where each tag is named after the time the image was built.

  4. In addition, I can also publish a more diverse set of CUDA versions for the driver support matrices.

Everyone happier than before?

ehfd commented 1 year ago

@remram44 All builds are now also tagged with the UTC (Zulu) time of each commit, with the form ghcr.io/selkies-project/nvidia-glx-desktop:22.04-20230906134219 and ghcr.io/selkies-project/nvidia-glx-desktop:20.04-20230906134219.
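
As a rough sketch of how a tag of this form could be derived from a commit time (not the project's actual CI code): the `build_tag` helper below is hypothetical, and it assumes GNU `date`, whose `-d @<epoch>` form is not portable to BSD/macOS.

```shell
# Hypothetical helper: build "<ubuntu release>-<UTC timestamp>" from a Unix epoch.
# Assumes GNU date; the timestamp layout is YYYYMMDDHHMMSS in UTC.
build_tag() {
  local release="$1" epoch="$2"
  printf '%s-%s\n' "$release" "$(date -u -d "@${epoch}" +%Y%m%d%H%M%S)"
}

# 1694007739 is 2023-09-06 13:42:19 UTC
build_tag 22.04 1694007739   # prints 22.04-20230906134219
```

In a checkout, the epoch could come from `git log -1 --format=%ct` (the committer timestamp); whether the real workflow uses the committer or author date is an assumption here.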

Please tell me if this is what you want. Again, thanks for your feedback and for helping me bring the repository's behavior in line with best practice.

This issue will still be kept open, as I would like to manually install cuda-compat in the container, to support all driver versions since 450 while opportunistically upgrading the CUDA version.

ehfd commented 1 year ago

One discovery is that cuda-compat is already installed in the containers. Thus, all that is required is to add LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH.

An observation is that this directory contains the CUDA userspace driver libraries (separate from the kernel modules) libcuda.so.1, libnvidia-nvvm.so.4, and libnvidia-ptxjitcompiler.so.1, versioned to match their corresponding driver release, and that they take priority over the host's own CUDA drivers.
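
A minimal sketch of the LD_LIBRARY_PATH change, written defensively. The `prepend_compat_dir` function is hypothetical and not part of the container's entrypoint. It skips the change when the directory is missing, and it avoids the trailing colon that a plain `dir:$LD_LIBRARY_PATH` leaves behind when the variable is empty.

```shell
# Hypothetical helper: put a CUDA compat directory first in LD_LIBRARY_PATH.
# Defaults to the path mentioned above; returns 1 if the directory is absent.
prepend_compat_dir() {
  local dir="${1:-/usr/local/cuda/compat}"
  [ -d "$dir" ] || return 1
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$dir:"*) ;;  # already listed; leave the path untouched
    *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}
```

Calling `prepend_compat_dir` with no argument before launching a CUDA application would apply it to /usr/local/cuda/compat.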

https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatible-upgrade
https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.2.0/ubuntu2004/base/Dockerfile

https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatible-upgrade
https://docs.heavy.ai/installation-and-configuration/installation/upgrading-omnisci/cuda-compatibility-drivers

I am, however, unsure of whether GeForce GPUs are supported. Need to test 1070/80, 20xx, 30xx, 40xx.

remram44 commented 1 year ago

I'm going to unfollow this issue since it seems to have completely changed topic, from Docker images history to CUDA dylibs. Good luck with CUDA!

ehfd commented 1 year ago

@remram44 One thing. There was a major CVE with libwebp and libvpx. If you have images built before Oct 3, you should rebuild them, because it was a major security issue.

https://ubuntu.com/security/notices/USN-6369-1 https://ubuntu.com/security/notices/USN-6403-1

This is the "security concern" I talked about.

ehfd commented 1 year ago

So far (updating):

Major version forward compatibility from Driver 515.xx (CUDA 11.7) to CUDA 12.2 with forward compatibility libraries in path:
Forward compatibility working on: Quadro M6000, Tesla V100 SXM2 32GB, Tesla T4, RTX A6000, A10, A40, A100 PCIE 40GB, A100 SXM4 80GB
Forward compatibility not working on: GTX 1070, GTX 1080 Ti, TITAN Xp, RTX 2080 Ti, RTX 3090

Failed to init cuda, cuInit ret: 0x324: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW

Minor version forward compatibility from Driver 525.xx (CUDA 12.0) to CUDA 12.2 with forward compatibility libraries in path:
Forward compatibility not working on: RTX 3090

Failed to init cuda, cuInit ret: 0x324: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW

Minor version forward compatibility from Driver 530.xx (CUDA 12.1) to CUDA 12.2 with forward compatibility libraries in path:
Forward compatibility not working on: GTX 1080, GTX 1080 Ti, RTX 3090, RTX 4090, Quadro RTX 6000, RTX A4000, RTX A6000

This is documented as not supported because 530.xx is a New Feature Branch.

Failed to init cuda, cuInit ret: 0x323: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination

So far, this is a pretty useless feature, because it produces the CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE or CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error even when minor version forward compatibility is supposed to exist on GeForce GPUs!
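
For anyone scripting around these failures, here is a small sketch that classifies a log line by its cuInit return code. The function name and the labels it prints are hypothetical. The two codes and their meanings are taken from the error lines above.

```shell
# Hypothetical helper: classify a cuInit failure line by its hex return code.
classify_cuinit_error() {
  local code
  # Pull out the hex value that follows "cuInit ret: " (empty if absent).
  code="$(printf '%s\n' "$1" | sed -n 's/.*cuInit ret: \(0x[0-9a-fA-F]*\).*/\1/p')"
  case "$code" in
    0x324) echo "compat-unsupported-hardware" ;;  # CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE
    0x323) echo "driver-mismatch" ;;              # CUDA_ERROR_SYSTEM_DRIVER_MISMATCH
    "")    echo "no-cuinit-code" ;;
    *)     echo "other:$code" ;;
  esac
}
```

A launcher could, for example, drop the compat directory from LD_LIBRARY_PATH and retry when it sees either of the first two results. That fallback is only a suggestion and was not tested in this thread.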

ehfd commented 1 year ago

My final solution:

# Extract NVRTC dependency, https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/LICENSE.txt
# Download the wheel, unpack it flat, add an unversioned libnvrtc.so symlink, then move the libraries next to GStreamer
cd /tmp && \
curl -fsSL -o nvidia_cuda_nvrtc_linux_x86_64.whl "https://developer.download.nvidia.com/compute/redist/nvidia-cuda-nvrtc/nvidia_cuda_nvrtc-11.0.221-cp36-cp36m-linux_x86_64.whl" && \
unzip -joq -d ./nvrtc nvidia_cuda_nvrtc_linux_x86_64.whl && \
cd nvrtc && \
chmod 755 libnvrtc* && \
find . -maxdepth 1 -type f -name "*libnvrtc.so.*" -exec sh -c 'ln -snf "$(basename "$1")" libnvrtc.so' sh {} \; && \
mv -f libnvrtc* /opt/gstreamer/lib/x86_64-linux-gnu/ && \
cd /tmp && rm -rf /tmp/*

Since Selkies-GStreamer only requires libnvrtc.so, I extracted the libraries from a .whl file and put them in /opt/gstreamer/lib/x86_64-linux-gnu. This eliminates the whole CUDA runtime from the container and supports NVIDIA drivers >= 450.
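
A quick sanity check after a step like this might look like the sketch below. `check_nvrtc` is a hypothetical helper, not something shipped in the image. It only confirms that the unversioned libnvrtc.so symlink resolves to a regular file, not that the library actually loads.

```shell
# Hypothetical check: succeed only if <dir>/libnvrtc.so is a symlink
# that resolves to a regular file ([ -f ] follows the link).
check_nvrtc() {
  local dir="${1:-/opt/gstreamer/lib/x86_64-linux-gnu}"
  [ -L "$dir/libnvrtc.so" ] && [ -f "$dir/libnvrtc.so" ]
}
```

For example, `check_nvrtc || echo "libnvrtc.so missing or dangling" >&2` could run at the end of the build step.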

Discussion continues: https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/3108

ehfd commented 1 year ago

Just an FYI: You'll like the latest commit. @remram44