There were issues with the Xfce images that affected usage (a fairly peculiar bug that looked like a simple upstream mistake, was never patched, and blacked out everything at startup), which discouraged me from maintaining them. There were also critical, DE-specific issues in the MATE images before that (an issue with NVIDIA drivers > 490, specific to the DE, that was never patched in Focal). Therefore, I cannot accept maintenance requests for Xfce, and this is why it was removed.
KDE isn't perfect, but at least it's maintainable. (This container is also for VR, so I had to choose between GNOME and KDE, the two desktop environments that support XRDesktop, and GNOME doesn't work without systemd.)
And this project is a rolling release: every earlier container has issues that are fixed in the newer ones.
(The current issue with NVIDIA's 535 driver is with Xorg, which is DE-independent.)
Oh, I'm not saying you should maintain them, just offer a way to find them, the way "normal" release artifacts are findable.
Here is Microsoft themselves recommending it: https://learn.microsoft.com/en-us/azure/container-registry/container-registry-image-tag-version
If you don't want to, that's fine; I will have to copy your images somewhere else for deployment. Having a container break or suddenly change from Xfce to KDE just because it restarted on a different node and pulled something new is not something I want my users to deal with.
I am automatically copying images to remram44/selkies-docker-nvidia-glx-desktop for everyone wanting stable tags.
Alright, apparently at least two prolific external users are clearly unsatisfied with this decision, and I'm not stubborn enough to keep it that way.
I don't want a split in the userbase, and I am willing to find a solution that is maintainable AND keeps users happy. I know that what I did wasn't exactly best practice; it was a compromise to reduce cognitive load and keep mistakes to a minimum.
But I will have limited time to maintain this for at least another 12 months, and what's keeping me from the work is academic matters, so providing funding will not change things for me (and as this project was never about income, I will also not accept the substantial responsibilities that come with substantial funding in the meantime).
I propose the following solution:
1. The CI process can be improved to fetch a range of CUDA base image versions, build them concurrently, and tag the results accordingly. I have already added the ARG needed for this.
2. I can tag images the way Arch Linux does for its Docker images, for example base-20230709.0.163418. This is what people want, okay.
3. The KDE update involved splitting the desktop environment from the rest of the components in the Dockerfile. Xfce images can be maintained slightly more easily in the same way: after building base images for each combination of CUDA image version and OS version, the CI step that installs the desktop environment can be separated out.
However, for number 3, new bugs arise as time passes and from changes that seem irrelevant. I do not actively use the Xfce images because I am biased towards KDE, and I use one specific driver and CUDA version, so I cannot fix problems that only show up with different ones. This is too much work for one person and adds bugs I cannot fix.
tl;dr: If you have a configuration (desktop environment, a wide range of additional packages) that you use frequently and want supported, I will support it as long as you continuously provide bug reports.
If there is no such volunteer, I will only implement the "multiple CUDA versions" part of this proposal and will not implement tagged builds, since tagged builds would multiply the number of bugs to manage while all tags technically do the same thing.
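A rough sketch of what the "multiple CUDA versions" CI step could look like (illustrative only, not the project's actual workflow; the CUDA_VERSION build argument name, the version list, and the tag scheme here are assumptions):

# Build the image against several CUDA base image versions concurrently and tag
# each result with the OS version and the CUDA version it was built against
for CUDA_VERSION in 11.8.0 12.1.1 12.2.2; do
  docker build \
    --build-arg CUDA_VERSION="${CUDA_VERSION}" \
    -t "ghcr.io/selkies-project/nvidia-glx-desktop:22.04-cuda${CUDA_VERSION}" . &
done
wait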
Otherwise, following our development direction is probably your best option, and you should keep your own forked image builds so you can choose when to accept our breaking changes.
I still do not recommend staying on our old builds (at least rebuild frequently if you use an old Dockerfile), because who knows what could happen with old ca-bundle and ca-certificates versions, old Firefox versions, another OpenSSL CVE, some random security package, and whatever else?
Key issues I faced with Xfce that you should probably read:
If ~/.cache or ~/.config/xfce4 is not cleared before the container starts, the desktop environment starts to a black screen for some reason. It wasn't fixed in Focal.

The Plasma shell doesn't show at all for me, so the container broke by surprise. I can debug the new version, but I like my upgrades to be planned. I'm running the containers from persistent volumes so I can apt-get upgrade.
So, the settlement:
Intel's Clear Linux manages container tags the same way I do, so I wasn't the only one following this practice. But I acknowledge that I make breaking changes, including switching from one desktop environment to another, and this breaks things for some users.
I cannot do tagged releases (the Debian/Ubuntu model). This would substantially slow the development pace I have kept, and every tag would immediately be deprecated by the next one.
So my compromise is the Arch Linux model, where images are tagged with the time they were built.
In addition, I can publish a more diverse set of CUDA versions to cover a wider driver support matrix.
Everyone happier than before?
@remram44 All builds are now also tagged with the UTC (Zulu) time of each commit, in the form ghcr.io/selkies-project/nvidia-glx-desktop:22.04-20230906134219 and ghcr.io/selkies-project/nvidia-glx-desktop:20.04-20230906134219.
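For example, a deployment can now pin one of these immutable dated tags instead of a mutable OS tag (a minimal sketch using the tag quoted above; any dated tag works the same way):

# Pull a specific dated build so a restart on another node gets exactly the same image
docker pull ghcr.io/selkies-project/nvidia-glx-desktop:22.04-20230906134219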
Please tell me if this is what you want. Again, thanks for your feedback and for helping me move this repository's practices toward best practice.
This issue will still be kept open, as I would like to manually install cuda-compat in the container to support all driver versions since 450 while opportunistically upgrading the CUDA version.
One discovery is that cuda-compat is already installed in the containers. Thus, all that is required is to add LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH.
An observation is that this directory contains the CUDA userspace driver libraries (separate from the kernel modules) libcuda.so.1, libnvidia-nvvm.so.4, and libnvidia-ptxjitcompiler.so.1, with versions matching the driver version the cuda-compat package targets, and they take priority over the host's own CUDA driver libraries.
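A minimal sketch of how this can be inspected and enabled inside the container (the existence guard is an illustrative assumption, not the project's actual entrypoint logic):

# List the userspace driver libraries shipped by the cuda-compat package
ls -l /usr/local/cuda/compat/
# Prepend them so they take priority over the host-injected driver libraries,
# but only if the compat directory is actually populated
if [ -e /usr/local/cuda/compat/libcuda.so.1 ]; then
  export LD_LIBRARY_PATH="/usr/local/cuda/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
fi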
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatible-upgrade
https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.2.0/ubuntu2004/base/Dockerfile
https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatible-upgrade
https://docs.heavy.ai/installation-and-configuration/installation/upgrading-omnisci/cuda-compatibility-drivers
I am, however, unsure of whether GeForce GPUs are supported. Need to test 1070/80, 20xx, 30xx, 40xx.
I'm going to unfollow this issue since it seems to have completely changed topic, from Docker image history to CUDA dylibs. Good luck with CUDA!
@remram44 One more thing: there was a major CVE in libwebp and libvpx. If you have images built before Oct 3, you should rebuild them, because it was a major security issue.
https://ubuntu.com/security/notices/USN-6369-1 https://ubuntu.com/security/notices/USN-6403-1
This is the "security concern" I talked about.
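For anyone mirroring or self-building these images, a minimal rebuild sketch (the target tag is illustrative; rebuilding without cached layers picks up the patched Ubuntu packages from the security notices above):

# Rebuild from the Dockerfile with a freshly pulled base image and no cached layers
# so the patched libwebp/libvpx packages are installed; the tag is an example only
docker build --pull --no-cache -t my-registry/nvidia-glx-desktop:22.04-rebuilt .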
So far (updating):

Major version forward compatibility from Driver 515.xx (CUDA 11.7) to CUDA 12.2 with the forward compatibility libraries in the path:
Forward compatibility working on: Quadro M6000, Tesla V100 SXM2 32GB, Tesla T4, RTX A6000, A10, A40, A100 PCIE 40GB, A100 SXM4 80GB
Forward compatibility not working on: GTX 1070, GTX 1080 Ti, TITAN Xp, RTX 2080 Ti, RTX 3090
Failed to init cuda, cuInit ret: 0x324: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW

Minor version forward compatibility from Driver 525.xx (CUDA 12.0) to CUDA 12.2 with the forward compatibility libraries in the path:
Forward compatibility not working on: RTX 3090
Failed to init cuda, cuInit ret: 0x324: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW

Minor version forward compatibility from Driver 530.xx (CUDA 12.1) to CUDA 12.2 with the forward compatibility libraries in the path:
Forward compatibility not working on: GTX 1080, GTX 1080 Ti, RTX 3090, RTX 4090, Quadro RTX 6000, RTX A4000, RTX A6000
This is documented as not supported because 530.xx is a New Feature Branch.
Failed to init cuda, cuInit ret: 0x323: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
So far, this is a pretty useless feature, because it shows the CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE or CUDA_ERROR_SYSTEM_DRIVER_MISMATCH error even where minor version forward compatibility is supposed to exist on GeForce GPUs!
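A minimal sketch of how such a check can be reproduced (it assumes some CUDA driver API test program is available in the container, for example the deviceQuery binary from the CUDA samples, which this project does not ship):

# Put the forward-compatibility libraries first in the search path, then try to
# initialize CUDA; on unsupported GPU/driver combinations cuInit fails with
# 0x324 (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) or 0x323 (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH)
export LD_LIBRARY_PATH="/usr/local/cuda/compat:${LD_LIBRARY_PATH}"
./deviceQuery || echo "cuInit failed: forward compatibility not usable on this GPU/driver"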
My final solution:
# Extract NVRTC dependency, https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/LICENSE.txt
cd /tmp && \
  curl -fsSL -o nvidia_cuda_nvrtc_linux_x86_64.whl "https://developer.download.nvidia.com/compute/redist/nvidia-cuda-nvrtc/nvidia_cuda_nvrtc-11.0.221-cp36-cp36m-linux_x86_64.whl" && \
  unzip -joq -d ./nvrtc nvidia_cuda_nvrtc_linux_x86_64.whl && \
  cd nvrtc && chmod 755 libnvrtc* && \
  find . -maxdepth 1 -type f -name "*libnvrtc.so.*" -exec sh -c 'ln -snf $(basename {}) libnvrtc.so' \; && \
  mv -f libnvrtc* /opt/gstreamer/lib/x86_64-linux-gnu/ && \
  cd /tmp && rm -rf /tmp/*
Since Selkies-GStreamer only requires libnvrtc.so, I extracted the libraries from a .whl file and put them in /opt/gstreamer/lib/x86_64-linux-gnu. This eliminates the whole CUDA runtime dependency and supports NVIDIA drivers >= 450.
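A quick sanity check of the result (paths per the step above; the unversioned symlink is the one the find/ln command creates):

# The libnvrtc.so symlink should point at the versioned library that was
# extracted from the wheel into the GStreamer library directory
ls -l /opt/gstreamer/lib/x86_64-linux-gnu/libnvrtc.so /opt/gstreamer/lib/x86_64-linux-gnu/libnvrtc.so.*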
Discussion continues: https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/3108
Just an FYI: You'll like the latest commit. @remram44
Currently there are only 4 tags: 18.04, 20.04, 22.04, latest. Whenever you release a new version, you update every tag and the previous images are gone.

It would be nice to be able to keep referencing a specific version or to grab an old release. For example, you could keep 22.04-20230801 (never updated) beside 22.04 (updated).

Case in point: it is impossible to get any Xfce images, since every tag has been overwritten since the switch to KDE.