lebensterben / awesome-clear-linux

Let's make Clear Linux distribution great
GNU General Public License v3.0

Unable to load the nvidia-drm kernel module #9

Closed jonahbron closed 4 years ago

jonahbron commented 4 years ago

I just ran install.sh today and got this error:

ERROR: Unable to load the 'nvidia-drm' kernel module.

Kernel version is 5.4.42-40.lts2019, Clear Linux system version is 33180.

What could cause this?

lebensterben commented 4 years ago

@jonahbron Do you see this error when you run the script or after you run it?

jonahbron commented 4 years ago

During. Here's an image.

[Image: MVIMG_20200525_102128]

jonahbron commented 4 years ago

I managed to get my system back up by uninstalling the NVidia driver and rolling back to kernel lts2018 (instead of the current lts2019).

I think this may be an issue with the driver itself (440.82). I think that may be the case because I've tried installing with several different kernel versions, but none of them work. I might try to install an older driver version.

jonahbron commented 4 years ago

Tried with a different version of the NVidia driver (435.21). No change in behavior, still unable to load nvidia-drm.

jonahbron commented 4 years ago

I thought a full system wipe might fix it. I reinstalled Clear Linux from scratch with the lts kernel. I'm still getting the same error. Tried with both the native and lts versions.

lebensterben commented 4 years ago

@jonahbron I can reproduce this error.

nvidia-drm is a kernel module required by the NVIDIA proprietary driver. During installation, the installer also adds an Xorg config file to load this module.

What confuses me is that the DKMS build log says the build succeeded, and the very next line is an error about nvidia-drm being missing.

FanFani4 commented 4 years ago

I get the same error. I tried different kernel versions and none of them works. Any ideas how to fix this?

jonahbron commented 4 years ago

Relieved to hear it's not just me 😄 . I posted about this on the NVidia forums, so watch this thread.

https://forums.developer.nvidia.com/t/clear-linux-error-unable-to-load-the-nvidia-drm-kernel-module-gtx-970/124836/3

If a solution is found, I'll update this issue so we can patch the installer script.

My only idea is that Clear Linux recently updated the GCC version. I wonder if that could be related.

jonahbron commented 4 years ago

Also created an issue for Clear Linux directly.

https://github.com/clearlinux/distribution/issues/1994

FanFani4 commented 4 years ago

Rolling back the OS version fixed the issue for me:

swupd verify --fix --picky -m 32990

I'm not sure if it is safe to do it that way, but the drivers are working :)
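A minimal sketch of the rollback above, wrapped so that the target version is validated before anything is run. The function name `rollback_command` is hypothetical, and it only prints the command quoted in this comment rather than executing it. Whether pinning an older version this way is safe remains unconfirmed, as noted.

```shell
#!/bin/sh
# Hypothetical helper: prints the swupd rollback command from this thread
# for a given Clear Linux version, after checking the version is numeric.
rollback_command() {
    version="$1"
    case "$version" in
        ''|*[!0-9]*)
            echo "error: version must be a positive integer" >&2
            return 1
            ;;
    esac
    # Same command as posted above; printed, not executed.
    echo "swupd verify --fix --picky -m $version"
}
```

Calling `rollback_command 32990` prints the command so it can be reviewed and then run manually with sudo.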

lebensterben commented 4 years ago

@FanFani4 Can you post this at https://github.com/clearlinux/distribution/issues/1994 as well?

jonahbron commented 4 years ago

I can afford to wait, so I won't roll back yet. Hopefully we can use my machine to find the cause.

insilications commented 4 years ago

Could be because of gcc 10. Someone was saying the problem disappears with gcc 10 test version https://aur.archlinux.org/packages/nvidia-390xx-dkms/

insilications commented 4 years ago

After two days of trying to solve this, I've finally managed to install 440.82 with the latest kernel-native (custom built with SECTION_MISMATCH_WARN_ONLY). Perhaps NVIDIA or Intel staff will clarify what's at fault here, but there is some problematic interaction between the 5.6.15 kernel, GCC 10.1 and the nvidia 440.82 installer in 1) creating the proper DKMS build and source tree in "/var/lib/dkms/nvidia/440.82/" and 2) issuing the proper install command to the /usr/bin/dkms tool.

I used the following installation command (per https://docs.01.org/clearlinux/latest/tutorials/nvidia.html, plus "--no-cc-version-check" just as a guarantee and "--expert" instead of "--silent" for a more verbose installation):

sudo ./NVIDIA-Linux-x86_64-440.82.run \
    --utility-prefix=/opt/nvidia \
    --opengl-prefix=/opt/nvidia \
    --compat32-prefix=/opt/nvidia \
    --compat32-libdir=lib32 \
    --x-prefix=/opt/nvidia \
    --x-module-path=/opt/nvidia/lib64/xorg/modules \
    --x-library-path=/opt/nvidia/lib64 \
    --x-sysconfig-path=/etc/X11/xorg.conf.d \
    --documentation-prefix=/opt/nvidia \
    --application-profile-path=/etc/nvidia/nvidia-application-profiles-rc.d \
    --no-precompiled-interface \
    --no-distro-scripts \
    --force-libglx-indirect \
    --glvnd-egl-config-path=/etc/glvnd/egl_vendor.d \
    --egl-external-platform-config-path=/etc/egl/egl_external_platform.d \
    --dkms \
    --no-cc-version-check \
    --expert

Using the "--expert" option reveals why the installer issues "ERROR: Unable to load the 'nvidia-drm' kernel module" without any explanation at all:

-> Driver file installation is complete.
-> Installing DKMS kernel module:
-> done.
ERROR: Unable to load the 'nvidia-drm' kernel module: 'modprobe: ERROR: ctx=0x5646828152a0 path=/lib/modules/5.6.15-957.native/kernel/drivers/video/nvidia-modeset.ko error=No such file or directory
modprobe: ERROR: ctx=0x5646828152a0 path=/lib/modules/5.6.15-957.native/kernel/drivers/video/nvidia-modeset.ko error=No such file or directory
modprobe: ERROR: could not insert 'nvidia_drm': Unknown symbol in module, or unknown parameter (see dmesg)'

The DKMS kernel modules are being built, but they are never installed from the build directory into the corresponding kernel modules path. That is why modprobe cannot load them. ls /var/lib/dkms/nvidia/440.82/ reveals the following:

source/ build/

source/ is the symlink pointing to the nvidia kernel module sources, and it is correct. build/ is the symlink to the directory that should hold the resulting binaries, make.log, etc. after a successful build, but that is not happening: build/ is an empty directory.

The solution is to manually build and install the DKMS nvidia source tree. Here is the fix, after rebuilding the kernel with "SECTION_MISMATCH_WARN_ONLY". Start by installing the driver again, this time with the --silent flag instead of the --expert flag:

sudo ./NVIDIA-Linux-x86_64-440.82.run \
    --utility-prefix=/opt/nvidia \
    --opengl-prefix=/opt/nvidia \
    --compat32-prefix=/opt/nvidia \
    --compat32-libdir=lib32 \
    --x-prefix=/opt/nvidia \
    --x-module-path=/opt/nvidia/lib64/xorg/modules \
    --x-library-path=/opt/nvidia/lib64 \
    --x-sysconfig-path=/etc/X11/xorg.conf.d \
    --documentation-prefix=/opt/nvidia \
    --application-profile-path=/etc/nvidia/nvidia-application-profiles-rc.d \
    --no-precompiled-interface \
    --no-distro-scripts \
    --force-libglx-indirect \
    --glvnd-egl-config-path=/etc/glvnd/egl_vendor.d \
    --egl-external-platform-config-path=/etc/egl/egl_external_platform.d \
    --dkms \
    --no-cc-version-check \
    --silent

Go to the same /var/lib/dkms/nvidia/440.82/ directory. Enter source/, where dkms.conf and the Makefile are. Try this:

sudo dkms autoinstall

All nvidia modules will be successfully built, added and installed. ls /var/lib/dkms/nvidia/440.82/ will now correctly show a proper DKMS compilation tree:

source/ 5.6.15-957.native/
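The difference between the two directory listings above can be checked mechanically. This is a minimal sketch; the helper name `dkms_tree_state` and its three result labels are hypothetical and are not part of the dkms tool.

```shell
#!/bin/sh
# Hypothetical helper: classifies a DKMS module tree such as
# /var/lib/dkms/nvidia/440.82 for a given kernel release.
dkms_tree_state() {
    tree="$1"
    kernel="$2"
    if [ -d "$tree/$kernel" ]; then
        # A per-kernel directory exists: the build was recorded for this kernel.
        echo "built"
    elif [ -d "$tree/build" ] && [ -z "$(ls -A "$tree/build")" ]; then
        # build/ exists but is empty: the failure mode described above.
        echo "empty-build"
    else
        echo "unknown"
    fi
}
```

For example, `dkms_tree_state /var/lib/dkms/nvidia/440.82 "$(uname -r)"` would be expected to report the empty build directory before the manual `dkms autoinstall` and the per-kernel directory afterwards.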

The kernel should be custom rebuilt with SECTION_MISMATCH_WARN_ONLY (Kernel hacking -> Compile-time checks and compiler options -> Make section mismatch errors non-fatal), following the simple guide: https://docs.01.org/clearlinux/latest/guides/kernel/kernel-development.html. Remember to install the *-dev package of your newly built custom kernel as well:

rpm2cpio linux-dev-5.6.15-957.x86_64.rpm | (cd /; sudo cpio -i -d -u -v)
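Whether a given kernel was actually built with this option can be read from its config file. This is a minimal sketch; the helper name `section_mismatch_warn_only` is hypothetical, and the config location is an assumption (on Clear Linux it is believed to be /usr/lib/kernel/config-$(uname -r), so pass whichever path applies on your system).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel config file enables
# CONFIG_SECTION_MISMATCH_WARN_ONLY.
section_mismatch_warn_only() {
    config_file="$1"
    # Match the exact "=y" line; "# ... is not set" lines do not match.
    grep -qx 'CONFIG_SECTION_MISMATCH_WARN_ONLY=y' "$config_file"
}
```

A call such as `section_mismatch_warn_only "/usr/lib/kernel/config-$(uname -r)" && echo enabled` shows the intended use, under the path assumption stated above.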

jonahbron commented 4 years ago

I found @SPAstef's solution to be fairly straightforward. Its drawback is that it does not work with DKMS, so I'll have to manually update the kernel and reinstall the driver periodically.

Here's the exact diff I used on the install.sh file.

--- a/NVIDIA-Driver/install.bash
+++ b/NVIDIA-Driver/install.bash
@@ -64,7 +64,8 @@ echo -e "\e[32m The version of the driver is \e[33m""$([[ "$INSTALLER" =~ ^.*\-(
 echo "${BASH_REMATCH[1]}")\e[m"
 read -rp "Press any key to continue ... " -n1 -s
 echo
-if ! sudo sh "$INSTALLER" \
+export CONFIG_SECTION_MISMATCH_WARN_ONLY=y
+if ! sh "$INSTALLER" \
     --utility-prefix=/opt/nvidia \
     --opengl-prefix=/opt/nvidia \
     --compat32-prefix=/opt/nvidia \
@@ -81,8 +82,8 @@ if ! sudo sh "$INSTALLER" \
     --force-libglx-indirect \
     --glvnd-egl-config-path=/etc/glvnd/egl_vendor.d \
     --egl-external-platform-config-path=/etc/egl/egl_external_platform.d \
-    --dkms \
     --silent; then
   echo -e "\e[31m Installation failed! Aborting...\e[m"
   exit 1
 fi

And ran the install.bash script as root.

I definitely tried several combinations to get it working with DKMS. My guess is that DKMS is compiling the modules itself and the env var isn't making it through to that point. Hopefully CL or NVidia can fix this soon.

SPAstef commented 4 years ago

I had the same issue with DKMS: it completely ignores any environment variable. Its purpose should be to avoid reinstalling drivers after updates, but it seems that we need to reinstall them anyway, so it's not a big deal.
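How an exported variable can fail to reach a nested build is easy to reproduce in isolation. The sketch below is not DKMS's actual behavior: `env -i` merely stands in for any intermediate tool that resets the environment (sudo's default `env_reset` policy is one well-known case), and `child_sees` is a hypothetical helper name.

```shell
#!/bin/sh
# Prints the value a child process sees for CONFIG_SECTION_MISMATCH_WARN_ONLY.
# "$@" is an optional command prefix placed in front of the child shell.
child_sees() {
    "$@" /bin/sh -c 'echo "${CONFIG_SECTION_MISMATCH_WARN_ONLY:-unset}"'
}

export CONFIG_SECTION_MISMATCH_WARN_ONLY=y

# An exported variable reaches a direct child: prints "y".
child_sees
# A wrapper that clears the environment drops it: prints "unset".
child_sees env -i
# Passing the variable explicitly through the wrapper restores it: prints "y".
child_sees env -i CONFIG_SECTION_MISMATCH_WARN_ONLY=y
```

The third call is the general pattern for getting a setting past an environment-resetting wrapper: hand it over as an explicit VAR=value argument instead of relying on `export`.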

jonahbron commented 4 years ago

Someone on the NVidia forum posted a possible script solution to get it working with DKMS again.

https://forums.developer.nvidia.com/t/clear-linux-error-unable-to-load-the-nvidia-drm-kernel-module-gtx-970/124836/19?u=jonahbron.d

I'm going to try this later.

jonahbron commented 4 years ago

I think an interim script-based solution may still be possible for this. According to someone on the NVidia thread, we may be able to set the correct ENV for DKMS using this method:

https://forums.developer.nvidia.com/t/clear-linux-error-unable-to-load-the-nvidia-drm-kernel-module-gtx-970/124836/19?u=jonahbron.d

However, I ran into a roadblock because I don't know how to use the PRE_BUILD config key properly. My attempt produced no change in behavior.

enbock commented 4 years ago

Not sure whether today's update was the reason (it wasn't working over the last few days), but I have now fixed the issue with the following simple change:

index d29559d..1cc5ce3 100755
--- a/NVIDIA-Driver/install.bash
+++ b/NVIDIA-Driver/install.bash
@@ -64,7 +64,7 @@ echo -e "\e[32m The version of the driver is \e[33m""$([[ "$INSTALLER" =~ ^.*\-(
 echo "${BASH_REMATCH[1]}")\e[m"
 read -rp "Press any key to continue ... " -n1 -s
 echo
-if ! sudo sh "$INSTALLER" \
+if ! sudo CONFIG_SECTION_MISMATCH_WARN_ONLY=y "$INSTALLER" \
     --utility-prefix=/opt/nvidia \
     --opengl-prefix=/opt/nvidia \
     --compat32-prefix=/opt/nvidia \

Alternatively, running the installer once with --dkms and once without also worked.

You can check whether the module was loaded (it was also loaded when the DKMS-failed message appeared 😯) with:

$ sudo dkms status
Password: 
nvidia, 440.100, 5.7.8-968.native, x86_64: installed
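The `dkms status` output above can also be checked from a script. This is a minimal sketch that assumes the comma-separated line format shown in this comment (other dkms versions may print a different format); `dkms_reports_installed` is a hypothetical helper name. Note that `dkms status` reports the build and install state, so whether the module is currently loaded is a separate question that `lsmod | grep nvidia_drm` answers.

```shell
#!/bin/sh
# Hypothetical helper: reads `dkms status` output on stdin and succeeds if
# the given module/version/kernel triple is reported as installed.
# Assumed line format (taken from the output above):
#   nvidia, 440.100, 5.7.8-968.native, x86_64: installed
dkms_reports_installed() {
    awk -F', ' -v m="$1" -v v="$2" -v k="$3" '
        $1 == m && $2 == v && $3 == k && $4 ~ /: installed$/ { found = 1 }
        END { exit found ? 0 : 1 }'
}
```

It would be used as `sudo dkms status | dkms_reports_installed nvidia 440.100 "$(uname -r)"`.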

My laptop: Tuxedo Computers
VGA: NVIDIA Corporation GP106BM [GeForce GTX 1060 Mobile 6GB] (rev a1)
BIOS setup: DISCRETE mode (Intel graphics turned off in the BIOS)
Clear Linux OS; Build ID: 33510

jonahbron commented 4 years ago

@enbock Glad it worked for you. I tried to install again with DKMS, but now I'm getting a compiler version mismatch. Kernel was compiled with 10.1.1, but swupd has installed 10.2.1. I filed a bug with CL.

https://github.com/clearlinux/distribution/issues/2069

lebensterben commented 4 years ago

@jonahbron This is not a bug. When the new kernel is released this will be fixed automatically.

jonahbron commented 4 years ago

@lebensterben I see. Any idea when that will be? Seems bad to have windows of time in which the compiler versions don't match. That means any user might come along and simply not be able to install the NVidia drivers until the kernel gets an update.

lebensterben commented 4 years ago

This is a well-known issue. I don't think you will need to wait long. You can also disable the GCC version mismatch check with an installer option.
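To see whether a mismatch applies on a given machine, the compiler recorded in the kernel banner can be compared with the installed GCC. This is a minimal sketch that assumes /proc/version contains a lowercase "gcc" followed by an X.Y.Z version (the exact banner wording varies between kernel versions); `kernel_gcc_version` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extracts the GCC version from a /proc/version-style
# banner read on stdin. Prints nothing if no "gcc <version>" is found.
kernel_gcc_version() {
    sed -n 's/.*gcc[^0-9]*\([0-9][0-9.]*\).*/\1/p'
}
```

Comparing `kernel_gcc_version < /proc/version` with `gcc -dumpfullversion` then shows whether the two versions differ.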

jonahbron commented 4 years ago

Can confirm, after an update I was able to install with DKMS. Only needed CONFIG_SECTION_MISMATCH_WARN_ONLY set.

lebensterben commented 4 years ago

@jonahbron Alternatively, you can append "--no-cc-version-check" to the installer options.

lebensterben commented 4 years ago

Closing this for now since it could be installed correctly.

Driver version: 450.57
Kernel version: 5.7.13-975.native

enbock commented 4 years ago

FYI: if this problem has come back for anyone in the last few days, download the newest SLB (short-lived branch) driver from NVIDIA (in my case, this one: https://download.nvidia.com/XFree86/Linux-x86_64/455.28/NVIDIA-Linux-x86_64-455.28.run)

lebensterben commented 4 years ago

@enbock Thanks. I heard that NVIDIA announced that the 5.9 kernel is incompatible and recommends that users defer upgrading the kernel until a new NVIDIA driver is released. But luckily it worked fine for me.