vmatare / thinkfan

The minimalist fan control program
GNU General Public License v3.0

ERROR: Failed to load NVML driver: No such file or directory #142

Closed gdarruda closed 2 years ago

gdarruda commented 3 years ago

I'm trying to use thinkfan on Fedora, but it fails at startup:

[gdarruda@fedora ~]$ thinkfan -c /etc/thinkfan.conf.rpmsave 

ERROR: Failed to load NVML driver: No such file or directory

I installed the drivers and nvidia-smi seems fine:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   32C    P5    15W /  75W |    860MiB /  3908MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1822      G   /usr/libexec/Xorg                 344MiB |
|    0   N/A  N/A      1959      G   /usr/bin/gnome-shell              319MiB |
|    0   N/A  N/A      2966      G   /usr/lib64/firefox/firefox        192MiB |
+-----------------------------------------------------------------------------+

My conf file is this:

sensors:
  - nvml: 01:00.0

fans:
  - hwmon: /sys/devices/platform/nct6775.656/hwmon/hwmon4/pwm1

levels:
  - [0, 0, 30]
  - [100, 30, 50]
  - [200, 50, 60]   
  - [255, 60, 100]

On Pop!_OS, I built thinkfan from source and it worked fine. I don't know whether I missed some NVIDIA package, since Pop!_OS already comes with the NVIDIA drivers. Sorry if this isn't a thinkfan problem, but I don't know how to troubleshoot it.

vmatare commented 3 years ago

Sorry, that error message could do with some improvement. What's actually failing there is a dlopen("libnvidia-ml.so", RTLD_LAZY). So the dynamic loader can't find your libnvidia-ml.so anywhere in its search paths. There are usually two ways the search paths are determined:

  1. The LD_LIBRARY_PATH environment variable.
  2. The config files /etc/ld.so.conf and everything in /etc/ld.so.conf.d.

So first, check the package contents of your nvidia driver for the location of libnvidia-ml.so, and then make sure its path is found somewhere in the ld.so search paths. Typically, your driver package should put some file in /etc/ld.so.conf.d. If it's there, make sure the cache is updated by running ldconfig.
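As a quick way to test the search-path question without running thinkfan, here is a minimal sketch in Python. It relies on `ctypes.CDLL`, which goes through `dlopen()` on Linux and therefore consults the same loader search paths. The helper name `loader_can_find` is made up for this illustration and is not part of thinkfan.

```python
import ctypes


def loader_can_find(library_name: str) -> bool:
    """Return True if the dynamic loader can open library_name.

    Hypothetical helper for illustration only; not part of thinkfan.
    """
    try:
        # On Linux, ctypes.CDLL wraps dlopen() and raises OSError on failure.
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Compare the unversioned name thinkfan asks for with the versioned one.
    for name in ("libnvidia-ml.so", "libnvidia-ml.so.1"):
        status = "found" if loader_can_find(name) else "NOT found"
        print(f"{name}: {status}")
```

If the versioned name is found but the unversioned one is not, the library itself is installed and only the unversioned symlink is missing from the search paths.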

gdarruda commented 3 years ago

Thanks, the libraries are present, but the symlinks were missing. I created one myself for libnvidia-ml.so: sudo ln -s libnvidia-ml.so.1 libnvidia-ml.so. It's working now!

When the driver is installed from the NVIDIA site, these links are created as expected. The problem only occurs when the driver is installed from RPM Fusion (a repository of non-free software for Fedora).

Is this a bug in RPM Fusion, or is there some reason not to create these links? If it's a bug, I will try to report it.

Thanks for this project, it's really useful for me.

vmatare commented 3 years ago

> Is this a bug in RPM Fusion, or is there some reason not to create these links? If it's a bug, I will try to report it.

I'm really not sure right now whether there is some standard or convention that says that (or when) unversioned symlinks have to be present. My hunch is that it depends on what most other consumers of that library look for. If everything else can find the lib without that symlink, then maybe thinkfan needs to try both names and not rely on an unversioned symlink.

vmatare commented 2 years ago

Turns out that the convention appears to be that only development packages install unversioned symlinks. Since thinkfan loads libnvidia-ml.so with dlopen at runtime rather than linking against it at build time, we cannot expect an unversioned symlink to be present.
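The "try both names" idea mentioned earlier in the thread could look roughly like the following Python sketch. It is only an illustration of the fallback pattern, not thinkfan's actual fix (thinkfan is not written in Python). The function name `load_first_available` and its `loader` parameter are assumptions introduced here so the logic can be exercised without an NVIDIA driver installed.

```python
import ctypes
from typing import Callable, Iterable, Optional, Tuple


def load_first_available(
    names: Iterable[str],
    loader: Callable[[str], object] = ctypes.CDLL,
) -> Optional[Tuple[str, object]]:
    """Try each library name in order and return (name, handle) for the first
    one that loads, or None if none of them can be loaded.

    Hypothetical helper for illustration; the default loader goes through
    dlopen() on Linux.
    """
    for name in names:
        try:
            return name, loader(name)
        except OSError:
            # This name is not in the loader's search paths; try the next one.
            continue
    return None


if __name__ == "__main__":
    # The versioned name ships with the runtime driver package; the
    # unversioned one is usually only installed by development packages.
    result = load_first_available(["libnvidia-ml.so.1", "libnvidia-ml.so"])
    print("loaded:", result[0] if result else "nothing")
```

Trying the versioned name first means the lookup succeeds on systems that only have the runtime driver package, while the unversioned name remains as a fallback.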