vmatare / thinkfan

The minimalist fan control program
GNU General Public License v3.0
541 stars 62 forks source link

For an nvme device, thinkfan errors about `device` not having sensors, despite `hwmon*` directories being present #157

Closed griwes closed 2 years ago

griwes commented 2 years ago

I have a system with two nvme drives, their temp_input paths are as follows:

/sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/hwmon0/temp1_input
/sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/hwmon0/temp2_input
/sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1/hwmon1/temp1_input
/sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1/hwmon1/temp2_input

but the indexing of the hwmon changes on boot in ways that I do not feel like finding about. I was hoping that the following config would catch these:

- hwmon: /sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0
  incides: [1,2]

- hwmon: /sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1
  indices: [1,2]

but it doesn't work:

Oct 10 14:57:51 <hostname> thinkfan[2336483]: ERROR: Could not find an `hwmon*' directory or `temp*_input' file in /sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1/device.

It appears that the hwmon logic searches for directories that start with both hwmon and device, but this is clearly incorrect in this case. For reference, the contents of /sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1 are as follows:

-r--r--r--  1 root root 4096 Sep 30 23:13 address
-r--r--r--  1 root root 4096 Sep 30 23:13 cntlid
-r--r--r--  1 root root 4096 Sep 30 23:13 dev
lrwxrwxrwx  1 root root    0 Sep 30 18:51 device -> ../../../0000:55:00.0
-r--r--r--  1 root root 4096 Sep 30 18:51 firmware_rev
drwxr-xr-x  3 root root    0 Sep 30 18:51 hwmon1
-r--r--r--  1 root root 4096 Sep 30 23:13 kato
-r--r--r--  1 root root 4096 Sep 30 18:51 model
drwxr-xr-x  3 root root    0 Sep 30 18:51 ng1n1
-r--r--r--  1 root root 4096 Sep 30 23:13 numa_node
drwxr-xr-x 11 root root    0 Sep 30 18:51 nvme1n1
drwxr-xr-x  2 root root    0 Sep 30 23:13 power
-r--r--r--  1 root root 4096 Sep 30 23:13 queue_count
--w-------  1 root root 4096 Sep 30 23:13 rescan_controller
--w-------  1 root root 4096 Sep 30 23:13 reset_controller
-r--r--r--  1 root root 4096 Sep 30 18:51 serial
-r--r--r--  1 root root 4096 Sep 30 23:13 sqsize
-r--r--r--  1 root root 4096 Sep 30 23:13 state
-r--r--r--  1 root root 4096 Sep 30 23:13 subsysnqn
lrwxrwxrwx  1 root root    0 Sep 30 18:51 subsystem -> ../../../../../../class/nvme
-r--r--r--  1 root root 4096 Sep 30 23:13 transport
-rw-r--r--  1 root root 4096 Sep 30 18:51 uevent

Note hwmon1 is there, but the config parser decides to error out because device is not a hwmon directory, which is a faulty logic somewhere in the config parser.

I can't use a workaround of setting hwmon-name to nvme either, similarly to #156 , because there's symlinks identical to device within ng1n1 and nvme1n1, which makes it error out (despite all of the paths it founds being the same file!).

For the first problem, because I don't understand why thinkfan looks for device, I'd be happy with being able to explicitly tell it to not look at device at all in the config file; I believe that that would solve my original issue.

For the second problem, may I suggest that when you compile a list of hwmons, but before you check for the name being the same in multiple ones, you canonicalize the paths to remove any symlinks in the path elements, and then filter the list so that every canonical path is on the list only once? (This would be separate from #156, but even with the fix for the other issue, I think it'd still be beneficial, because on every poll, thinkfan would have to read less files, by not reading the same file multiple times.)

griwes commented 2 years ago

I have also tried a workaround of using optional and enumerating both those hwmon ids I've seen for the nvme devices on boot for both, but it appears that that doesn't stop thinkfan from erroring out when the directory itself doesn't exist.

So for now I'm working around this with explicitly listing sensors for /sys/class/hwmon/hwmon0 and /sys/class/hwmon/hwmon1, since empirically those two have always been the nvme hwmons (as far as I can tell from looking at this a number of times), but this is extremely fragile and I would rather not need to depend on the kernel not changing its driver assignments at any time for this to work :)

vmatare commented 2 years ago

Huh. Sorry, just got confused there because I overlooked some details. From looking at your directory listing, the way you configured it should actually work. I'll have to take another good look at the search logic.

DanielWeigl commented 2 years ago

I have the same problem since updating to Ubuntu 21.10 (it worked on previous version)

  - hwmon: /sys/class/block/nvme0n1
    indices: [1,2,3]
    optional: true

leads to ERROR: Could not find an 'hwmon*' directory or 'temp*_input' file in /sys/class/block/nvme0n1/device/device.

Which is a bit confusing, because there is one:

find /sys/class/block/nvme0n1/device/ | grep hwmon
/sys/class/block/nvme0n1/device/hwmon4
/sys/class/block/nvme0n1/device/hwmon4/temp2_min
/sys/class/block/nvme0n1/device/hwmon4/temp2_max
/sys/class/block/nvme0n1/device/hwmon4/temp3_input
...

Also, specifying a name:

  - hwmon: /sys/class/block/nvme0n1
    name: nvme
    indices: [1,2,3]
    optional: true

leads to

ERROR: /etc/thinkfan.yaml:57:
    name: nvme
          ^
Found multiple hwmons with this name:  /sys/class/block/nvme0n1/device/nvme0n1/device/hwmon4 /sys/class/block/nvme0n1/device/hwmon4 /sys/class/block/nvme0n1/device/ng0n1/device/hwmon4.

This works:

  - hwmon: /sys/class/block/nvme0n1/device/hwmon4
    indices: [1,2,3]
    optional: true

But the index for hwmon4 is not stable.

uname -a
Linux t15 5.13.0-20-generic #20-Ubuntu SMP Fri Oct 15 14:21:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

thinkfan       version 1.2.1-3  
vmatare commented 2 years ago

@DanielWeigl if you want to help you can try out the lm-sensors branch, which adds support for libsensors. Then you can specify sensors like this this:

  - chip: CHIP-NAME
    ids:
      - FEATURE-ID0
      - FEATURE-ID1
      - ...

You can find CHIP-NAME and FEATURE-IDs by running sensors. E.g. for my NVME SSD I get:

nvme-pci-2500
Adapter: PCI adapter
Composite:    +35.9°C  (low  = -273.1°C, high = +80.8°C)
                       (crit = +80.8°C)
Sensor 1:     +35.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)

So that would yield the config:

  - chip: nvme-pci-2500
    ids:
      - "Sensor 1"

So far libsensors support is looking good, but we've been observing an issue where it breaks down after resuming from suspend. Would be interesting to know whether you're also seeing this issue or not. On my system it has been working fine for days across multiple suspend/resume cycles as of now.

koutheir commented 2 years ago

So far libsensors support is looking good, but we've been observing an issue where it breaks down after resuming from suspend. Would be interesting to know whether you're also seeing this issue or not.

@DanielWeigl, the commit fe9d2e8 in the lm-sensors branch introduced retries for some amount of time when reading sensors fails. This fixes the issue for me by retrying 5 to 6 times after resuming from suspend. If you want to test that branch without this fault-tolerant behavior, then you'll need to fetch and build the commit 05f1636 from the lm-sensors branch.

You can also test the fault-tolerant build (by building lm-sensors's head commit) and monitor the thinkfan logs for messages like this:

Retrying reading LM sensor 'CPU' of chip 'thinkpad-isa-0000': Can't read (attempt 6)
DanielWeigl commented 2 years ago

Hi - thank you for the heads up and for the new feature - looks good.

Im now running the latest lm-sensors branch with this settings:

  - chip: nvme-pci-5500
    ids:
     - "Sensor 1"
     - "Sensor 2"

  - chip: nvme-pci-0200
    ids:
     - "Sensor 1"
     - "Sensor 2"

so far it works, ill install it as default and will report back

DanielWeigl commented 2 years ago

just fyi, i have it now running with above version/config for a few days and some standbys and reboots - sofar it works and there are no errors logged to journalctl -u thinkfan.service

Only these messages:

Nov 10 17:12:59 d-t15 thinkfan[1435]: Temperatures(bias): 48(0), 45(0), 46(0), 46(0), 45(0), 35(0), 33(0), 34(0), 30(0), 43(0), 45(0) -> Fans: level 1
Nov 10 17:15:24 d-t15 thinkfan[1435]: Going to sleep: Will allow sensor read errors for the next 4 loops.
Nov 11 09:11:04 d-t15 thinkfan[1435]: Received SIGUSR2: Re-initializing fan control.
Nov 11 09:11:04 d-t15 thinkfan[1435]: Temperatures(bias): 53(0), 46(0), 31(0), 53(0), 33(0), 24(0), 23(0), 24(0), 23(0), 23(0), 23(0) -> Fans: level 1
vmatare commented 2 years ago

I'm fairly certain that the new lm-sensors-based config syntax is the right solution to this problem. I've been thinking about supporting this case this in the sysfs/hwmon path-based config as well, but I've come to the conclusion that resolving the ambiguity there is not worth the effort. I like the lm-sensors interface (i.e. with the chip: keyword) much better since it eliminates other problems, as well.