vmatare / thinkfan

The minimalist fan control program
GNU General Public License v3.0

Freeze on Thinkpad P14s Gen3 AMD Machine Type 21J5 #228

Closed simonsystem closed 1 month ago

simonsystem commented 1 year ago

Hi, I'm having trouble with my Thinkpad P14s Gen3 AMD, Machine Type 21J5. Every time I start thinkfan, the system freezes after a random amount of time. There are no logs; it freezes immediately, and the screen does not turn black.

I already tried:

This is my thinkfan.conf:

sensors:
  - hwmon: /sys/class/hwmon
    name: thinkpad
    indices: [1, 3, 4, 5, 6, 7]

  - hwmon: /sys/class/hwmon
    name: thinkpad
    indices: [8]
    optional: true

  - hwmon: /sys/class/hwmon
    name: nvme
    indices: [1]

  - hwmon: /sys/class/hwmon
    name: acpitz
    indices: [1]

fans:
  - tpacpi: /proc/acpi/ibm/fan

levels:
 - [0, 0, 55]
 - [1, 50, 60]
 - [2, 55, 65]
 - [3, 60, 70]
 - [4, 65, 75]
 - [5, 70, 80]
 - [7, 75, 85]
 - ["level disengaged", 80, 255]
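For readers unfamiliar with the `levels` table, here is a rough sketch of how overlapping `[level, lower, upper]` limits give hysteresis. It is a simplified model, not thinkfan's actual code: the stepping rule (step up while the temperature is at or above the upper limit, step down while it is below the lower limit) and the helper names `next_index` and `replay` are assumptions made for illustration.

```python
# Levels copied from the config above: (fan level, lower limit, upper limit).
LEVELS = [
    (0, 0, 55),
    (1, 50, 60),
    (2, 55, 65),
    (3, 60, 70),
    (4, 65, 75),
    (5, 70, 80),
    (7, 75, 85),
    ("level disengaged", 80, 255),
]


def next_index(index, temp):
    """Assumed stepping rule: move up while temp >= upper limit,
    then move down while temp < lower limit."""
    while index < len(LEVELS) - 1 and temp >= LEVELS[index][2]:
        index += 1
    while index > 0 and temp < LEVELS[index][1]:
        index -= 1
    return index


def replay(start_index, temps):
    """Feed a series of temperatures through the rule and collect the levels."""
    chosen = []
    index = start_index
    for temp in temps:
        index = next_index(index, temp)
        chosen.append(LEVELS[index][0])
    return chosen


if __name__ == "__main__":
    # Starting from level 7 (index 6), replay a falling temperature series.
    print(replay(6, [74, 60, 56, 54, 49]))
```

Because neighbouring ranges overlap by 5 degrees, a temperature hovering around one limit does not make the fan flip back and forth between two levels.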

This is my journal for thinkfan systemd service:

-- Boot 40cba4c2651649b4a54e90663138bc5e --
Mai 30 10:43:38 copper systemd[1]: Starting simple and lightweight fan control program...
Mai 30 10:43:38 copper thinkfan[898]: Daemon PID: 899
Mai 30 10:43:38 copper systemd[1]: Started simple and lightweight fan control program.
Mai 30 10:43:38 copper thinkfan[899]: Temperatures(bias): 77(0) -> Fans: level 7
Mai 30 10:43:45 copper thinkfan[899]: Temperatures(bias): 74(0) -> Fans: level 5
Mai 30 10:43:55 copper thinkfan[899]: Temperatures(bias): 60(0) -> Fans: level 3
Mai 30 10:44:05 copper thinkfan[899]: Temperatures(bias): 56(0) -> Fans: level 2
Mai 30 10:44:10 copper thinkfan[899]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 10:46:05 copper thinkfan[899]: Temperatures(bias): 49(0) -> Fans: level 0
-- Boot fd3f2ec8214340a0922ff31e68d09722 --
Mai 30 11:55:58 copper systemd[1]: Starting simple and lightweight fan control program...
Mai 30 11:55:58 copper thinkfan[651]: Daemon PID: 654
Mai 30 11:55:58 copper thinkfan[654]: Temperatures(bias): 86(0) -> Fans: level 127
Mai 30 11:55:58 copper systemd[1]: Started simple and lightweight fan control program.
Mai 30 11:56:14 copper thinkfan[654]: Temperatures(bias): 73(0) -> Fans: level 5
Mai 30 11:56:26 copper thinkfan[654]: Temperatures(bias): 68(0) -> Fans: level 4
Mai 30 11:56:48 copper thinkfan[654]: Temperatures(bias): 64(0) -> Fans: level 3
Mai 30 11:57:13 copper thinkfan[654]: Temperatures(bias): 58(0) -> Fans: level 2
Mai 30 11:58:50 copper thinkfan[654]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 11:59:27 copper thinkfan[654]: Temperatures(bias): 67(0) -> Fans: level 3
Mai 30 12:00:21 copper thinkfan[654]: Temperatures(bias): 59(0) -> Fans: level 2
Mai 30 12:00:46 copper thinkfan[654]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 12:02:11 copper thinkfan[654]: Temperatures(bias): 49(0) -> Fans: level 0
-- Boot 400f452fc3fd4221a1821cb9ed5fea3e --
Mai 30 13:11:10 copper systemd[1]: Starting simple and lightweight fan control program...
Mai 30 13:11:10 copper thinkfan[633]: Daemon PID: 635
Mai 30 13:11:10 copper systemd[1]: Started simple and lightweight fan control program.
Mai 30 13:11:10 copper thinkfan[635]: Temperatures(bias): 86(0) -> Fans: level 127
Mai 30 13:11:26 copper thinkfan[635]: Temperatures(bias): 75(0) -> Fans: level 7
Mai 30 13:11:36 copper thinkfan[635]: Temperatures(bias): 69(0) -> Fans: level 4
Mai 30 13:11:51 copper thinkfan[635]: Temperatures(bias): 63(0) -> Fans: level 3
Mai 30 13:12:13 copper thinkfan[635]: Temperatures(bias): 71(0) -> Fans: level 4
Mai 30 13:12:20 copper thinkfan[635]: Temperatures(bias): 63(0) -> Fans: level 3
Mai 30 13:12:50 copper thinkfan[635]: Temperatures(bias): 59(0) -> Fans: level 2
Mai 30 13:13:57 copper thinkfan[635]: Temperatures(bias): 74(0) -> Fans: level 4
Mai 30 13:13:59 copper thinkfan[635]: Temperatures(bias): 81(0) -> Fans: level 7
Mai 30 13:14:09 copper thinkfan[635]: Temperatures(bias): 66(0) -> Fans: level 4
Mai 30 13:14:21 copper thinkfan[635]: Temperatures(bias): 61(0) -> Fans: level 3
Mai 30 13:14:31 copper thinkfan[635]: Temperatures(bias): 59(0) -> Fans: level 2
Mai 30 13:15:45 copper thinkfan[635]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 13:20:45 copper thinkfan[635]: Temperatures(bias): 49(0) -> Fans: level 0

As you can see, thinkfan controls the fan level for a few minutes, but then the system freezes. Without thinkfan or zcfan running, the system works properly and does not freeze, but I have to live with the annoying noise of my fan.

My system:

Edit: Link to Kernel.org Bugzilla issue: https://bugzilla.kernel.org/show_bug.cgi?id=217548 Link to Lenovo Forums topic: https://forums.lenovo.com/t5/ThinkPad-T400-T500-and-newer-T-series-Laptops/ThinkPad-T14-Gen-3-21CF-kernel-freezes-when-controlling-fans-on-Linux/m-p/5252479

vmatare commented 1 year ago

Hi @simonsystem, you're writing "system freeze", so by that you mean that the entire system freezes? Or is it just thinkfan that freezes (i.e. stops doing anything)?

If the entire system locks up then there's probably not much thinkfan can do about it because that would be an issue with your kernel and/or drivers. You might try disabling individual sensors to find out which sensor (or fan) is triggering the freeze.

If it's just thinkfan that freezes, you could get more information with strace:

sudo strace -p `pgrep thinkfan`

And post the output here.

top-on commented 1 year ago

@vmatare , I appear to have the same problem as @simonsystem . In my case, the whole system freezes. Are there any useful diagnostics to pull, in this case?

simonsystem commented 1 year ago

@vmatare No, it's the whole system that freezes, without anything being logged to dmesg or similar. I think it's a thinkpad_acpi-related issue. I will create an issue there and link it to this one.

@top-on Could you post your system specs here as well? Is it also a Thinkpad P14s Gen3 machine?

top-on commented 1 year ago

This is my system, which also freezes after a random time when running thinkfan:

Maybe noteworthy: I am observing the same freezing behavior when running fancontrol.service or CoolerControl.

@simonsystem , thank you for creating and linking that issue!

simonsystem commented 1 year ago

Added a link to a freshly created Kernel.org Bugzilla issue at: https://bugzilla.kernel.org/show_bug.cgi?id=217548

@top-on Thanks for your system specs. I hope we can help get this issue fixed.

vmatare commented 1 year ago

That sounds very inconvenient. Have any of you tried to find out how badly the system is frozen? Sometimes (though mostly with display-related problems) only the graphical UI (X, Wayland etc.) freezes, while the Linux text consoles continue to work. In that case you can still use Ctrl-Alt-F1 through Ctrl-Alt-F6 to pull up one of the text consoles, log in there and check the kernel log with dmesg.

Another important test is whether the NumLock LED will still switch on & off. If it doesn't, that means your entire kernel is frozen and there's truly nothing left to do except hard reset.

top-on commented 1 year ago

@vmatare , i can confirm that the system fully freezes in these cases: switching to a text console with Ctrl-Alt-F6 is not possible when frozen. because my keyboard has no NumLock key, i currently cannot check the LED.

i have tested thinkfan also with the new BIOS version for the laptop model: 0.1.28. the other system parameters remained as above. unfortunately, the system also freezes with this new BIOS version.

just for a cross-reference that might be useful, i currently see greater system stability with the coolero flatpak and the latest BIOS, which however also froze at some point with the previous BIOS version. i will run coolero now for a few weeks with the latest BIOS, to see if that is more stable than before.

top-on commented 1 year ago

I have to report that the coolero also (fully) freezes my system with the above-mentioned parameters. It freezes somewhat later than with thinkfan, though :thinking:

I will re-run the tests whenever a new kernel is shipped to Pop!_OS or a new BIOS is released.

PiotrTD5 commented 1 year ago

Thinkfan was causing freezes, so I was searching for another way to replace the dumb stock fan control (pulsing, delayed reaction to temperature rise). I would like to report that using pwmconfig from lm-sensors also causes freezes. After a freeze, changing the keyboard backlight still works (I don't know if that's helpful).

simonsystem commented 1 year ago
  • Laptop: Thinkpad T14 Gen3 AMD (21CF)

@PiotrTD5 This ticket only concerns P14s Gen3 AMD models. Even though your BIOS has the same version number, I cannot confirm that we are talking about the same issue. I want to avoid turning this ticket into a general thinkpad-freeze issue. Please open another ticket for your laptop model and reference this ticket in it.

Edit: @PiotrTD5 You are right, my mistake. I now also think that yours is the same issue.

PiotrTD5 commented 1 year ago

I just wanted to help. The only difference between the P14s Gen3 AMD and the T14 Gen3 AMD is the model name on the LCD bezel and the stickers.

They share same BIOS/EC firmware. From official Lenovo BIOS update readme: Support models:

Also, if you study pcsupport.lenovo.com, parts category, you'll find out that 21J5 and 21CF share the same FRU numbers for motherboards. I don't know about T16 vs P16s and I don't have time to check.

So IMHO, you should add the T14 Gen3 AMD model to this issue instead of creating another one. I don't know why you strictly want it to be a P14s Gen3 issue when technically it's the same hardware and firmware. I have zero experience with GitHub, so I'll do what you ask if I am really wrong about this.

p345123 commented 1 year ago

the same happens on my ThinkPad P16s Gen 1: total system freeze some time after thinkfan starts

Lillecarl commented 1 year ago

I've got a T14 G3 AMD with the same issue of kernel freezing after awhile of usage.

However, with the experimental=1 and fan_control=1 modprobe params I can still echo levels, timeout, enable, disable and disengage into /proc/acpi/ibm/fan without the kernel freezing on me.

Lillecarl commented 1 year ago

I wrote my own shitty Python script as a thinkfan "replacement" and noticed that this happens when we write levels frequently to the fan control file. I built the script so that it checks the current level and compares with what I'd like to set and it seems to be rather "stable" for me now.

https://gist.github.com/Lillecarl/15b683c3cd3bafe74ca3c4dafd427d2e This is the script i used for my testing, keeps my laptop silent for the most part but will ramp the fan all the way up to full-speed (not sure if that's dangerous for the fan or not) if temperatures are high

EDIT: Further testing indicates I was just lucky in the beginning. After realizing i have to write to the fan control file every 110 seconds (after setting watchdog to 120) I started experiencing random lockups again. (Only writes reset the watchdog timeout, which I think is a good idea to keep active if fan control software crashes).
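The approach described in this comment (compare the current level before writing, but still rewrite before the watchdog expires) can be sketched roughly as follows. This is not the linked gist: the `FanWriter` class, the `read_level` helper and the `refresh_after` parameter are hypothetical names, and the `level:` line format is assumed from what `/proc/acpi/ibm/fan` usually reports. As the EDIT above notes, this did not stop the lockups on affected firmware.

```python
import time

FAN_PATH = "/proc/acpi/ibm/fan"  # thinkpad_acpi fan interface named in this thread


def read_level(path=FAN_PATH):
    """Return the value of the 'level:' line (e.g. 'auto' or '3'), or None."""
    with open(path) as f:
        for line in f:
            if line.startswith("level:"):
                return line.split(":", 1)[1].strip()
    return None


class FanWriter:
    """Write a fan level only when it changed or a watchdog refresh is due."""

    def __init__(self, path=FAN_PATH, refresh_after=110.0, clock=time.monotonic):
        self.path = path
        self.refresh_after = refresh_after  # seconds; 110 is the value from the comment
        self.clock = clock
        self.last_write = None

    def set_level(self, level):
        """Return True if a write happened, False if it was skipped."""
        now = self.clock()
        due = self.last_write is None or now - self.last_write >= self.refresh_after
        if not due and read_level(self.path) == str(level):
            return False  # same level and the watchdog is not close to expiring
        with open(self.path, "w") as f:
            f.write(f"level {level}\n")
        self.last_write = now
        return True
```

A control loop would call `FanWriter().set_level(3)` once per measurement; most calls then return without touching the file. Writing requires root and the fan_control=1 module parameter mentioned earlier in the thread.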

Lillecarl commented 1 year ago

https://forums.lenovo.com/t5/ThinkPad-T400-T500-and-newer-T-series-Laptops/ThinkPad-T14-Gen-3-21CF-kernel-freezes-when-controlling-fans-on-Linux/m-p/5252479 Reported to Lenovo forums too

top-on commented 1 year ago

@Lillecarl , i really liked your idea of boiling down fan control to "read temperature" and "reduce fan speed for X seconds". i tested a simplified version of your script, but it also completely freezes my machine after some time. it was worth a shot, though :slightly_smiling_face:

simonsystem commented 1 year ago

BTW: As a workaround, I switched my notebook to "Cool 'n' Quiet" mode in the BIOS and completely disabled thinkfan. I think I lost some performance, but it's not as loud as before. It's not a real solution, of course.

@all: Thanks for all your suggestions and assistance in analyzing this issue. @PiotrTD5: Sorry that I didn't realize your issue really is the same thing. @Lillecarl: Special thanks for your scripting tests. Good idea, but unfortunately... nah.

Lillecarl commented 1 year ago

@simonsystem I've been able to control my fans reliably by always stepping through level 1 before level 0.

image

That's 3 hours, controlling the fans with software all the time.

Please ignore the steep stepping up and down, my control software isn't as polished as thinkfan, although I've got some nice ideas involving reading CPU Package Power from the MSR and use that to step the fans based on actual heat dissipation needs like https://github.com/hirschmann/nbfc does for Windows

EDIT: false...... further natural testing by stressing the cpu every 30-60 seconds got another hang. On the bright side, after switching randomly between levels 1-7 I've discovered that it's going to 0 that freezes the system, no other levels https://prints.lillecarl.com/20231012-225047_lldegbbcjk.png

ishfx commented 1 year ago

I've got exactly the same issue with my P14s Gen3 AMD. For now I have completely disabled thinkfan (otherwise I had a freeze every few minutes; it looks like a kernel panic because the system does not respond to REISUB).

a-rasinski commented 1 year ago

@Lillecarl Sir, you're a lifesaver! I've been pulling my hair out over random hard freezes as mentioned above, and it took me some time to pin this issue on fan control. I can confirm that not using level 0 avoids the freezes on my machine.

EVODelavega commented 1 year ago

Didn't use this utility myself, but found out about it today because someone pointed me specifically to this issue. I'll have to take a closer look, but the issue, as others have noticed here, too, seems to be related to the fan speed levels. As per CMake, they can either be numeric values in the 0-7, or 0-255 ranges (https://github.com/vmatare/thinkfan/blob/master/src/thinkfan.conf.5.cmake#L439). The 0-7 range may not be handled properly when adding the fan speed levels here: https://github.com/vmatare/thinkfan/blob/master/src/config.cpp#L106

The config shown here sets the disengaged level as the last level to be added, which at first glance should map to std::numeric_limits<int>::min();.

I'm not going to speculate any further as to how that might contribute to this bug without cloning the repo and going through the code itself, but that would be where I'd look, so thought I'd mention it here.

Some shameless self-promotion: I found out about this because I hacked together a small utility to manage fan speeds on my old thinkpad (GTK+3, old school C). It's nowhere near as feature complete as this tool, but maybe some of you here can use it until this bug gets fixed: https://github.com/EVODelavega/fan_control

vmatare commented 1 year ago

Guys, this is clearly a kernel bug (or most probably in the thinkpad_acpi kernel module). You need to check the kernel.org bugtracker and potentially report it there.

Lillecarl commented 1 year ago
/sys/class/power_supply/BAT0/hwmon0/subsystem/hwmon1/pwm1

Setting values there to (255/7)*level doesn't lock up my machine.
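As a rough sketch of the `(255/7)*level` mapping mentioned above: the rounding choice and the helper names `pwm_for_level` and `write_pwm` are assumptions, since the comment does not say how fractional values were handled. The hwmon interface may also need `pwm1_enable` set to manual mode first; that is likewise an assumption not confirmed in this thread.

```python
# Path quoted in the comment above; the hwmon numbering can differ between machines.
PWM_PATH = "/sys/class/power_supply/BAT0/hwmon0/subsystem/hwmon1/pwm1"


def pwm_for_level(level):
    """Map a thinkpad fan level (0-7) onto the 0-255 PWM range."""
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return round(255 * level / 7)


def write_pwm(level, path=PWM_PATH):
    """Write the scaled PWM value for the given level (needs root)."""
    with open(path, "w") as f:
        f.write(str(pwm_for_level(level)))


if __name__ == "__main__":
    for level in range(8):
        print(level, pwm_for_level(level))
```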

Lillecarl commented 10 months ago

https://download.lenovo.com/pccbbs/mobiles/r23uj73wd.html - (New) Change to permit fan rotation after fan error happen.

simonsystem commented 10 months ago

@Lillecarl Did you try it? Does it solve our issue? Sounds promising!

Lillecarl commented 10 months ago

@simonsystem Yep, it's finally working! The EC fancontrol is also quite decent, so I rewrote my fancontrol script to turn fans off if average temp is below 60 for 30 seconds, and turn to auto if average temperature is above 60 for 30 seconds or above 70 for one measurement. https://github.com/Lillecarl/nixos/blob/master/scripts/fancontrol2.py It can be simplified further but it's got legacy from previous attempts at things 😄
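The switching rule described above (fan off after 30 seconds below 60, back to auto after 30 seconds above 60 or immediately above 70) can be sketched like this. It is not the linked fancontrol2.py: the `FanPolicy` class and its field names are hypothetical, and how a reading of exactly 60 is treated is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FanPolicy:
    low: float = 60.0    # average-temperature threshold from the comment
    high: float = 70.0   # a single reading above this forces "auto"
    hold: float = 30.0   # seconds a condition must persist before switching
    mode: str = "auto"   # current fan level: "auto" or "0"
    _since: Optional[float] = None  # when the pending switch condition started

    def update(self, now: float, avg_temp: float) -> str:
        """Feed one measurement and return the level that should be active."""
        if avg_temp > self.high:
            # Hot reading: hand control back to the EC right away.
            self.mode, self._since = "auto", None
            return self.mode
        wanted = "0" if avg_temp < self.low else "auto"
        if wanted == self.mode:
            self._since = None       # nothing pending
        elif self._since is None:
            self._since = now        # start timing the pending switch
        elif now - self._since >= self.hold:
            self.mode, self._since = wanted, None
        return self.mode
```

The returned string would then be written as `level 0` or `level auto` to `/proc/acpi/ibm/fan`. Going by the rest of this thread, level 0 is only safe on firmware that contains the fix.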

I reckon we can close this? If the new UEFI and EC is out for your model too 😄

simonsystem commented 10 months ago

At least for P14s Gen3 (21J5), this BIOS version isn't available anymore. https://pcsupport.lenovo.com/us/en/products/laptops-and-netbooks/thinkpad-p-series-laptops/thinkpad-p14s-gen-3-type-21j5-21j6/downloads/ds557681-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-3-type-21cf-21cg-t16-gen-1-type-21ch-21cj-p16s-gen-1type-21ck-21cl?category=BIOS%2FUEFI

This BIOS version R23UJ73W is reported Lenovo cloud not working issue, hence it has been withdrawn from support site.

I downloaded it, once it was available. The fan issue was gone, I could set my fan to 0 without freezes.

But I got standby issues. The system now freezes when coming back from deep standby after sleeping for an hour or so. Unfortunately, there is no BIOS option for changing the standby mode, so I cannot try other modes. I think it's fixed to "Modern Standby", which is maybe not well supported by Linux. I'm not an expert in these hardware things. (https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate)

@Lillecarl So, nah, BIOS version 1.49 (R23UJ73WD) has been withdrawn. So this isn't closed yet, is it? How about your model, is that BIOS version still available?

Lillecarl commented 10 months ago

@simonsystem It's withdrawn for T14 G3 as well. Meme company. I'm using s2idle; on the AMD system it draws just 30% per 2 days or so, which is good enough for me.

@lillecarl:matrix.org if you wanna keep discussing, this is already miles offtopic from thinkfan 😄

a-rasinski commented 10 months ago

It's withdrawn for T14 G3 as well. Meme company.

@Lillecarl Sorry for adding my 2 cents to the off-topic discussion, but this is mildly infuriating, as it's the second withdrawn BIOS version in a row that I've updated to. The previously withdrawn one could brick the device; I hope this one won't. Meme company indeed.

KiitoX commented 10 months ago

From that Lenovo thread it seems like a proper fix might take another while. In the meantime, another possible workaround is using "level auto" instead of speed 0 for the idle fan speed setting. This does turn off the fan for sufficiently low temperatures, though I have not found the exact boundary yet.
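A small sketch of what this workaround looks like applied to the `levels` table from the first post. The helper name `use_auto_for_idle` is made up for illustration, and it assumes thinkfan accepts the string "level auto" in a `levels` entry the same way the original config uses "level disengaged".

```python
import json

# Levels from the config in the first post of this thread.
LEVELS = [
    [0, 0, 55],
    [1, 50, 60],
    [2, 55, 65],
    [3, 60, 70],
    [4, 65, 75],
    [5, 70, 80],
    [7, 75, 85],
    ["level disengaged", 80, 255],
]


def use_auto_for_idle(levels, idle_level=0, replacement="level auto"):
    """Return a copy of levels with the idle entry's level swapped out."""
    return [
        [replacement if level == idle_level else level, lower, upper]
        for level, lower, upper in levels
    ]


if __name__ == "__main__":
    print("levels:")
    for row in use_auto_for_idle(LEVELS):
        # JSON lists are valid YAML flow sequences, so this can be pasted
        # into thinkfan.conf as-is.
        print(" - " + json.dumps(row))
```

Only the first entry changes; the temperature limits stay the same, so the other levels behave as before.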