tuxedocomputers / tuxedo-control-center

A tool to help you control performance, energy, fan and comfort settings on TUXEDO laptops.
GNU General Public License v3.0

Fans not changing speed after TCCD crashes #403

Open notlet opened 4 months ago

notlet commented 4 months ago

I first came across this issue about a month ago. I was playing a game when I noticed that my Tuxedo laptop was painfully hot to the touch. Then I realized that I hadn't heard the fans in a while and tried launching TCC to check on them, but the GUI wasn't showing up. I tried killing and restarting it and starting it from a terminal, but nothing helped and the GUI never showed up. After a reboot, everything was working fine, and the fans immediately spun up to bring the temperature down from nearly 80°C. I opened a support ticket, and they told me that when TCC dies, the fan control should've been handed off to EC, which did not happen, and asked me to send them more information if more incidents occurred. The second incident occurred a few weeks later while I was playing another game. It went pretty much the same, except this time I thought of capturing the output of dmesg and the systemctl logs of tccd.service. I sent them another support ticket, but they replied with the following, which wasn't very useful.

It rarely happens that the TCC service crashes. However, this is not a major problem because the service is running for the most part and should not cause any problems. You are welcome to continue observing the further behavior. If the TCC service crashes frequently, then there appears to be an error. We'll have to have a look. Until then, you can continue to observe the behavior. For further questions, please do not hesitate to contact us.

Lately, these incidents have been happening more and more often. There was also a time when the fans got stuck spinning at a certain mid-range speed, and no matter how idle the laptop was, the noise persisted until I rebooted. It now happens no matter what I'm doing, not just when playing games. I cannot find what triggers it; all the incidents have seemingly happened at random. Also, I have to force a shutdown each time, because otherwise the shutdown process hangs indefinitely at the "Stopping tccd.service" step.

System Info

CachyOS (Arch Linux), using the tuxedo-control-center-bin AUR package, which is just a repackaged RPM. The laptop is Tuxedo Polaris 15 Gen5. Using GNOME 46 on Wayland, if that matters.

TLDR

TCCD crashes and leaves fans at a set speed, regardless of system load, until reboot, which can lead to severe overheating.

sorry for writing a whole essay, this has been annoying me for a while now

tuxedoder commented 4 months ago

the GUI wasn't showing up

The TCC UI currently does not open if tccd is not active. You can check the status with sudo systemctl status tccd.

leaves fans at a set speed, regardless of system load, until reboot

A reboot will restart tccd.

when TCC dies, the fan control should've been handed off to EC, which did not happen

tccd can only reset the fan state if it exits normally; if it crashes abruptly, it cannot run any cleanup code. If it terminated abruptly while custom fan control was on, the fans will most likely stay stuck at the last set speed. On a normal exit it is supposed to call onExit() and switch the fans back to auto mode there.

https://github.com/tuxedocomputers/tuxedo-control-center/blob/392d25445d25ef15f5d092a3744e7770f9655578/src/service-app/classes/FanControlWorker.ts#L109
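To illustrate why a crash skips this step, here is a minimal sketch of the general pattern, not the actual tccd code. The names FanController, setAutoMode, SignalSource and installExitHandlers are hypothetical and only stand in for whatever the daemon really uses. Cleanup like this runs only when the process itself gets a chance to handle the shutdown; a SIGKILL or a kernel oops that kills the process never reaches these handlers.

```typescript
// Hypothetical interface: stands in for the daemon's real fan API.
interface FanController {
  setAutoMode(): void;
}

// Minimal shape of an event source such as Node's `process` object.
type SignalSource = {
  on(event: string, handler: () => void): unknown;
};

// Registers handlers that return fan control to automatic mode
// when the process is asked to stop or exits on its own.
function installExitHandlers(
  fans: FanController,
  proc: SignalSource,
  exit: (code: number) => void
): void {
  let restored = false;

  // Guard so the fans are only reset once, even if several events fire.
  const restore = (): void => {
    if (restored) return;
    restored = true;
    fans.setAutoMode();
  };

  // Catchable termination signals: reset the fans, then exit cleanly.
  for (const sig of ["SIGINT", "SIGTERM"]) {
    proc.on(sig, () => {
      restore();
      exit(0);
    });
  }

  // Normal process exit. This never fires if the process is killed
  // by SIGKILL or dies inside the kernel.
  proc.on("exit", restore);
}
```

In a Node.js daemon this would be wired up roughly as installExitHandlers(fans, process, (code) => process.exit(code)). Passing the event source in as a parameter is only a choice made here to keep the sketch easy to test.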

journalctl is supposed to say something like this:

Jul 09 16:11:23 test tccd[4100]: Stopping daemon..
Jul 09 16:11:23 test tccd[4100]: Daemon is stopped
Jul 09 16:11:23 test systemd[1]: tccd.service: Deactivated successfully.
Jul 09 16:11:23 test systemd[1]: tccd.service: Consumed 18.888s CPU time.
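If the log is long, the check can be scripted. The following is a rough sketch that assumes the log has already been saved as text and that the two daemon messages are worded exactly as in the sample above; the function name stoppedCleanly is made up for this example.

```typescript
// Returns true if the journal text contains the clean-shutdown
// messages from the sample above, in that order.
function stoppedCleanly(journalText: string): boolean {
  const lines = journalText.split("\n");

  // Assumption: message wording matches the sample output exactly.
  const stopping = lines.findIndex((l) => l.includes("Stopping daemon.."));
  if (stopping === -1) return false;

  // "Daemon is stopped" must appear at or after the "Stopping daemon.." line.
  return lines.slice(stopping).some((l) => l.includes("Daemon is stopped"));
}
```

A log from a crashed tccd would be expected to return false here, since the daemon never gets to print either message.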

At first glance I am not sure what happened, since I have never seen this before. Have you tested other operating systems or kernel versions? I would recommend trying a different kernel, since this part of your dmesg output looks like a kernel bug to me:

[  353.992714] note: power-profiles-[1292] exited with irqs disabled
[  353.992719] note: power-profiles-[1292] exited with preempt_count 1
[  359.152498] BUG: kernel NULL pointer dereference, address: 00000000000000bf
[  359.152508] #PF: supervisor read access in kernel mode
[  359.152513] #PF: error_code(0x0000) - not-present page
[  359.152517] PGD 13714d067 P4D 13714d067 PUD 13a41a067 PMD 0 
[  359.152526] Oops: 0000 [#2] PREEMPT SMP NOPTI
[  359.152532] CPU: 11 PID: 799 Comm: tccd Tainted: P      D    OE      6.9.3-4-cachyos #1 c10c5896a2d0b05b7b9b42dc27803b51e720f172
...
[  359.152938] note: tccd[799] exited with irqs disabled
[  359.152947] note: tccd[799] exited with preempt_count 1

notlet commented 4 months ago

It just happened again today; here's the dmesg log. I'll try using a different kernel for a while, but it would be nice to figure out the issue on the CachyOS kernel as well.

tuxedoder commented 4 months ago

sudo journalctl --boot > log.txt would also be useful, since more information is written there. --boot shows the log of the current boot.

If it really is a kernel bug, it would make more sense to report it to kernel developers or maintainers instead.

notlet commented 4 months ago

Here's the journalctl log. Unfortunately it is very long, since my laptop had been running for several hours when the issue happened.

notlet commented 3 months ago

Since moving to the linux-zen kernel, I haven't had this issue again. So it seems to be a problem with the CachyOS kernel after all.