intel / thermal_daemon

Thermal daemon for IA
GNU General Public License v2.0

Dell Latitude 5420 (TGL) throttled to 400 MHz CPU / 100 MHz GPU after 30s #293

Closed: majanes-intel closed this issue 2 years ago

majanes-intel commented 3 years ago

Kernel: 5.11.3
Debian: Testing
thermald: 2.4.3 (Debian unstable)
Processor: i7-1185G7 (28 W TDP)

After running power-intensive workloads for a short amount of time, the CPU and/or GPU will be throttled down drastically to ~10% of peak.

Running turbostat reveals that the peak power is ~16 W, far below the TDP limit.

Running lm-sensors shows that the peak temperature is ~50 °C, far below the limit.
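
For readers without turbostat or lm-sensors at hand, the same two observations (frequency and temperature) can be roughly cross-checked from sysfs. This is only a sketch: it assumes the standard cpufreq and thermal sysfs layout, and `SYSFS_ROOT` is a hypothetical override added here so the functions can be pointed at a test tree.

```shell
#!/bin/sh
# Sketch: print per-CPU frequency and thermal-zone temperatures from sysfs.
# SYSFS_ROOT is a hypothetical override, not something thermald uses.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_cpu_freqs() {
    for f in "$SYSFS_ROOT"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        khz=$(cat "$f" 2>/dev/null) || continue
        case $khz in ''|*[!0-9]*) continue ;; esac   # skip unreadable values
        cpu=${f%/cpufreq/scaling_cur_freq}
        echo "${cpu##*/}: $((khz / 1000)) MHz"
    done
}

show_zone_temps() {
    for z in "$SYSFS_ROOT"/class/thermal/thermal_zone[0-9]*; do
        [ -r "$z/temp" ] || continue
        mc=$(cat "$z/temp" 2>/dev/null) || continue   # millidegrees Celsius
        case $mc in ''|*[!0-9-]*) continue ;; esac
        ztype=$(cat "$z/type" 2>/dev/null)
        echo "${z##*/} ($ztype): $((mc / 1000)) C"
    done
}

show_cpu_freqs
show_zone_temps
```

Running it in a loop while the workload is active should show whether the cores sit at 400 MHz while the zones stay well below their limits.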

After reading #291 and #280, I enabled debug logs for thermald; the log is attached as thermald.log.

@spandruvada let me know if more information is needed. I can also bring the system to you in JF1. Mesa team will be using this laptop model for perf analysis.

mazzz1y commented 3 years ago

Summary for new users:

If I'm not mistaken, we have two issues:

1) Throttling to 400 MHz, caused by incorrect behavior of the legacy interface. This was resolved by commit https://github.com/intel/thermal_daemon/commit/1ad03424f7f3d339521635f08377b323375b2747 and the v2.4.6 release.
2) Throttling to 1800 MHz, which is a separate issue. I currently have no idea how to resolve it and would appreciate any help.

Affected laptops: Dell Latitude 5420/7420/7520
Related issues: https://github.com/erpalma/throttled/issues/255

n0rad commented 3 years ago

I have a Latitude 7420, and running thermald 2.4.6 (or master) does not help with the 400 MHz throttling. I don't understand why, but I never get the fix's info message "Disable rapl-msr interface and use rapl-mmio".

Did I forget to set something up?

This is especially weird because, after an hour of playing with throttled and then thermald the first time, I ended up with a "usable laptop" running thermald (up to 1.8 GHz, and no drop to 400 MHz). But since I rebooted, running thermald is not helping :thinking:

I'm now running this so that I don't just throw out my laptop (please don't slap me):

while true; do
  rmmod intel_rapl_msr
  rmmod processor_thermal_device
  rmmod processor_thermal_rapl
  rmmod intel_rapl_common
  rmmod intel_powerclamp

  modprobe intel_powerclamp
  modprobe intel_rapl_common
  modprobe processor_thermal_rapl
  modprobe processor_thermal_device
  modprobe intel_rapl_msr

  sleep 1
done

mazzz1y commented 3 years ago

Hi @n0rad, it works without a problem on my Latitude 7520 (stuck at 1800 MHz only), and I didn't install throttled.

Which distro, kernel version and BIOS version are you on, and are you sure thermald is enabled at startup? I think the 7420 should behave identically to the 7520/7320 because the BIOS is the same.

n0rad commented 3 years ago

OK, so while I was collecting the info I found that throttled was starting as a unit under the name lenovo_fix.service. Definitely incompatible.

Here is the info, just in case:

os: archlinux
kernel : 5.12.15-arch1-1
bios : 1.7.1
thermald: 2.4.6
command: /usr/bin/thermald --systemd --dbus-enable --adaptive

I still don't get the fix info message, but it no longer drops to 400 MHz. Sorry for the noise.

I hope we will get a solution for the 1.8 GHz cap :crossed_fingers:

mazzz1y commented 3 years ago

I cannot fix it, so I'm also waiting for a fix from somewhere. I will post here if I find something.

Just curious: do the Linux editions of these laptops have the same problems? Unfortunately, Dell doesn't offer its Ubuntu OEM image for download, so I can't check.

I asked Dell Care about this issue but didn't receive an adequate answer. Looks like Dell Care doesn't care.

zamazan4ik commented 3 years ago

I can propose a "kind of fix" that seems to work sometimes.

Set up dual boot (Windows 10 + Linux), boot into Windows, install the Intel Dynamic Tuning Driver, and wait for some time. (The CPU frequency will settle back to normal on Windows; without the driver, Windows has the same 400 MHz issue.) Then reboot into Linux and don't power the machine off :)

I've also tested the OEM kernels from Ubuntu (they are publicly available in the Ubuntu repo): doesn't help. I've tried different OSes with different kernels: doesn't help. I've tried different BIOS versions: doesn't help. Dell support didn't provide any useful information either.

So for now I live with 1800 MHz. The only solution I see is to wait for the Lenovo T14 Gen 2 AMD and replace my laptop. And I will never buy a Dell laptop again. (By the way, Linux seems to work fine on the Dell XPS 13.)

mazzz1y commented 3 years ago

> Install dual-boot (Windows 10 + Linux), boot on Windows, install Intel Dynamic Tuning Driver, wait for some time (CPU frequency will be established to the normal state on Windows. Without the driver Windows OS has the same issue with 400 Mhz CPU frequency). Then reboot to Linux and don't turn off it :)

That is already resolved by the latest version of thermald, so we are at 1800 MHz now.

> So for now I live with 1800 Mhz. And I see the only solution - I am waiting for Lenovo T14 Gen 2 AMD and I will change my laptop. And I will never buy Dell laptop again. (by the way, seems like Linux works fine on Dell XPS 13 model).

I keep hoping that it will be fixed by a BIOS update or a hack.

mhosken commented 3 years ago

I'm on a Lenovo (P15s Gen 2), swearing I will go back to Dell over this. So it's reassuring to know that it isn't a manufacturer issue. It's the 11th gen CPU and Intel. I'm not running thermald at all and it's just as bad with the 400 MHz. If thermald 2.4.6 fixes this then I'll happily use it (I have other questions, like how to get power/battery profiles working with thermald). Roll on the Ubuntu 21.04 release.

I would like to check: is running thermald better than not running it for this kind of thing? TIA.

mazzz1y commented 3 years ago

@mhosken some people have reported issues with Lenovo in this related thread: https://github.com/erpalma/throttled/issues/255, please check it. I can't confirm that it is an Intel issue. My second laptop, an MSI with an i7-1185G7 CPU, works perfectly (but it has a lot of non-software issues, so I can't recommend MSI).

> I have other questions like how to get power/battery profiles working with thermald

I think you need to use TLP for that.

ColinIanKing commented 3 years ago

So, I've been seeing excessive CPU throttling on my Lenovo ThinkPad T480 for a while now, so I took some time to debug the thermal zone event activity. I found that passive cooling was kicking in when acpitz reached 86 degrees C (the ACPI passive cooling threshold as returned by the ACPI method _SB.PCI0.LPCB.EC.SEN1._PSV on my laptop). However, passive cooling was disabled only when acpitz dropped below 55 degrees C (which takes 5-10 minutes on a warm day). The workaround I found was to disable this trip point by using:

echo -1 | sudo tee /sys/module/thermal/parameters/psv
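
Before overriding anything, it may help to see which trip points the kernel actually exposes. The following is a sketch under the assumption that the thermal zones publish the usual `trip_point_N_type` and `trip_point_N_temp` files; `SYSFS_ROOT` is a hypothetical override used only to make the function testable.

```shell
#!/bin/sh
# Sketch: list every trip point of every thermal zone (type and temperature).
# SYSFS_ROOT is a hypothetical override for testing.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

list_trip_points() {
    for t in "$SYSFS_ROOT"/class/thermal/thermal_zone[0-9]*/trip_point_*_type; do
        [ -r "$t" ] || continue
        zone=${t%/*}                 # .../thermal_zoneN
        base=${t%_type}              # .../thermal_zoneN/trip_point_M
        kind=$(cat "$t" 2>/dev/null) || continue
        mc=$(cat "${base}_temp" 2>/dev/null) || continue   # millidegrees C
        case $mc in ''|*[!0-9-]*) continue ;; esac
        ztype=$(cat "$zone/type" 2>/dev/null)
        echo "${zone##*/} ($ztype) ${base##*/}: $kind at $((mc / 1000)) C"
    done
}

list_trip_points
```

On a machine matching the description above, one would expect a `passive` entry for the acpitz zone near 86 C; other systems will differ.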

mazzz1y commented 3 years ago

Thanks @ColinIanKing, maybe it will be helpful for Lenovo users.

I tried this workaround and it didn't help on the Dell laptop. Based on spandruvada's reply above, I think I need to find out how to disable the trip point for the TMEM sensor, but I can't find how to do it:

> The problem is that TMEM sensor reaches its limits of 42C in 4 seconds, so the system is throttled from max power. Even at the start the temperature is 39C. So not much margin. Not sure what can be done here.
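
To watch how close that sensor sits to its limit, a zone can be located by its type string. This is a sketch only: whether the sensor appears as a thermal zone of type `TMEM` on a given machine is an assumption, and `find_zone_by_type` and `SYSFS_ROOT` are names introduced here for illustration.

```shell
#!/bin/sh
# Sketch: find a thermal zone by its type string and print its temperature.
# find_zone_by_type and SYSFS_ROOT are illustrative names, not thermald features.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

find_zone_by_type() {
    want=$1
    for z in "$SYSFS_ROOT"/class/thermal/thermal_zone[0-9]*; do
        [ -r "$z/type" ] || continue
        if [ "$(cat "$z/type" 2>/dev/null)" = "$want" ]; then
            echo "$z"
            return 0
        fi
    done
    return 1
}

if zone=$(find_zone_by_type TMEM); then
    # temp is reported in millidegrees Celsius
    echo "TMEM is $zone, temp: $(cat "$zone/temp" 2>/dev/null)"
else
    echo "no thermal zone of type TMEM found" >&2
fi
```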

ColinIanKing commented 3 years ago

@dmirubtsov my hunch is that the TMEM sensor is handled by the INT3402 thermal driver for memory temperature reporting, found in the kernel at drivers/thermal/intel/int340x_thermal/int3402_thermal.c. I'm not knowledgeable about this driver, but it does provide a thermal zone, and you may have a _TMP ACPI object that the kernel can use to read the temperature of this device. The driver comes with an int3402_notify() handler that handles thermal trip events. Perhaps disabling or unloading the int3402_thermal_driver may help.

mazzz1y commented 3 years ago

I have these modules loaded:

dell ~ » lsmod | grep int3
int3403_thermal        20480  0
int340x_thermal_zone    20480  2 int3403_thermal,processor_thermal_device
int3400_thermal        20480  0
acpi_thermal_rel       16384  1 int3400_thermal

I've tried to unload all of these modules, but it didn't affect anything:

dell ~ » lsmod | grep int3                            
dell ~ »

mazzz1y commented 3 years ago

@mhosken @ZaMaZaN4iK @majanes-intel

Workaround until we get a fix:

I've disabled SpeedShift in the BIOS settings and got a huge video performance improvement. The CPU still throttles to 1800 MHz, but now that no longer affects video performance (or not as much), so the GUI runs pretty smoothly. I can even play GTA 5 on my laptop without freezes.

It can also be done from the OS with the dell-command-configure package (AUR):

sudo /opt/dell/dcc/cctk --SpeedShift=Disabled

mangatmodi commented 3 years ago

The latest version of this driver fixes the extremely low 400 MHz frequency issue. However, the CPU is still locked at 1800 MHz max.

mangatmodi commented 3 years ago

> I'm on a Lenovo (P15s gen2) swearing I will got back to Dell over this. So it's reassuring to know that it isn't a manufacturer issue. It's the 11th gen cpu and Intel. I'm not running thermald at all and it's just as bad with the 400MHz. If thermald 2.4.6 fixes this then I'll happily use it (I have other questions like how to get power/battery profiles working with thermald). Roll on the ubuntu 21.04 release.

@mhosken thermald 2.4.6 does fix it. It's locked at 1800 MHz for me now, and only very rarely goes down to 1500.

carathorys commented 3 years ago

I have a Dell Latitude 5420 with an Intel Core i5-1135G7, and I have this problem too. thermald fixed the 400 MHz issue, but now my CPU often throttles to ~1400 MHz (all cores), especially when I do memory-intensive work. I've tried s-tui: when I start the simple Sqrt() tests on 8 threads, it sustains the high frequencies a little longer and then slows down to ~2100 MHz, which I think is normal. But if I start the Malloc() tests (8 threads), it drops to ~1400 MHz almost instantly.

$ uname -a    
Linux *** 5.13.7-arch1-1 #1 SMP PREEMPT Sat, 31 Jul 2021 13:18:52 +0000 x86_64 GNU/Linux

mazzz1y commented 3 years ago

@carathorys Thanks for your reply and welcome to the club :)

I see that Dell released a new BIOS version for your laptop about a week ago. Did you try it?

carathorys commented 3 years ago

Yes, I updated everything on the 31st of July, and now I have the latest BIOS version, 1.10.0, but the problem remains.

sameer commented 3 years ago

> Yes, I've updated everything on 31th of July, and now I have the latest BIOS version 1.10.0, but the problem remains.

Same here, unfortunately.

aitorpazos commented 3 years ago

I just got the latest Dell firmware upgrade and the issue still persists:

  └─Latitude 7320, Latitude 7320, Latitude 7420, Latitude 7420, Latitude 7520 System Update:
        New version:      1.7.1
        Remote ID:        lvfs
        Summary:          Firmware for the Dell Latitude 7320, Latitude 7320, Latitude 7420, Latitude 7420, Latitude 7520
        License:          Proprietary
        Size:             23.6 MB
        Created:          2021-06-08
        Urgency:          Critical
        Vendor:           Dell Inc.
        Description:
        This stable release fixes the following issues:

        • Firmware updates to address security vulnerabilities.
        • Firmware updates to address the Intel Security Advisory.
        • Fixed the issue where the system screen flashes after booting.
        • Fixed the issue where the BIOS recovery is initiated when you quickly turn off and turn on the system.
        • Fixed the issue where the system with USB port disabled does not recognize the dock even though Type-C dock override is enabled in the BIOS. This issue occurs after the system restart.

        Some new functionality has also been added:

        • Enhanced the CPU thermal stability.

For me, running the following after each boot fixes the issue until next reboot:

sudo rmmod intel_rapl_msr
sudo modprobe intel_rapl_msr

mazzz1y commented 3 years ago

> For me, running the following after each boot fixes the issue until next reboot:

And then it is stuck at 1800 MHz instead of 400, right? That's known behavior. You can install the latest version of thermald instead of reloading the module.

The 1800 MHz issue is still not fixed.

mazzz1y commented 3 years ago

Dell released BIOS 1.8.2 for the 7x20 laptops, but the issue still persists with the new version.

n0rad commented 3 years ago

It's better for me after following @ftsogr's comment: https://github.com/erpalma/throttled/issues/255#issuecomment-903144537

mazzz1y commented 3 years ago

@n0rad the issue of being stuck at 1800 MHz is still not fixed, and video performance is very poor due to the power limit compared with Lenovo/MSI laptops.

Instead of reloading modules, you can install the latest version of thermald and remove any other related tools.

mazzz1y commented 3 years ago

@carathorys

A new BIOS was released for your model, can you please check it?

https://www.dell.com/support/home/ru-rs/drivers/driversdetails?driverid=3fg6d&oscode=wt64a&productcode=latitude-5420-laptop

mazzz1y commented 3 years ago

Dell released BIOS version 1.9.1 for the 7x20, but nothing has changed.

carathorys commented 3 years ago

> @carathorys
>
> new bios was released for your model, can you please check it?
>
> https://www.dell.com/support/home/ru-rs/drivers/driversdetails?driverid=3fg6d&oscode=wt64a&productcode=latitude-5420-laptop

I've updated, and I'm still experiencing the same issue: throttled to 1.4 GHz with thermald enabled. Today we're sending the laptop back to the retailer, because I've run into some other memory issues (sometimes the BIOS reports memory issues).

zamazan4ik commented 3 years ago

Not sure whether this can help us or not: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.15-Power-Management

So if anyone can test a Linux 5.15 kernel with the corresponding PR on their local machine, that would be awesome.

mazzz1y commented 3 years ago

> So, if anyone can test on your local machine Linux 5.15 kernel with the corresponding PR and check it - would be awesome.

I will build and test the kernel when 5.15-rc1 is released.

mazzz1y commented 3 years ago

Srinivas (maintainer of this repo) should know whether the power-related changes in 5.15 can help.

@spandruvada what do you think? Thanks.

twtd commented 3 years ago

Same problem on a Dell 7400 with an i5-8265U. Last week I installed Pop!_OS 20.04 LTS with kernel 5.11. Before that I was on Pop!_OS 20.10 with kernel 5.4 LTS and everything was fine. Now, on 20.04 (5.11), the CPU throttles to 400 MHz for no reason. The BIOS is at the latest update, 1.13.0.

Sorry for my awful English...

mazzz1y commented 3 years ago

@twtd try installing the latest version of thermald and check.

twtd commented 3 years ago

> @twtd try to install latest version of thermald and check

@dmirubtsov I found issue #291... but can you please help me install the latest version? I can't do it with apt.

@dmirubtsov I think I did it ))) Now I'll test it.

twtd commented 3 years ago

@dmirubtsov the 400 MHz issue is solved, but when I run sudo systemctl status thermald.service the output is:

Sep 01 18:35:03 pop-os systemd[1]: Starting Thermal Daemon Service...
Sep 01 18:35:03 pop-os systemd[1]: Started Thermal Daemon Service.
Sep 01 18:35:03 pop-os thermald[735]: 22 CPUID levels; family:model:stepping 0x6:8e:c (6:142:12)
Sep 01 18:35:04 pop-os thermald[735]: 22 CPUID levels; family:model:stepping 0x6:8e:c (6:142:12)
Sep 01 18:35:04 pop-os thermald[735]: Polling mode is enabled: 4
Sep 01 18:35:04 pop-os thermald[735]: sensor id 12 : No temp sysfs for reading raw temp
Sep 01 18:35:04 pop-os thermald[735]: sensor id 12 : No temp sysfs for reading raw temp
Sep 01 18:35:04 pop-os thermald[735]: sensor id 12 : No temp sysfs for reading raw temp
Sep 01 18:35:08 pop-os thermald[735]: Unable to find a zone for TVGA

What does it mean?

mazzz1y commented 3 years ago

@twtd I'm not a developer of the thermald project. As I understand it, this is OK and you can ignore these messages.

If you have trouble with it, please open a new issue.

antoniomuso commented 3 years ago

Is there some thermal_daemon configuration to raise the frequency from 1400 MHz to a higher value?

mazzz1y commented 3 years ago

> There is some configuration of thermal_deamon to increase the frequencies from 1400MHz to a higher value?

No, it is still not fixed.

mazzz1y commented 2 years ago

The problem still persists with the new BIOS version 1.9.3 (Latitude 7x20).

mazzz1y commented 2 years ago

Looks like Dell doesn't care about this issue, so I switched to a Lenovo laptop.

Thank you all. I will follow this issue until I sell the Dell laptop.

mazzz1y commented 2 years ago

Nothing has changed after updating to BIOS 1.9.6 (Latitude 7x20).

aitorpazos commented 2 years ago

I have just updated my Latitude 7x20 to latest System Firmware 1.11.3 and this issue seems to be fixed for me. I now see the CPU (11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz) going beyond 3GHz as needed.

errogaht commented 2 years ago

Dell 5421, i7-11850H, BIOS 1.6.1: nothing changed. Throttled to 800 MHz after 10 seconds of running s-tui.

Grtschnk commented 2 years ago

Same issue: long-term (+2s) power throttling at 10-12 W. It can boost but quickly drops and stays around 1,800 MHz. Latitude 7420, BIOS v1.22.2, Arch, up to date.

Not sure where exactly the fix is to be found: thermald, kernel, intel, dell or throttled.. I'll probably give up for now and just hope it will be fixed one day (only blaming the big vendors here, not any of the open source projects ;) ).
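
One way to narrow this down is to compare the observed 10-12 W ceiling with the RAPL limits the kernel currently exposes through powercap. The sketch below assumes the usual `intel-rapl:N` and `intel-rapl-mmio:N` powercap zones with `constraint_N_name` and `constraint_N_power_limit_uw` files; `SYSFS_ROOT` is a hypothetical override for testing, and some files may need root to read.

```shell
#!/bin/sh
# Sketch: print the RAPL power limits exposed under powercap (MSR and MMIO).
# SYSFS_ROOT is a hypothetical override for testing.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_rapl_limits() {
    for f in "$SYSFS_ROOT"/class/powercap/intel-rapl*/constraint_*_power_limit_uw; do
        [ -r "$f" ] || continue
        uw=$(cat "$f" 2>/dev/null) || continue      # microwatts
        case $uw in ''|*[!0-9]*) continue ;; esac
        dom=${f%/*}
        base=${f%_power_limit_uw}
        name=$(cat "${base}_name" 2>/dev/null)      # e.g. long_term, short_term
        echo "${dom##*/} ${name:-?}: $((uw / 1000)) mW"
    done
}

show_rapl_limits
```

If the MSR and MMIO domains report different long-term limits, that difference is worth including in a bug report, since the lower of the two is likely the one that applies.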

VitaliiSerdiuk commented 2 years ago

@Grtschnk @dmirubtsov

Dell Latitude 5420 with the latest BIOS (1.13.1)
Ubuntu 20.04
Kernel 5.14.21 from ppa.launchpad.net/tuxinvader/lts-mainline/ubuntu
Intel thermal_daemon 2.4.6 from https://github.com/intel/thermal_daemon/ (NOT from the Ubuntu or Debian repository)

SpeedStep=Disabled works. With the BIOS "Ultra Performance" power setting, it runs at up to 3.1 GHz for 2 minutes and then drops to 2.3 GHz because the temperature reaches 95 °C. With the "Optimized" power setting, I get 2.3 GHz at 65 °C.

zamazan4ik commented 2 years ago

@VitaliiSerdiuk if you enable SpeedStep, do you get drops to 400 MHz?

I just want to understand the reason for such drops. I am now on a Dell Latitude 5410 and have the same problem. With a CPU-only workload, it works almost fine. With any GPU-intensive workload (any game), my CPU drops to 400 MHz from time to time.

VitaliiSerdiuk commented 2 years ago

@zamazan4ik SpeedStep doesn't have a huge impact. As far as I understand, the biggest factors are:

  1. The latest BIOS version
  2. The BIOS power management setting (Ultra Performance, Optimized, ...)
  3. The thermal_daemon version from https://github.com/intel/thermal_daemon/ (the Ubuntu and Debian repositories don't have the latest update)
  4. The 5.14.21 kernel. Somehow the issue appears on the 5.15.7 kernel

But I haven't checked it with a GPU stress test. I tested the CPU only, via stress -c 8, and checked the result via s-tui.
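
For a record that can be pasted into the thread instead of an s-tui screenshot, the highest current core frequency can be sampled while the stress run is active. This is a sketch assuming the standard cpufreq sysfs files; `max_cpu_khz`, `sample_max_freq` and `SYSFS_ROOT` are names made up for this example.

```shell
#!/bin/sh
# Sketch: sample the highest current core frequency at a fixed interval.
# All function and variable names here are illustrative.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

max_cpu_khz() {
    max=0
    for f in "$SYSFS_ROOT"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        khz=$(cat "$f" 2>/dev/null) || continue
        case $khz in ''|*[!0-9]*) continue ;; esac
        if [ "$khz" -gt "$max" ]; then max=$khz; fi
    done
    echo "$max"
}

# sample_max_freq COUNT INTERVAL_SECONDS
sample_max_freq() {
    count=$1
    interval=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%T) $(( $(max_cpu_khz) / 1000 )) MHz"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
}
```

A possible use is to start `stress -c 8` in one terminal and call `sample_max_freq 120 1` in another, which should make the point where the frequency drops easy to spot.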

Grtschnk commented 2 years ago

@VitaliiSerdiuk Thank you for the hint, but unfortunately it does not help on my system. :/ (Latitude 7420, Ultra Performance mode in BIOS, BIOS version 1.12.2, Kernel 5.15.5, thermald 2.4.6-1)

If I disable SpeedStep, i7z reports Turbo as turned off. If I disable SpeedShift, Turbo is reported as available, but higher frequencies aren't used until I change the governor manually. And then it still falls back to power throttling after a few seconds.

The 5420 and 7420 use different BIOS version numbers; not sure how much they actually differ.

antoniomuso commented 2 years ago

> @Grtschnk @dmirubtsov Dell Latitude 5420 with latest BIOS - 1.13.1 Ubuntu 20.04 Kernel 5.15.4 - ppa.launchpad.net/tuxinvader/lts-mainline/ubuntu Intel thermal_daemon 2.4.6 - https://github.com/intel/thermal_daemon/. SpeedStep=Disabled works and with BIOS Ultra performance power setup for me work up to 3.1GHz for 2 minutes then drops to 2.3GHz because of 95 Celsius. Optimized power bios setup got 2.3 GHz with 65 Celsius

Me too: I fixed the problem with the new BIOS update. Dell 5420.

ColinIanKing commented 2 years ago

I wish we could find out exactly what the BIOS upgrades are fixing. That would be super insightful information.