intel / thermal_daemon

Thermal daemon for IA
GNU General Public License v2.0

Dell Latitude 5420 (TGL) throttled to 400MHz CPU / 100MHz GPU after 30s #293

Closed majanes-intel closed 2 years ago

majanes-intel commented 3 years ago

Kernel: 5.11.3 Debian: Testing thermald: 2.4.3 (debian unstable) processor: i7-1185G7 -- 28 W TDP

After running power-intensive workloads for a short amount of time, the CPU and/or GPU will be throttled down drastically to ~10% of peak.

Running turbostat reveals that the peak package power is ~16W, far below the TDP limit.

Running lm-sensors shows that the peak temp is ~50C, far below the limit.

After reading #291 and #280, I enabled debug logs for thermald. thermald.log

@spandruvada let me know if more information is needed. I can also bring the system to you in JF1. Mesa team will be using this laptop model for perf analysis.

majanes-intel commented 3 years ago

I neglected to mention: running an identical workload on windows completes with no degradation in power. The system gets much warmer and the fans run at a clearly higher speed. Based on this observation, it seems clear that the system is not being limited by some physical thermal problem.

spandruvada commented 3 years ago

If this is the complete log, then, as you observed, none of the temperatures triggered any throttling.

Bring the system, we can take a look.

Thanks.

spandruvada commented 3 years ago

Probably there is some setting that Windows is aware of. We can't compare with Windows, as we don't support several of the conditions in the table on this system, so we are working on a best-effort basis. In particular, the power slider and probably the fan control are missing.

So we need to find what else we can do with these limitations.

Thanks.

spandruvada commented 3 years ago

I see different behavior between version 2.4.3 from this repository and the version in Debian. The power limit is not being set in the Debian version. So is Debian back-porting patches? If that is the case, they should use a different private version. Who can help here? @ColinIanKing

ColinIanKing commented 3 years ago

I'll sort that out first thing Tuesday.

majanes-intel commented 3 years ago

@spandruvada thanks for the work to figure out why this system was turning the GPU down to 100MHz. Your test branch improves the situation substantially, although it looks like there is still a long way to go. Running longer benchmarks, I can see that the CoreTmp climbs all the way to 72 degrees, with the GFXAMHz stable at the 1350MHz maximum. The PkgWatt is around 25W, near the TDP limit. After that, power is cut to the system, limiting the GPU to 400MHz. The temp declines steadily, with the PkgWatt at 12W. For a short duration, the GFXAMHz oscillates between 400 and 1000, then stays at 400MHz. The temp declines to 40 degrees by the time the benchmark is done.

lm-sensors reports that the package max temp is 100 degrees Celsius. Is that accurate/realistic? If so, then it seems like thermald should wait longer before cutting power. If not, then it seems like thermald could settle on a much higher power level for the package: at 12W, the package temp declines below what is necessary and performance suffers.

I used Unigine Heaven for this data point. I took a look at the Thermal Analysis Tool on Windows, but I couldn't see how to get similar data from that platform. If you can give me some pointers, I should be able to at least understand what frequency/power levels Windows achieves, and what the stable max temp is.
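
For reference, a rough way to cross-check turbostat's PkgWatt on Linux is to sample the RAPL energy counter from the powercap sysfs interface. This is only a sketch: the domain path /sys/class/powercap/intel-rapl:0 and the helper names watts_from_energy and pkg_watts are assumptions, not something taken from thermald, and reading energy_uj may require root.

```shell
# Average power in watts between two energy readings (in microjoules)
# taken `seconds` apart. `wrap` is the counter range
# (max_energy_range_uj), used only if the counter wrapped in between.
watts_from_energy() (
    e0=$1; e1=$2; seconds=$3; wrap=${4:-0}
    awk -v a="$e0" -v b="$e1" -v s="$seconds" -v w="$wrap" 'BEGIN {
        d = b - a
        if (d < 0) d += w          # counter wrapped around
        printf "%.2f\n", d / 1000000 / s
    }'
)

# Sample the package domain once over `seconds` (default 1).
# RAPL_DIR is an assumed default; adjust it for your system.
pkg_watts() (
    dir=${RAPL_DIR:-/sys/class/powercap/intel-rapl:0}
    seconds=${1:-1}
    e0=$(cat "$dir/energy_uj") || exit 1
    sleep "$seconds"
    e1=$(cat "$dir/energy_uj") || exit 1
    wrap=$(cat "$dir/max_energy_range_uj" 2>/dev/null || echo 0)
    watts_from_energy "$e0" "$e1" "$seconds" "$wrap"
)
```

Running pkg_watts 5 as root while the benchmark is active prints the average package power over five seconds, which should land close to the PkgWatt column.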

majanes-intel commented 3 years ago

When I booted to the Windows partition, I noticed that updates were running in the background, which can perturb performance measurements. I let the system complete a full software update, which updated the firmware on the device. With the firmware update, I now get a stable 1000MHz GPU clock, with the package temp stable at 50 degrees Celsius. While this is much closer to optimal, it still seems to me that the package could target a higher package temperature.

benzea commented 3 years ago

So, we could support the power slider condition (either by using a default value or by pulling a value from p-p-d).

However, that would only help if we can resolve the \_SB_.PCI0.LPCB.ECDV.NGFF sensor. And, even if we do that, the OEM conditions coming through ACPI might still not have a sane value to proceed.

zamazan4ik commented 3 years ago

I've met the same issue. @benzea does thermald 2.4.4 resolve the issue?

I've built 2.4.4 for Fedora 34, installed it, and now have almost constant 1700Mhz instead of 400Mhz. That's fine, but my CPU temp is still too low (~54C), so I am sure that the CPU can gain a higher clock speed. Is it possible somehow?

zamazan4ik commented 3 years ago

In case it helps, here is the log from journalctl -r for thermald 2.4.4 on Fedora 34, launched with the options --systemd --dbus-enable --adaptive:

May 02 06:05:24 localhost.localdomain thermald[3010]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
May 02 06:05:24 localhost.localdomain thermald[3010]: sensor id 10 : No temp sysfs for reading raw temp
May 02 06:05:24 localhost.localdomain thermald[3010]: sensor id 10 : No temp sysfs for reading raw temp
May 02 06:05:24 localhost.localdomain thermald[3010]: sensor id 10 : No temp sysfs for reading raw temp
May 02 06:05:23 localhost.localdomain thermald[3010]: Polling mode is enabled: 4
May 02 06:05:23 localhost.localdomain thermald[3010]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported conditions are present
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:16 localhost.localdomain thermald[3010]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
May 02 06:05:16 localhost.localdomain systemd[1]: Started Thermal Daemon Service.

Laptop: Dell Latitude 5420 with 11th Gen Intel(R) Core(TM) i7-1165G7 CPU.

benzea commented 3 years ago

I've met the same issue. @benzea does thermald 2.4.4 resolve the issue?

I've built 2.4.4 for Fedora 34, installed it, and now have almost constant 1700Mhz instead of 400Mhz. That's fine, but my CPU temp is still too low (~54C), so I am sure that the CPU can gain a higher clock speed. Is it possible somehow?

Oh, a newer thermald for Fedora would help?

Sorry about that. I thought I had picked up the important patches downstream already (even if I had an older version). I can update the package so that others benefit from that.

zamazan4ik commented 3 years ago

Yeah, 2.4.4 helps somewhat on Fedora but does not completely resolve the issue. Without thermald 2.4.4 (with an older thermald version or with none at all), the CPU is still downclocked to 400MHz after ~30 seconds. With thermald 2.4.4, the highest clock is 1700MHz. It would be awesome if you could build thermald 2.4.4 for Fedora :)

It's still too low since the usual clock for the CPU is 2800Mhz. And I have no idea how it can be fixed :(

zamazan4ik commented 3 years ago

@benzea any news about Fedora updates?

benzea commented 3 years ago

@benzea any news about Fedora updates?

On its way now.

zamazan4ik commented 3 years ago

I am not familiar with modern Linux CPU scheduling, but I think the real root of the issue is a bug in the intel_pstate implementation in the Linux kernel, because on Windows I can get a stable 2.8GHz CPU clock on the same hardware. On Linux (Fedora 34) I get only 400MHz without thermald and 1.7GHz with it.

Maybe anyone from thermald team can provide more information. I will try to test another Dell Latitude 5420. Also in a few days I'll test Dell Latitude 5410 (hope it'll work better).

By the way - with modern Intel CPUs is using Thermald necessary or not?

benzea commented 3 years ago

I am not familiar with modern Linux CPU scheduling, but I think the real root of the issue is a bug in the intel_pstate implementation in the Linux kernel, because on Windows I can get a stable 2.8GHz CPU clock on the same hardware. On Linux (Fedora 34) I get only 400MHz without thermald and 1.7GHz with it.

Please don't jump to such conclusions. The problem is that we need to do thermal management in userspace. To do so, we need to parse data from ACPI which we are not fully implementing because Intel is not publishing the specification. And, on top of that, there may also be vendor specific things.

That is, the line May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN) is probably the issue. If you figure out what that condition means, then one might implement it and it will likely help you.

Maybe anyone from thermald team can provide more information. I will try to test another Dell Latitude 5420. Also in a few days I'll test Dell Latitude 5410 (hope it'll work better).

By the way - with modern Intel CPUs is using Thermald necessary or not?

Yes.

zamazan4ik commented 3 years ago

@benzea Thanks! Can you please describe to me a little bit more, what is the real difference in thermal management between the intel_pstate subsystem and thermald? Or just provide a link, where I can read about it. Thanks in advance!

If you figure out what that condition means, then one might implement it and it will likely help you.

Do you have any suggestions, how can I debug it? Maybe there is some already existing guide for it. I am ready to invest some time into it and assist you as much as I can.

benzea commented 3 years ago

@benzea Thanks! Can you please describe to me a little bit more, what is the real difference in thermal management between the intel_pstate subsystem and thermald? Or just provide a link, where I can read about it. Thanks in advance!

If you figure out what that condition means, then one might implement it and it will likely help you.

Do you have any suggestions, how can I debug it? Maybe there is some already existing guide for it. I am ready to invest some time into it and assist you as much as I can.

Not really. You can enable debug logging for thermald and it'll dump more detailed information. It might be possible to guess what the condition is based on by looking at the values and the various limits that are being applied.

At the end, if we can just emulate a sane default value, we might not even need to know the exact meaning. For power-slider we just assume a "balanced" performance right now for example.
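
To see which conditions and sensors thermald is complaining about without reading the whole debug log, the relevant warnings can be tallied. A minimal sketch, assuming the log was saved to a file and uses the message wording shown earlier in this thread; summarize_thermald_log is a made-up helper name:

```shell
# Count the distinct "Unsupported condition" and missing-sensor/zone
# messages in a saved thermald log. Output format: "<count>: <message>".
summarize_thermald_log() (
    grep -E -o 'Unsupported condition [0-9]+|Unable to find a (sensor|zone) for [^ ]+' "$1" |
        sort | uniq -c | awk '{ n = $1; $1 = ""; print n ":" $0 }'
)
```

For example, after journalctl -u thermald > /tmp/thermald.log, running summarize_thermald_log /tmp/thermald.log lists each unsupported condition number and each unresolved sensor once, with a count.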

spandruvada commented 3 years ago

I pushed another change to fix the performance gap once you update BIOS on this system.

mazzz1y commented 3 years ago

Absolutely the same issue with the Latitude 7520.

spandruvada commented 3 years ago

If the issue is the same on the 7520, does the latest thermald fix it?

mazzz1y commented 3 years ago

The same as for @ZaMaZaN4iK

[root@dell tmp]# thermald --version
2.4.6

With the latest version the CPU is stuck at 1800MHz; without thermald, at 400MHz.

zamazan4ik commented 3 years ago

Since now I have Dell Latitude 5410 - I cannot test the latest thermald on 5420. I'll try to test the latest thermald on the 5410. I hope @benzea ported latest changes to the Fedora version.

benzea commented 3 years ago

Since now I have Dell Latitude 5410 - I cannot test the latest thermald on 5420. I'll try to test the latest thermald on the 5410. I hope @benzea ported latest changes to the Fedora version.

Fedora 34 and 35 both have thermald 2.4.6 currently.

mazzz1y commented 3 years ago

I've attached a debug log from the latest (2.4.6) version of thermald. Not sure if it's helpful.

In the log we can see the frequency dropping to 1800MHz (and the temperature going down from 73 to 55) after a few seconds of stress -c 8.

thermald --no-daemon --loglevel=debug --dbus-enable > /tmp/thermald.log

thermald.log

spandruvada commented 3 years ago

Is this log with --adaptive option?

mazzz1y commented 3 years ago

No. I've attached a new log taken with the --adaptive option:

thermald-adaptive.log

spandruvada commented 3 years ago

The concern is that there are no sensors:

RN]Unable to find a zone for TSKN
[1624968504][WARN]Unable to find a zone for NGFF
[1624968504][WARN]Unable to find a zone for TMEM
[1624968504][WARN]Unable to find a zone for TMEM
[1624968504][WARN]Unable to find a zone for TMEM
[1624968504][WARN]Unable to find a zone for TMEM
[1624968504][WARN]Unable to find a zone for TSSD
[1624968504][DEBUG]check trip zone:0:0

What is the kernel version? Check /sys/class/thermal/thermal_zone*/type to see whether these sensors exist.
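
The check above can be scripted so that the missing sensors stand out. A sketch, assuming the sensor names from the warnings (TSKN, NGFF, TMEM, TSSD) appear verbatim as thermal zone types; check_thermal_zones and THERMAL_ROOT are hypothetical names:

```shell
# For each sensor name given, report whether a thermal zone of exactly
# that type exists. THERMAL_ROOT defaults to the real sysfs location.
check_thermal_zones() (
    root=${THERMAL_ROOT:-/sys/class/thermal}
    for name in "$@"; do
        if cat "$root"/thermal_zone*/type 2>/dev/null | grep -qxF "$name"; then
            echo "$name: present"
        else
            echo "$name: missing"
        fi
    done
)
```

Usage would be check_thermal_zones TSKN NGFF TMEM TSSD; any name reported as missing is a sensor the configuration refers to but the kernel does not expose.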

mazzz1y commented 3 years ago

[root@dell ~]# cat /sys/class/thermal/thermal_zone*/type
INT3400 Thermal
TCPU
iwlwifi_1
x86_pkg_temp
[root@dell ~]# uname -a
Linux dell 5.12.13-arch1-2 #1 SMP PREEMPT Fri, 25 Jun 2021 22:56:51 +0000 x86_64 GNU/Linux

spandruvada commented 3 years ago

Maybe try updating to the latest BIOS. This output doesn't show the sensors described in the thermal configuration.

Do you see the driver loaded? Check with lsmod | grep -i int3

What is the output of ls /sys/bus/platform/devices/

Thanks.

mazzz1y commented 3 years ago

Thanks for your reply

May be try to update to the latest BIOS.

BIOS updated to the latest version(1.7.1)

Do you see driver loaded

dell ~ » lsmod | grep -i int3
int340x_thermal_zone    20480  1 processor_thermal_device
int3400_thermal        20480  0
acpi_thermal_rel       16384  1 int3400_thermal

ls /sys/bus/platform/devices/


dell ~ »  ls /sys/bus/platform/devices/
ACPI0003:00        efivars.0                  HID-SENSOR-2000e1.12.auto  HID-SENSOR-2000e1.22.auto  HID-SENSOR-2000e1.6.auto  INT33A1:00   intel_rapl_msr.0  PNP0C0E:00  regulatory.0
ACPI000C:00        HID-SENSOR-200001.18.auto  HID-SENSOR-2000e1.13.auto  HID-SENSOR-2000e1.23.auto  HID-SENSOR-2000e1.7.auto  INT33D2:00   iTCO_wdt          PNP0C14:00  rtc-efi.0
ACPI000E:00        HID-SENSOR-200001.1.auto   HID-SENSOR-2000e1.14.auto  HID-SENSOR-2000e1.24.auto  HID-SENSOR-2000e1.8.auto  INT33D3:00   microcode         PNP0C14:01  rtsx_pci_sdmmc.0
alarmtimer.0.auto  HID-SENSOR-200001.27.auto  HID-SENSOR-2000e1.15.auto  HID-SENSOR-2000e1.25.auto  HID-SENSOR-INT-020b       INT34C5:00   pcspkr            PNP0C14:02  serial8250
coretemp.0         HID-SENSOR-200001.9.auto   HID-SENSOR-2000e1.16.auto  HID-SENSOR-2000e1.26.auto  i2c_designware.0          INTC1040:00  PNP0103:00        PNP0C14:03  snd-soc-dummy
dcdbas             HID-SENSOR-200041.10.auto  HID-SENSOR-2000e1.17.auto  HID-SENSOR-2000e1.2.auto   i2c_designware.1          INTC1043:00  PNP0C09:00        PNP0C14:04  STM0125:00
dell-laptop        HID-SENSOR-200073.28.auto  HID-SENSOR-2000e1.19.auto  HID-SENSOR-2000e1.3.auto   i8042                     INTC1043:01  PNP0C0A:00        PNP0C14:05  USBC000:00
dell-smbios.0      HID-SENSOR-200076.29.auto  HID-SENSOR-2000e1.20.auto  HID-SENSOR-2000e1.4.auto   idma64.0                  INTC1043:02  PNP0C0C:00        PNP0C14:06
efi-framebuffer.0  HID-SENSOR-2000e1.11.auto  HID-SENSOR-2000e1.21.auto  HID-SENSOR-2000e1.5.auto   idma64.1                  INTC1051:00  PNP0C0D:00        reg-dummy

mazzz1y commented 3 years ago

I found that, for an unknown reason, the int3403_thermal module was blacklisted on my laptop (I think it is an old artifact). I've removed this entry from /etc/modprobe.d and rebooted.

Now my lsmod looks like this:

dell ~ » lsmod | grep -i int3                                                   
int3403_thermal        20480  0
int340x_thermal_zone    20480  2 int3403_thermal,processor_thermal_device
int3400_thermal        20480  0
acpi_thermal_rel       16384  1 int3400_thermal

But the problem persists: after a few seconds of CPU load, the frequency is locked at 1800MHz.

Here I attached new log from thermald: thermald-adaptive.log
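
For others hitting this, a quick way to look for a stale blacklist entry is to search the modprobe configuration directories. A sketch; find_blacklist and MODPROBE_DIRS are hypothetical names, and the directory list is the usual default, which may differ per distribution:

```shell
# Print every modprobe configuration line that blacklists the given
# module, prefixed with the file it was found in. Returns non-zero if
# no such line exists. MODPROBE_DIRS overrides the searched directories.
find_blacklist() (
    module=$1
    dirs=${MODPROBE_DIRS:-/etc/modprobe.d /run/modprobe.d /usr/lib/modprobe.d}
    # $dirs is left unquoted on purpose so it splits into separate paths.
    grep -rsE "^[[:space:]]*blacklist[[:space:]]+$module([[:space:]]|\$)" $dirs
)
```

Running find_blacklist int3403_thermal prints the offending file and line, if any. Note that this only matches the module name exactly as typed, so an entry written with a dash instead of an underscore would need a second query.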

spandruvada commented 3 years ago

Now better. If you have Windows, compare with that.

mazzz1y commented 3 years ago

Sorry, I don't have it.

Dell support assures me that this is not normal and that the laptop should throttle only at 100C.

Here is a screenshot from ThermalMonitor (stress -c 8):

2021-06-30_00-16

spandruvada commented 3 years ago

This is not about temperature, but power limits. Your temperature can still be lower but power limits may have been reached.

Does the log you attached earlier cover the full scenario, from startup to the point where you get throttled to 1800MHz?

benzea commented 3 years ago

The log is only 21s long. And, to me, it looks like RAPL is initially set to 22.1W (maybe from the BIOS?) and is then increased in 0.1W steps every few seconds. So, if you wait longer, the speed may very well increase.
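
The limits that thermald adjusts can be watched directly through the powercap sysfs interface. A sketch, assuming the package domain is intel-rapl:0; rapl_limits and RAPL_DIR are hypothetical names:

```shell
# Print each power-limit constraint of a RAPL domain in watts,
# labelled with the constraint name the kernel reports for it.
rapl_limits() (
    dir=${RAPL_DIR:-/sys/class/powercap/intel-rapl:0}
    for f in "$dir"/constraint_*_power_limit_uw; do
        [ -r "$f" ] || continue
        # constraint_N_power_limit_uw pairs with constraint_N_name
        name=$(cat "${f%_power_limit_uw}_name" 2>/dev/null || echo unknown)
        awk -v n="$name" -v uw="$(cat "$f")" \
            'BEGIN { printf "%s: %.1f W\n", n, uw / 1000000 }'
    done
)
```

Calling it in a loop, for example while sleep 5; do rapl_limits; done, would show whether the long-term limit creeps upward over time as described above.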

mazzz1y commented 3 years ago

This is not about temperature, but power limits. Your temperature can still be lower but power limits may have been reached.

Yes, but for an unknown reason. Without thermald it is throttled to 400MHz, so I think thermald can affect it. It is absolutely the same as for @ZaMaZaN4iK, except that for me it is stuck at 1800MHz, not 1700:

Yeah, 2.4.4 helps somewhat on Fedora but does not completely resolve the issue. Without thermald 2.4.4 (with an older thermald version or with none at all), the CPU is still downclocked to 400MHz after ~30 seconds. With thermald 2.4.4, the highest clock is 1700MHz.


Does the log you attached earlier cover the full scenario, from startup to the point where you get throttled to 1800MHz?

No,

1) I started thermald manually:

thermald --no-daemon --loglevel=debug --dbus-enable --adaptive  > /tmp/thermald-adaptive.log

2) In another terminal session I started stress

stress -c 8

3) Waited a few seconds and saw that the frequency had dropped to 1800MHz

4) Stopped thermald and sent the log here


if you wait longer, then the speed may very well increase.

Under heavy CPU load the frequency stays locked at 1800MHz and does not increase any further.

spandruvada commented 3 years ago

What @benzea is saying is: first start thermald and wait for a couple of minutes (the power level will reach its maximum), then run the stress -c 8 test. Also, what is the output of cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency?

Eventually, with stress -c 8, you will reach this frequency.
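
To compare what each core is actually running at against the base frequency, something like the following can be used. A sketch; cpu_freqs and CPU_ROOT are hypothetical names, and base_frequency is only present with some cpufreq drivers, so it is printed as 0 when absent:

```shell
# Print the current and base frequency in MHz for every CPU that has a
# cpufreq directory. sysfs reports both values in kHz.
cpu_freqs() (
    root=${CPU_ROOT:-/sys/devices/system/cpu}
    for d in "$root"/cpu[0-9]*/cpufreq; do
        cur=$(cat "$d/scaling_cur_freq" 2>/dev/null) || continue
        base=$(cat "$d/base_frequency" 2>/dev/null || echo 0)
        cpu=${d%/cpufreq}; cpu=${cpu##*/}     # ".../cpu3/cpufreq" -> "cpu3"
        echo "$cpu cur=$((cur / 1000))MHz base=$((base / 1000))MHz"
    done
)
```

Running cpu_freqs a few times while the stress test is active shows whether the cores settle at the base frequency or below it.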

mazzz1y commented 3 years ago

Also what is?

cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency
1800000

Interesting. According to link, it should be 3GHz. Do you know why that is? On another laptop with an 11th-gen i5 CPU, it shows the correct frequency.

What @benzea is saying is: first start thermald and wait for a couple of minutes (the power level will reach its maximum), then run the stress -c 8 test.

Sure, I can do that a bit later and will post the result. But based on my experience with this laptop it will be the same, and the frequency will be locked at 1800MHz.

spandruvada commented 3 years ago

This value comes from the hardware, so this is what it is configured for.

mazzz1y commented 3 years ago

Sure, but Windows shows the correct value and can sustain a higher frequency. So is the problem not in thermald? Do you have any other guess? As we can see, I am not the only one with this problem.

Thanks a lot for your support

mazzz1y commented 3 years ago

I downloaded Windows 10 and tried to reproduce the problem.

Windows shows the base frequency as 1800MHz (I was wrong before; it looks like it depends on the laptop vendor's settings). I tested the CPU with a stress test and saw that it can sustain 3300MHz without any problems for a long time.

Now I have rebooted into Linux and see that the CPU is still stuck at 1800MHz (from what I found online, that corresponds to 15W). So I'm sure the problem is in some Linux component, but I don't know which one.

What changes were made in the latest version of thermald? That version fixes the issue partially, so could there be a related problem?

spandruvada commented 3 years ago

Run this and attach the tar file. https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-fedora.sh

or for Ubuntu https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-ubuntu.sh

mazzz1y commented 3 years ago

Sure, please check: 01141200.tar.gz

I've changed bz2 to gz, because github doesn't allow bz2 uploads

Also I fixed a small typo in the script: https://github.com/intel/thermal_daemon/pull/308

ColinIanKing commented 3 years ago

On 01/07/2021 09:27, Dmitry Rubtsov wrote:

Sure, please check: 01141200.tar.gz https://github.com/intel/thermal_daemon/files/6746639/01141200.tar.gz

I've changed bz2 to gz, because github doesn't allow bz2 uploads

Also I fixed a small typo in the script:

  • stress-ng ---cpu 16
  • stress-ng --cpu 16

if you use stress-ng --cpu -1 then stress-ng will automatically allocate 1 stressor per CPU

mazzz1y commented 3 years ago

if you use stress-ng --cpu -1 then stress-ng will automatically allocate 1 stressor per CPU

I just removed the extra hyphen; please check the script:

dell ~/throttle-debug » stress-ng ---cpu 16 
stress-ng: unrecognized option '---cpu'
Try 'stress-ng --help' for more information.

mazzz1y commented 3 years ago

@ColinIanKing I added your recommendation to the https://github.com/intel/thermal_daemon/pull/308 pull request, thanks.

mazzz1y commented 3 years ago

I tried the 5.13.0 mainline kernel; nothing has changed.

spandruvada commented 3 years ago

The problem is that the TMEM sensor reaches its limit of 42C within 4 seconds, so the system is throttled from max power. Even at the start, the temperature is 39C, so there is not much margin. Not sure what can be done here.
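
To see how much margin a sensor has before its trip points, the zone's temperature and trips can be read from sysfs. A sketch, assuming TMEM is exposed as a thermal zone type (which, as seen earlier in this thread, requires the int3403_thermal driver to be loaded); zone_trips and THERMAL_ROOT are hypothetical names:

```shell
# For every thermal zone whose type equals `name`, print the current
# temperature and each trip point in whole degrees Celsius.
zone_trips() (
    name=$1
    root=${THERMAL_ROOT:-/sys/class/thermal}
    for z in "$root"/thermal_zone*; do
        [ "$(cat "$z/type" 2>/dev/null)" = "$name" ] || continue
        # sysfs reports temperatures in millidegrees Celsius.
        temp=$(cat "$z/temp" 2>/dev/null) || continue
        echo "temp: $((temp / 1000))C"
        for t in "$z"/trip_point_*_temp; do
            value=$(cat "$t" 2>/dev/null) || continue
            kind=$(cat "${t%_temp}_type" 2>/dev/null || echo unknown)
            echo "trip ($kind): $((value / 1000))C"
        done
    done
)
```

Running zone_trips TMEM at idle and again under load gives a quick view of how close the sensor sits to its limit. Whether the trip points listed in sysfs match the limits thermald takes from its adaptive tables is not something this sketch can confirm.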

mazzz1y commented 3 years ago

Do you have any idea why it works properly in Windows? Is it possible to ignore this sensor?