intel / thermal_daemon

Thermal daemon for IA
GNU General Public License v2.0

Dell Latitude 5x20/7x20 throttled to ~1500MHz/1800MHz with 2.4.8 #334

Closed VitaliiSerdiuk closed 2 years ago

VitaliiSerdiuk commented 2 years ago

Continuation of https://github.com/intel/thermal_daemon/issues/293#, as it wasn't properly fixed.

When using a Google Meet video conference plus another browser search, the CPU is throttled to 1500 MHz. Latitude 5420, BIOS 1.14.1, Ubuntu 20.04, kernel 5.15.13.

binboum commented 2 years ago

Which version of thermald are you using? Check with thermald --version

Have you compiled and installed version 2.4.8?

I have a Latitude 7490, and since the last release I'm no longer stuck at 400 MHz.

sebastianha commented 2 years ago

It only affects the 7x20 series. Also, "stuck at 400 MHz" is not correct: with thermald it is stuck at 1800 MHz after a short while. The issue is misleading; you have to dig through #293 for all the information. But the original issue was closed without reason, so I think this issue is only here as a reminder that it has not been solved.

VitaliiSerdiuk commented 2 years ago

@binboum Yes, I compiled version 2.4.8 and I am always stuck at 1500 MHz under load on the 5420.

spamik commented 2 years ago

One funny thing, not sure if it means anything. I tried reloading these kernel modules, as I had seen in the throttled ticket:

rmmod intel_rapl_msr
rmmod processor_thermal_device_pci_legacy
rmmod processor_thermal_device
rmmod processor_thermal_rapl
rmmod intel_rapl_common
rmmod intel_powerclamp

modprobe intel_powerclamp
modprobe intel_rapl_common
modprobe processor_thermal_rapl
modprobe processor_thermal_device
modprobe processor_thermal_device_pci_legacy
modprobe intel_rapl_msr

Behaviour after that is almost the same (still fixed at 1800 MHz under heavy load, with temperatures around 50°C), but two things changed. First, the start of the load is better: for 1-2 seconds I can see a frequency over 4 GHz and a CPU temperature around 90°C (before that I had never reached more than 60°C). Second, and strangely, I can hear a clicking noise coming from the laptop :-) It happens sometimes: a few clicks in a row, and after that it is silent for several tens of seconds or minutes...
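
The reload sequence above unloads six modules and then loads them again in exactly the reverse order. A minimal sketch of that as a script follows. The function name `reload_thermal_modules` and the `DRY_RUN` switch are illustrative assumptions, not part of thermald or the kernel tools. By default it only prints the commands.

```shell
# Modules in the unload order quoted above; they are loaded again in reverse.
MODULES="intel_rapl_msr processor_thermal_device_pci_legacy processor_thermal_device processor_thermal_rapl intel_rapl_common intel_powerclamp"

# reload_thermal_modules (hypothetical helper)
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs them (needs root).
# The body runs in a subshell, so its variables do not leak into the caller.
reload_thermal_modules() (
  run=""
  [ "${DRY_RUN:-1}" = "1" ] && run="echo"
  reversed=""
  for m in $MODULES; do
    $run rmmod "$m"
    reversed="$m $reversed"   # prepend, so the list ends up reversed
  done
  for m in $reversed; do
    $run modprobe "$m"
  done
)
```

Running `reload_thermal_modules` prints the twelve commands; `DRY_RUN=0 reload_thermal_modules` as root would execute them.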

sebastianha commented 2 years ago

Dell is collecting information: https://www.dell.com/community/Latitude/Latitude-5420-7420-7520-CPU-Throttling-Issue-on-Linux/m-p/8129749/highlight/true#M39458

pjssilva commented 2 years ago

It seems also to affect the Lat. 5421 with the i5-11500H processor. It dips to 800 MHz under full load for some time, and bounces back to the correct 2900 MHz in a continuous cycle.

In my system, a workaround is to change the power profile in the BIOS from "Optimized" to "Ultra performance" (you get louder fans, but the full processor speed). It usually also works well with the "Cool" and "Quiet" profiles. The main problem seems to be the default "Optimized" profile.

Note: I am using Pop-OS 21.10, kernel 5.15.15, and a self-compiled thermald 2.4.8.

binboum commented 2 years ago

I think I found a workaround: when I charge via USB PD, I don't have the problem.

With the normal charger, I can confirm the problems mentioned.

sebastianha commented 2 years ago

What exactly do you mean with Charging via USB or "normal"? My 7320 has only USB-C for charging.

binboum commented 2 years ago

> What exactly do you mean with Charging via USB or "normal"? My 7320 has only USB-C for charging.

Like : https://www.dell.com/en-uk/work/shop/dell-uk-65-watt-3-prong-ac-adapter-with-1meter-power-cord/apd/450-aixz/pc-accessories

sebastianha commented 2 years ago

Which model do you have? My 7320 has no "normal" power plug.

PhilipGB commented 2 years ago

I suspect he means charging with the USB-C wall charger versus power delivery from something like a dock.

sebastianha commented 2 years ago

This is something I have already checked, but it did not solve the problem for me.

PhilipGB commented 2 years ago

> This is something I have already checked, but it did not solve the problem for me.

Dell dock or other brand?

I've not tested this myself yet, but I have a Dell TB dock due to arrive soon, so I'll report back.

sebastianha commented 2 years ago

Both with the original AC adapter and with my connected 90 W Dell monitor. I also checked the BIOS; both are recognized correctly.

VitaliiSerdiuk commented 2 years ago

Looks like Dell completely doesn't care about Linux/Open source...

JoshuaPK commented 2 years ago

With kernel 5.16.10-1.el8.elrepo.x86_64 the issue still exists here as well. One other thing just came to mind: the fan sensors don't work. On all of my previous Dell laptops, the system was able to read the fan speed through lm_sensors. In this case the fan speed is not available. I wonder if the lack of fan speed data is causing thermald to make some assumptions that aren't correct.

spandruvada commented 2 years ago

Please run https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-fedora.sh or https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-ubuntu.sh. I can check thermal tables first.

JoshuaPK commented 2 years ago

Here are the requested dumps. [23111824.tar.gz](https://github.com/intel/thermal_daemon/files/8126088/23111824.tar.gz)

sebastianha commented 2 years ago

@JoshuaPK for which model are these? I could provide Latitude 7320 with openSuse if needed.

pjssilva commented 2 years ago

Here is the file for a 5421 with an i5-11500H: 23173225.tar.gz. The clock decreases to 800 MHz even with the laptop sitting on top of a Zalman laptop base with a fan.

JoshuaPK commented 2 years ago

@sebastianha my apologies. I have a Latitude 5420 with an i5-1145G7. I have seen a number of scenarios, but the most frequent are throttling down to 1.5GHz and throttling down to 400MHz. In my case the throttling goes away when the load decreases. So, for example, if I try to compress an ISO file with 7za it will throttle, then if I stop 7za it will jump back up to 4GHz. This is thermald 2.4.8 that I compiled from source, running on Rocky Linux 8.5.

spandruvada commented 2 years ago

[Attached plot: issue_334_stress_ng]

spandruvada commented 2 years ago

The guaranteed frequency is 1500 MHz on this system. With 100% load, the system was able to run about 1000 MHz above the guaranteed frequency for 80% of the time. What is the expectation? The system can't sustain turbo forever.

As a test, you can manually adjust the power limit to see whether you can keep the system from reaching peak turbo and hold it above 1500 MHz without thermal throttling. Try:

systemctl disable thermald

Reboot, then:
echo 28000000 > /sys/devices/virtual/powercap/intel-rapl-mmio/intel-rapl-mmio:0/constraint_0_power_limit_uw
echo 28000000 > /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
You can then run turbostat:
turbostat --show Core,CPU,Busy%,Bzy_MHz,TSC_MHz -o turbostat.out
and check the frequencies.
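
The two writes above put the same value, 28000000 µW (28 W), into the MMIO and MSR RAPL package zones. A minimal sketch of a helper that takes watts and does the conversion is shown below. The function name `set_rapl_limit` and its optional arguments are hypothetical, and the third argument exists only so the function can be tried against a scratch directory instead of the real sysfs tree.

```shell
# set_rapl_limit WATTS [CONSTRAINT] [POWERCAP_ROOT]   (hypothetical helper)
# Writes WATTS, converted to microwatts, to constraint_<CONSTRAINT>_power_limit_uw
# in both package zones named in the comment above. Needs root on a real system.
set_rapl_limit() (
  watts=$1
  constraint=${2:-0}
  root=${3:-/sys/devices/virtual/powercap}
  uw=$((watts * 1000000))   # sysfs expects microwatts
  for zone in intel-rapl-mmio/intel-rapl-mmio:0 intel-rapl/intel-rapl:0; do
    f="$root/$zone/constraint_${constraint}_power_limit_uw"
    if [ -w "$f" ]; then
      echo "$uw" > "$f"
      echo "$zone: constraint_$constraint set to $uw uW"
    else
      echo "$zone: $f not writable, skipped" >&2
    fi
  done
)

# Example (as root, with thermald disabled): set_rapl_limit 28
```

Reading the files back with `cat` afterwards is a quick check that the firmware accepted the value.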

spandruvada commented 2 years ago

@pjssilva Somehow there is no turbostat output in the tar file. Can you see if you have turbostat on your system?

VitaliiSerdiuk commented 2 years ago

@spandruvada Could you please explain more about "The guaranteed frequency is 1500 MHz on this system"? Is it 1500 MHz for Linux and 2600 MHz for Windows, am I correct? The base frequency of the Intel Core i5-1145G7 is 2600 MHz. I do not understand how this works, or why there is such a big difference between operating systems.

spandruvada commented 2 years ago

@pjssilva Can you build thermald yourself with the attached patch? 0001-Test-patch-to-fail-to-run-adaptive.zip

Unzip it and apply it with
git am 0001-Test-patch-to-fail-to-run-adaptive.patch
then build using make (procedure in README.txt).

Then run
systemctl stop thermald
thermald --loglevel=debug --no-daemon --adaptive
and attach the log. Run some workload at the same time.

spandruvada commented 2 years ago

@VitaliiSerdiuk The guaranteed frequency doesn't change between Linux and Windows. You have a power budget: either you spend more of it initially and get more short-term performance, or you use it moderately so that you do not drop to 1500 MHz. Linux can't match every condition in the thermal table the way Windows does, so Windows may be doing better, but I can't say without actually running a similar workload. You are running a web/video workload, which Windows can manage better through its use of HW acceleration. Its power usage profile will therefore be different, so you may not drop to 1500 MHz there.
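
To compare the numbers being discussed on a given machine, the cpufreq sysfs files can be read directly. The sketch below is hedged: it assumes the intel_pstate driver exposes a `base_frequency` file in kHz next to `scaling_cur_freq` (not every kernel or configuration does), and whether that value equals the "guaranteed" frequency mentioned above is not established in this thread. The function name `show_cpu_freqs` is hypothetical.

```shell
# show_cpu_freqs [CPUFREQ_DIR]   (hypothetical helper)
# Prints selected cpufreq values in MHz; sysfs reports them in kHz.
show_cpu_freqs() (
  dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
  for name in base_frequency scaling_cur_freq cpuinfo_max_freq; do
    if [ -r "$dir/$name" ]; then
      khz=$(cat "$dir/$name")
      echo "$name: $((khz / 1000)) MHz"
    else
      echo "$name: not available"
    fi
  done
)

# Example: show_cpu_freqs
```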

pjssilva commented 2 years ago

@spandruvada I have installed turbostat and reran the test from the thermal-debug-dump-ubuntu.sh (I run Pop-OS, Ubuntu is the closest). 23223022.tar.gz I will now try to apply the patch you asked in the other message above and will report the results.

pjssilva commented 2 years ago

@spandruvada Here is the thermald log after applying the 0001-Test-patch-to-fail-to-run-adaptive.zip above. Before collecting the log I left the system running stress-ng for 10 minutes or so. During the test the system ran at 3600-3700 MHz for some time, then dropped quickly to 800 MHz for some time, and then restarted this cycle. The expected sustained clock for this processor is at least 2400 MHz, so 800 MHz is clearly a problem.

thermald.log

JoshuaPK commented 2 years ago

This may seem like a silly question, but are some platforms safe to run without thermald? Using 7za to create an archive of a ~60gb directory. With thermald running, the system is throttled down to 1.5GHz and stays there, with the core temperature staying steady at around 135F. If I kill thermald, then the system fluctuates between 2.4 and 3.1GHz (and does not remain below 2.4GHz for any length of time) and the core temperature fluctuates around 140-150F. I also threw in 8 threads of stress-ng and a YouTube video and the processor still fluctuated around 2.4 but never dipped below 2 for any length of time (temps remained the same around 140). According to spec the maximum core temperature is 212F. It appears that in this case the processor is doing a good job of regulating itself. What am I missing?
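
The temperatures in this comment are in Fahrenheit, while the rest of the thread uses Celsius. A small conversion sketch (the helper name `f_to_c` is illustrative) makes them comparable, using C = (F - 32) * 5 / 9:

```shell
# f_to_c FAHRENHEIT: prints the Celsius value rounded to one decimal place.
f_to_c() {
  awk -v f="$1" 'BEGIN { printf "%.1f\n", (f - 32) * 5 / 9 }'
}

# Values quoted in the comment above.
for f in 135 140 150 212; do
  echo "${f}F = $(f_to_c "$f")C"
done
```

So the 135F reading with thermald running is about 57°C, the 140-150F range without it is about 60-66°C, and the 212F limit is 100°C.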

sebastianha commented 2 years ago

@spandruvada I understand the case with the guaranteed frequency but the problem is, that on my system the fan is not spinning at 100% but the system is throttled down.

In my understanding throttling should only happen when all thermal regulators are maxed out, this means: fan 100%, CPU temperature at 95°C.

Under full load my system is running the fan at ~25-50% and temperature is at 50°C. There is definitely room for more power.

I will deliver the outputs of the scripts as soon as I have some spare time.

spandruvada commented 2 years ago

@pjssilva Still the same problem. I need to recheck what in the table is preventing it from loading. I may generate another patch.

spandruvada commented 2 years ago

@JoshuaPK The processor has built-in control. But thermald is about the other parts of the system and about keeping the skin temperature within spec. The manufacturer may already have ensured that out of the box. The system may also already have all the power tables configured correctly without thermald. So disable thermald, cold reboot, check whether you still get good performance, and then decide.

spandruvada commented 2 years ago

@sebastianha I am not sure if there is any fan control available. Run:
# cat /sys/class/thermal/cooling_device*/type
Do you see any names other than Processor, LCD, and intel_powerclamp? Something like "Fan" or "TFN"?

sebastianha commented 2 years ago

No:

~> cat /sys/class/thermal/cooling_device*/type
Processor
Processor
Processor
Processor
Processor
Processor
Processor
Processor
intel_powerclamp
TCC Offset

spandruvada commented 2 years ago

So unfortunately we can't control the fans. Do you see the same behavior as in the plot I attached above?
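
The check above, looking through the cooling device types for something like "Fan" or "TFN", can be sketched as a small function. The name `has_fan_cooling_device` is hypothetical, and matching on the substrings `Fan` and `TFN` is an assumption based only on the names suggested in this thread.

```shell
# has_fan_cooling_device [THERMAL_ROOT]   (hypothetical helper)
# Prints every cooling device whose type contains "Fan" or "TFN" and
# returns 0 if at least one was found, non-zero otherwise.
has_fan_cooling_device() (
  root=${1:-/sys/class/thermal}
  found=1
  for f in "$root"/cooling_device*/type; do
    [ -r "$f" ] || continue
    type=$(cat "$f")
    case "$type" in
      *Fan*|*TFN*)
        echo "fan-like cooling device: $type (${f%/type})"
        found=0
        ;;
    esac
  done
  [ "$found" -eq 0 ]
)

# Example:
#   has_fan_cooling_device || echo "no fan cooling device exposed"
```

With the device list posted above (Processor, intel_powerclamp, TCC Offset) this would report that no fan cooling device is exposed.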

sebastianha commented 2 years ago

No, I see something like this (manually plotted data):

[Image: Screenshot_20220224_165527]

For a second I get full speed and a high temperature, then it instantly drops to ~2 GHz and settles at 1800 MHz after some time. The temperature is always ~55°C and the fan does not kick in at all.

Update: What I noticed is that the fan immediately kicks in when the GPU or SSD is under load.
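
Data like the manually plotted series above can also be collected with a small sampler. This is only a sketch under assumptions: it reads `scaling_cur_freq` of cpu0 (kHz) and the first thermal zone whose type is `x86_pkg_temp` (millidegrees Celsius), and which zone reflects the CPU package differs between systems. The function name `sample_line` is hypothetical.

```shell
# sample_line [CPUFREQ_DIR] [THERMAL_ROOT]   (hypothetical helper)
# Prints "<MHz>,<degrees C>" for cpu0 and the x86_pkg_temp thermal zone,
# or "NA" for the temperature if no such zone exists.
sample_line() (
  cpufreq=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
  thermal=${2:-/sys/class/thermal}
  khz=$(cat "$cpufreq/scaling_cur_freq" 2>/dev/null || echo 0)
  temp_c="NA"
  for z in "$thermal"/thermal_zone*; do
    [ -r "$z/type" ] || continue
    if [ "$(cat "$z/type")" = "x86_pkg_temp" ]; then
      temp_c=$(( $(cat "$z/temp") / 1000 ))
      break
    fi
  done
  echo "$((khz / 1000)),$temp_c"
)

# Example: one sample per second, as CSV with a Unix timestamp.
#   while sleep 1; do echo "$(date +%s),$(sample_line)"; done > samples.csv
```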

sebastianha commented 2 years ago

I also tested with the 0001-Patch: thermald-patch0001.txt

spandruvada commented 2 years ago

@pjssilva I made a silly mistake. Can you retry with this patch: 0001-Test-patch-to-fail-to-run-adaptive-ver-2.zip

sebastianha commented 2 years ago

0001-Test-patch-to-fail-to-run-adaptive-ver-2.log

100% load, 50-55°C, no fan, fixed at 1800 MHz the whole time.

pjssilva commented 2 years ago

@spandruvada It looks like you have something there! I ran the stress test for 15 minutes and the processor never dipped below 2.9 GHz (which is the base clock of the Core i5-11500H configured with high TDP). Take a look at the log below. I will try some other tests, but it looks promising! thermald.log

Note: The dip at the end of the test is because I stopped stress-ng.

spandruvada commented 2 years ago

@sebastianha, you have some other issue. First I want to address the systems that get stuck at 800 MHz. Please run https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-ubuntu.sh or the Fedora one there and attach the outputs.

spandruvada commented 2 years ago

@pjssilva What is the make and model of your system?

pjssilva commented 2 years ago

@spandruvada My system is a Dell Latitude 5421, with a Core i5-11500H, 32 GB of RAM, and an Nvidia MX 450 graphics card running in hybrid mode.

PhilipGB commented 2 years ago

> Please run https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-fedora.sh or https://github.com/intel/thermal_daemon/blob/master/test/thermal-debug-dump-ubuntu.sh. I can check thermal tables first.

24214023.tar.gz

Dell Latitude 7320 i7-1185G7 on Ubuntu 20.04.3 running Kernel 5.14.0-1024-oem

thermald.log

spandruvada commented 2 years ago

@PhilipGB I don't see any throttling done by thermald. Can you try this procedure as a root user

spandruvada commented 2 years ago

If that doesn't address it, the next step is to also write the following, in addition to the above:
echo 54000000 > /sys/devices/virtual/powercap/intel-rapl-mmio/intel-rapl-mmio\:0/constraint_1_power_limit_uw

PhilipGB commented 2 years ago

Same behaviour. It very briefly clocks to 4.3 GHz, then settles at 1.8 GHz.


I have observed that if I disable the thermald service and then reboot, the clock speed floats around 2.3 GHz with intermittent drops to 400 MHz under load.

But if thermald was enabled at boot and then stopped, or has been run and then stopped, the system behaves the same as if it were still running, settling at 1.8 GHz.

spandruvada commented 2 years ago

Why are you using the cpufreq performance governor? Try powersave; it will not hurt performance in most cases.
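
A minimal sketch for checking and switching the governor through the standard cpufreq sysfs files follows. The function names `show_governors` and `set_governor` are hypothetical, and the optional root argument exists only so they can be tried on a scratch directory. Writing the real files needs root.

```shell
# show_governors [CPU_ROOT]   (hypothetical helper)
# Prints how many CPUs use each governor, e.g. "8 performance".
show_governors() (
  root=${1:-/sys/devices/system/cpu}
  cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
)

# set_governor NAME [CPU_ROOT]   (hypothetical helper)
# Writes NAME to scaling_governor of every CPU that exposes a writable file.
set_governor() (
  gov=$1
  root=${2:-/sys/devices/system/cpu}
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -w "$f" ] && echo "$gov" > "$f"
  done
  true
)

# Example (as root): set_governor powersave && show_governors
```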

spandruvada commented 2 years ago

Also check:
cat /sys/bus/pci/devices/0000\:00\:04.0/tcc_offset_degree_celsius
If this is a high number, write something like "5".
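
The read-then-write step above can be sketched as two small functions. The names `get_tcc_offset` and `set_tcc_offset` are hypothetical. The default path is the one quoted above and may differ on other systems, and the value 5 is only the example given in this thread, not a recommendation.

```shell
# Default path as quoted above; the PCI address may differ on other systems.
TCC_FILE=/sys/bus/pci/devices/0000:00:04.0/tcc_offset_degree_celsius

# get_tcc_offset [FILE]: prints the current TCC offset in degrees Celsius.
get_tcc_offset() {
  cat "${1:-$TCC_FILE}"
}

# set_tcc_offset VALUE [FILE]: writes a new offset (needs root on a real system).
set_tcc_offset() {
  echo "$1" > "${2:-$TCC_FILE}"
}

# Example (as root):
#   get_tcc_offset
#   set_tcc_offset 5
```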