intel / thermal_daemon

Thermal daemon for IA
GNU General Public License v2.0

Dell Latitude 5420 (TGL) throttled to 400MHz CPU / 100MHz GPU after 30s #293

Status: Closed (majanes-intel closed this issue 2 years ago)

majanes-intel commented 3 years ago

Kernel: 5.11.3
Debian: Testing
thermald: 2.4.3 (Debian unstable)
Processor: i7-1185G7 (28 W TDP)

After running power-intensive workloads for a short amount of time, the CPU and/or GPU will be throttled down drastically to ~10% of peak.

Running turbostat reveals that the peak power is ~16 W, far below the TDP limit.

Running lm-sensors shows that the peak temp is ~50C, far below the limit.

After reading #291 and #280, I enabled debug logs for thermald (attached: thermald.log).

@spandruvada let me know if more information is needed. I can also bring the system to you in JF1. Mesa team will be using this laptop model for perf analysis.

spamik commented 2 years ago

Latitude 7320 with the latest BIOS (1.12.2) and an i7-1185G7: same issue. I have also noticed a difference between running thermald and doing an rmmod & modprobe of the intel_rapl_msr kernel module. With thermald running it's better (without it, the CPU runs at 400MHz almost 100% of the time), but under load it still drops to 400MHz a few times and comes back up. If I stop thermald and do the rmmod & modprobe hack, it runs at 1800MHz all the time.

sebastianha commented 2 years ago

I can confirm the problem with the 7320. I am also running BIOS 1.12.2 and an i7-1185G7, on openSUSE Tumbleweed with kernel 5.15.5-1-default and thermald 2.4.6 from the repo.

Update: Also tested with the latest version compiled from the repo.

sebastianha commented 2 years ago

Should we open a new ticket for Latitude 7320?

sebastianha commented 2 years ago

I just tested temperature control with Windows, and even there the clock speed drops to 2GHz at 70°C after some time (I installed the Intel Dynamic Tuning driver and all available updates from Dell).

sebastianha commented 2 years ago

Searching through other forums I did run "dptfxtract" on my system and used this in combination with thermald. Interestingly it generates the following values in "/etc/thermald/thermal-conf.xml.auto":

<PPCC>
        <PowerLimitIndex>0</PowerLimitIndex>
        <PowerLimitMinimum>5000</PowerLimitMinimum>
        <PowerLimitMaximum>13500</PowerLimitMaximum>
        <TimeWindowMinimum>28000</TimeWindowMinimum>
        <TimeWindowMaximum>32000</TimeWindowMaximum>
        <StepSize>100</StepSize>
</PPCC>

When starting thermald I can observe that "/sys/devices/virtual/powercap/intel-rapl-mmio/intel-rapl-mmio:0/constraint_0_power_limit_uw" is set to "13500000", which is way too low for this CPU. When not running thermald it is set to "15000000", which is also too low. I would expect something > "40000000".
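The mismatch between the generated PPCC table and the CPU's rated power can be checked mechanically. This is a minimal Python sketch, not anything thermald itself does: it assumes the PPCC power values are in milliwatts (consistent with 13500 turning into 13500000 µW in sysfs) and takes the 28 W TDP from the first post as the reference. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# PPCC fragment as generated by dptfxtract (copied from the comment above).
PPCC_XML = """
<PPCC>
        <PowerLimitIndex>0</PowerLimitIndex>
        <PowerLimitMinimum>5000</PowerLimitMinimum>
        <PowerLimitMaximum>13500</PowerLimitMaximum>
        <TimeWindowMinimum>28000</TimeWindowMinimum>
        <TimeWindowMaximum>32000</TimeWindowMaximum>
        <StepSize>100</StepSize>
</PPCC>
"""


def ppcc_max_below_tdp(xml_text, tdp_watts):
    """Return (max_watts, is_below_tdp) for one PPCC fragment.

    Assumption: PowerLimitMaximum is expressed in milliwatts.
    """
    root = ET.fromstring(xml_text.strip())
    max_mw = int(root.findtext("PowerLimitMaximum"))
    max_watts = max_mw / 1000.0
    return max_watts, max_watts < tdp_watts


max_w, too_low = ppcc_max_below_tdp(PPCC_XML, tdp_watts=28)
print(f"PPCC max: {max_w} W, below 28 W TDP: {too_low}")
```

A result of "below TDP" here matches the complaint that the firmware-provided limit is lower than what the processor is rated for.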

When I run "powercap-info -p intel-rapl" it shows me:

Zone 0
  name: package-0
  enabled: 1
  max_energy_range_uj: 262143328850
  energy_uj: 11108644596
  Constraint 0
    name: long_term
    power_limit_uw: 35000000
    time_window_us: 27983872
    max_power_uw: 28000000
  Constraint 1
    name: short_term
    power_limit_uw: 60000000
    time_window_us: 2440
    max_power_uw: 0
  Constraint 2
    name: peak_power
    power_limit_uw: 105000000
    time_window_us: 0
    max_power_uw: 0

which looks correct to me.

So now the questions: What sets the limit to "15000000" when thermald is not running, and how can I get a correct thermald configuration?
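Note that the two readings above come from different powercap zones: the 13.5 W / 15 W value was read from intel-rapl-mmio, while "powercap-info -p intel-rapl" reports the intel-rapl zone. A small Python sketch for reading both side by side follows. The zone paths are the ones that appear in this thread; the helper name is hypothetical, and the sketch only reads values, it changes nothing.

```python
from pathlib import Path

# Zone directories as they appear in this thread.
ZONES = {
    "msr": "/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0",
    "mmio": "/sys/devices/virtual/powercap/intel-rapl-mmio/intel-rapl-mmio:0",
}


def read_constraints(zone_dir):
    """Map constraint name -> power limit in watts for one powercap zone."""
    zone = Path(zone_dir)
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        idx = name_file.name.split("_")[1]
        limit_file = zone / f"constraint_{idx}_power_limit_uw"
        name = name_file.read_text().strip()
        limits[name] = int(limit_file.read_text()) / 1e6  # microwatts -> watts
    return limits


if __name__ == "__main__":
    for label, path in ZONES.items():
        try:
            print(label, read_constraints(path))
        except OSError as err:
            print(label, "not readable:", err)
```

If the long_term value differs between the two zones, that would at least narrow down which interface is holding the package back.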

My current guess is that the hardware somehow does not expose the correct values to the system, and therefore the limits are set too low. Some report that installing the "Intel Dynamic Tuning driver" and rebooting to Linux solves the problem. I can only imagine that this driver corrects the values and that they persist across a reboot (but not a shutdown).

sebastianha commented 2 years ago

When running thermald with "--adaptive" it improves a little. I see "constraint_0_power_limit_uw" increasing up to "35000000", which looks fine. Nevertheless, the temperature still does not rise above 50°C.

I attached a debug log from thermald; perhaps it helps (attached: thermald.log).

Grtschnk commented 2 years ago

I also see a bit of "cycling" when stress testing with modified/higher RAPL values: Under load the power is capped at 10W (and temperatures at 50C), but every 4-5 seconds it jumps up to ~20W only to be capped again a second later. It seems that thermald (or throttled or any other tools) set the correct values, but some other controlling mechanism kicks in a second later. Is it some other driver in the kernel or maybe even the BIOS? (which needs some obscure bit flipped, that only Windows knows about.)
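To make this kind of cycling visible without turbostat, package power can be derived from the RAPL energy counter. The following is only a sketch under a few assumptions: the zone path is the intel-rapl one seen elsewhere in this thread, the function names are invented, and reading energy_uj typically requires root on recent kernels.

```python
import time
from pathlib import Path


def average_power_w(e_start_uj, e_end_uj, seconds, max_range_uj):
    """Average power between two energy_uj readings, allowing one wraparound."""
    delta = e_end_uj - e_start_uj
    if delta < 0:  # counter wrapped past max_energy_range_uj
        delta += max_range_uj
    return delta / 1e6 / seconds


def sample_power(zone="/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0",
                 interval=0.5, samples=20):
    """Print average package power once per interval."""
    zone = Path(zone)
    max_range = int((zone / "max_energy_range_uj").read_text())
    prev_e = int((zone / "energy_uj").read_text())
    prev_t = time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        e = int((zone / "energy_uj").read_text())
        t = time.monotonic()
        print(f"{average_power_w(prev_e, e, t - prev_t, max_range):6.2f} W")
        prev_e, prev_t = e, t
```

Running sample_power() during a stress test with a sub-second interval should show the jump to ~20 W and the drop back to ~10 W as separate samples.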

sebastianha commented 2 years ago

From my observations, the main problem now is the 50°C barrier combined with the fact that the fan does not spin up as high as it should. With thermald the power limit is set correctly (as far as I understand), but somehow the system tries to stay at the 50°C temperature limit and prefers to throttle down instead of going up to ~95°C and letting the fan run at full speed.

benzea commented 2 years ago

It seems that thermald (or throttled or any other tools) set the correct values …

Yeah, thermald has a poll interval of 4 seconds. So that does indeed sound like something else is interfering with the controls.

No real clue here, but maybe a different UUID needs to be set in /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/? You can check the available UUIDs and set it there, with --adaptive thermald is setting 63BE270F-1C11-48FD-A6F7-3AF253FF3E2D otherwise it'll set 42A441D6-AE6A-462b-A84B-4A8CE79027D3, AFAICT.
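For anyone who wants to check this on their own machine, here is a minimal Python sketch that reads the two files in that directory. The sysfs path is the one given above; the mapping of UUIDs to thermald modes only repeats what is stated in this comment (which is itself an "AFAICT"), and the function names are hypothetical.

```python
from pathlib import Path

# UUIDs quoted in this thread and the mode each was reported with.
KNOWN_UUIDS = {
    "63BE270F-1C11-48FD-A6F7-3AF253FF3E2D": "set by thermald with --adaptive",
    "42A441D6-AE6A-462B-A84B-4A8CE79027D3": "set by thermald without --adaptive",
}


def describe_uuid(value):
    """Return a short note for a UUID string (case-insensitive lookup)."""
    key = value.strip().upper()
    return KNOWN_UUIDS.get(key, "not one of the UUIDs mentioned here")


def read_uuid_state(uuid_dir="/sys/bus/acpi/devices/INTC1040:00/physical_node/uuids"):
    """Return (current_uuid, [available_uuids]) as exposed by the kernel."""
    d = Path(uuid_dir)
    current = (d / "current_uuid").read_text().strip()
    available = (d / "available_uuids").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_uuid_state()
        print("current:", current, "->", describe_uuid(current))
        print("available:", available)
    except OSError as err:
        print("UUID files not readable:", err)
```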

sebastianha commented 2 years ago

A little observation: When I start thermald automatically on boot via systemd, I get stuck at 400MHz under load. When I restart thermald after the system has booted up completely, the new lowest frequency is 1800MHz. There seems to be something interfering.

ColinIanKing commented 2 years ago

Are there other processes accessing the various /sys interfaces that may be twiddling thermal controls? One way to check that is using a tool like fnotifystat to see file accesses:

sudo fnotifystat -i /sys -x /sys/fs/cgroup

sebastianha commented 2 years ago

I can only see thermald and ksystemstats (which is a cpu monitoring widget I use, I guess):

Total   Open  Close   Read  Write   PID  Process         Pathname
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy2/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy3/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy4/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy5/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy6/scaling_cur_freq
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy7/scaling_cur_freq
  3.0    1.0    1.0    1.0    0.0  16628 thermald        /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj
  3.0    1.0    1.0    1.0    0.0  16628 thermald        /sys/devices/virtual/thermal/thermal_zone1/temp
  3.0    1.0    1.0    1.0    0.0  16628 thermald        /sys/devices/virtual/thermal/thermal_zone2/temp
  3.0    1.0    1.0    1.0    0.0  16628 thermald        /sys/devices/virtual/thermal/thermal_zone3/temp
  3.0    1.0    1.0    1.0    0.0  16628 thermald        /sys/devices/virtual/thermal/thermal_zone4/temp
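The interesting column for this question is Write, since only a writer could be changing the controls. A small Python sketch that filters fnotifystat output for writers is shown here; it assumes the column layout of the table above and process names without spaces, and the function names are made up.

```python
SAMPLE = """\
Total   Open  Close   Read  Write   PID  Process         Pathname
  6.0    2.0    2.0    2.0    0.0   3515 ksystemstats    /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq
  3.0    1.0    1.0    1.0    0.0  16628 thermald        /sys/devices/virtual/thermal/thermal_zone1/temp
"""


def parse_fnotifystat(text):
    """Parse fnotifystat rows into dicts; header and malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 7)
        if len(parts) != 8 or parts[0] == "Total":
            continue
        try:
            reads, writes = float(parts[3]), float(parts[4])
            pid = int(parts[5])
        except ValueError:
            continue
        rows.append({"pid": pid, "process": parts[6],
                     "path": parts[7].strip(), "reads": reads, "writes": writes})
    return rows


def writers(rows):
    """Names of processes that wrote to any watched file."""
    return sorted({r["process"] for r in rows if r["writes"] > 0})


rows = parse_fnotifystat(SAMPLE)
print("processes writing under /sys:", writers(rows))
```

For the sample above the list of writers is empty, which fits the observation that nothing besides reads was captured in that interval.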

After testing this I did stop thermald and interestingly the behaviour of the CPU frequencies did not change. It still goes down to 1800MHz under load. constraint_0_power_limit_uw is unchanged at 35000000

Is there any detailed documentation of what thermald does, when it does it, and which actions it takes to control the CPU? So far I have seen that it manages "constraint_0_power_limit_uw" and raises the limit until it reaches 35000000.

benzea commented 2 years ago

Btw. my implication with setting the UUID to solve the issue is that the firmware is still controlling things.

i.e. setting that UUID flags to the firmware that the OS is doing proper thermal management. So if the UUID means "version X" but the firmware is expecting "version Y" or so, then I can totally see that it isn't giving up thermal control.

btw. you might be able to tell more by looking at the ACPI code of the firmware and checking what happens for various UUIDs on the INTC1040 device.

sebastianha commented 2 years ago

Unfortunately this is beyond my level of knowledge. But I would be happy to provide everything needed to solve this issue!

Grtschnk commented 2 years ago

@benzea Thank you for looking at this and the explanation! I've had a look in the /sys directory. /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/available_uuids contains UNKNOWN. Is that a possible hint as to what is wrong with our systems? The current_uuid is as you said: 63BE270F-1C11-48FD-A6F7-3AF253FF3E2D

sebastianha commented 2 years ago

Same here on Dell 7320

sebastianha commented 2 years ago

Interesting fact: After a fresh reboot "current_uuid" is set to "INVALID". After starting thermald it shows "63BE270F-1C11-48FD-A6F7-3AF253FF3E2D"

# cat /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/current_uuid
INVALID
# cat /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/available_uuids
UNKNOWN
# systemctl start thermald.service 
# cat /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/current_uuid
63BE270F-1C11-48FD-A6F7-3AF253FF3E2D

sebastianha commented 2 years ago

Is there any way I can prove that thermald has control over the CPU? My current guess is, that thermald sets the "current_uuid" and "constraint_0_power_limit_uw" but has no full control.

When starting thermald I get the following log:

systemd[1]: Starting Thermal Daemon Service...
systemd[1]: Started Thermal Daemon Service.
thermald[11087]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported conditions are present
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
thermald[11087]: Polling mode is enabled: 4
thermald[11087]: sensor id 10 : No temp sysfs for reading raw temp
thermald[11087]: sensor id 10 : No temp sysfs for reading raw temp
thermald[11087]: sensor id 10 : No temp sysfs for reading raw temp
thermald[11087]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
thermald[11087]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
thermald[11087]: Unable to find a zone for TSSD

--exclusive-control does not change anything.

Is there anything which prevents thermald from working correctly in the log?
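Because most of the log is the same few messages repeated, it can help to collapse it into distinct messages with counts before comparing machines. A minimal Python sketch follows; it assumes the "thermald[PID]: message" journal format shown above, and the function name is invented.

```python
import re
from collections import Counter

LINE_RE = re.compile(r"^thermald\[\d+\]:\s*(.*)$")

# A few lines copied from the log above.
SAMPLE = r"""
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported condition 57 (UKNKNOWN)
thermald[11087]: Unsupported conditions are present
thermald[11087]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[11087]: Polling mode is enabled: 4
thermald[11087]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
"""


def summarize(log_text):
    """Count how often each distinct thermald message occurs."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


for message, count in summarize(SAMPLE).most_common():
    print(f"{count:3d}  {message}")
```

On the full log this reduces the output to a handful of distinct messages: the unsupported condition 57, the missing NGFF sensor, the missing temp sysfs for sensor id 10, the PPCC limit warning and the missing TSSD zone.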

saevarb commented 2 years ago

I just got this new work laptop and tracked down this issue. It's a 7420.

Following the advice in this thread, I have installed thermald (2.4.6), and I too can confirm that it will not go higher than 1800MHz under load (except very briefly in the beginning). I've also turned on performance mode in the power settings in the BIOS and updated all firmware.

Here is my output starting thermald:

thermald[373]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
thermald[373]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
thermald[373]: sensor id 10 : No temp sysfs for reading raw temp
thermald[373]: sensor id 10 : No temp sysfs for reading raw temp
thermald[373]: sensor id 10 : No temp sysfs for reading raw temp
thermald[373]: Polling mode is enabled: 4
thermald[373]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unable to find a sensor for \_SB_.PC00.LPCB.ECDV.NGFF
thermald[373]: Unsupported conditions are present
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)
thermald[373]: Unsupported condition 57 (UKNKNOWN)

More output:

$ cat /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/current_uuid
63BE270F-1C11-48FD-A6F7-3AF253FF3E2D
$ cat /sys/bus/acpi/devices/INTC1040:00/physical_node/uuids/available_uuids
UNKNOWN

Is it reasonable to wait for a workaround/fix for this, or should I just be returning this laptop?

zamazan4ik commented 2 years ago

@saevarb I suggest returning the laptop and picking something non-Dell (like a Lenovo ThinkPad).

sebastianha commented 2 years ago

@saevarb Looks exactly like it does on my 7320. Currently I do not know of any fix; presumably Dell has to fix this with a BIOS update.

saevarb commented 2 years ago

After disabling intel_pstate via kernel parameters, I am now using the acpi-cpufreq driver. Unfortunately, cpupower frequency-info now returns a max frequency of 1800MHz:

analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us
  hardware limits: 400 MHz - 1.80 GHz
  available frequency steps:  1.80 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz, 1.10 GHz, 1000 MHz, 900 MHz, 800 MHz, 700 MHz, 600 MHz, 500 MHz, 400 MHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 400 MHz and 1.80 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: 1.80 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

So I guess I'm going back to using the intel_pstate driver for now.
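The same driver and limit information can be read straight from the cpufreq sysfs attributes, which makes it easy to compare before and after switching drivers. This is a small Python sketch, assuming the standard cpufreq policy directory layout (values in kHz); the function name is hypothetical.

```python
from pathlib import Path


def read_policy(policy_dir="/sys/devices/system/cpu/cpufreq/policy0"):
    """Return the scaling driver and frequency limits (in MHz) for one policy."""
    p = Path(policy_dir)

    def mhz(name):
        return int((p / name).read_text()) / 1000  # sysfs values are in kHz

    return {
        "driver": (p / "scaling_driver").read_text().strip(),
        "hw_min_mhz": mhz("cpuinfo_min_freq"),
        "hw_max_mhz": mhz("cpuinfo_max_freq"),
        "policy_max_mhz": mhz("scaling_max_freq"),
    }


if __name__ == "__main__":
    try:
        print(read_policy())
    except OSError as err:
        print("cpufreq policy not readable:", err)
```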

It would be super useful if any intel people could tell us where they think the problem is and what is required to fix it. If I need to wait 6 months for a BIOS update, then I need to return this computer. If there is a possible workaround or a shorter timeline for a fix, then I may be willing to wait.

@majanes-intel Is that something you could shed some light on?

errogaht commented 2 years ago

It looks like Intel does not care about this. I've returned my Dell 5421 with the i7-11850H, so I recommend you do the same while it is still possible. I will try to avoid Intel CPUs in the future because of this bug and because it seems they do not care.

sebastianha commented 2 years ago

I am also afraid that this is something where Intel only could provide a workaround, a real fix has to be done by Dell.

Here is a forums thread from Lenovo where – as far as I understand – a BIOS update was the solution: https://forums.lenovo.com/t5/Other-Linux-Discussions/X1-Carbon-Gen-9-severe-throttling/m-p/5092264 At least users report that they see temperatures >90° under load. But might be a different story.

benzea commented 2 years ago

As I understand it, the UUID is the protocol that tells the firmware about the OS level thermal management that is running.

Lenovo's approach is to not use OS thermal management but rather implement it inside the firmware.

Either way, I think that what is needed here is feedback from DELL. Or, lacking that, reverse engineering the ACPI code a bit in order to understand what is going on on the platform. As Sebastian was saying, Intel isn't directly at fault here (indirectly yes, they should be opening up the specifications sufficiently instead of having the half working "adaptive" implementation in thermald that was reverse engineered rather than having been developed by Intel).

sebastianha commented 2 years ago

I have a support case open at Dell, as the 7320 should be "officially supported" by Ubuntu. I will report back if anything happens.

saevarb commented 2 years ago

I went ahead and googled the UUID in question, and that led me to this: https://github.com/mjg59/thermal_daemon

I managed to build it and try to run it, but it uses the wrong hard-coded path:

$ sudo ./thermald --adaptive --no-daemon
[1638975303][WARN]sysfs read failed /sys/bus/platform/devices/INT3400:00/firmware_node/path
[1638975303][ERR]Unable to locate INT3400 firmware path
[1638975303][ERR]THD engine start failed
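The failure above comes from looking for exactly one device ID. A sketch of the alternative, probing a short list of candidate IDs, is given below in Python for illustration only (thermald itself is not written this way). It assumes that the INTC1040:00 device seen elsewhere in this thread also appears under /sys/bus/platform/devices with a firmware_node/path file; the function and constant names are made up.

```python
from pathlib import Path

# Candidate device IDs: the one hard-coded in that fork and the one on these laptops.
CANDIDATE_IDS = ("INT3400:00", "INTC1040:00")


def find_thermal_device(base="/sys/bus/platform/devices", candidates=CANDIDATE_IDS):
    """Return (device_id, acpi_path) for the first candidate present, else None."""
    for dev in candidates:
        path_file = Path(base) / dev / "firmware_node" / "path"
        if path_file.is_file():
            return dev, path_file.read_text().strip()
    return None


if __name__ == "__main__":
    print(find_thermal_device() or "no known thermal device found")
```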

I went ahead and just changed the hard-coded path everywhere to see if there was a chance it would work, but unfortunately:

$ sudo ./thermald --adaptive --no-daemon
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Unsupported condition 57
[1638975574][ERR]Exiting due to unsupported conditions
[1638975574][ERR]Unable to verify conditions are supported
[1638975574][ERR]THD engine start failed

Is there any documentation to be found on these "conditions"? I don't even know where to begin searching for info on this. Intel processor manuals?

sebastianha commented 2 years ago

Perhaps a noticeable thing I found: Phoronix tested the Dell XPS 13 9310 with a i7 1185G7 and it performed well. So it cannot be a problem of the CPU.

https://www.phoronix.com/scan.php?page=article&item=intel-corei7-1185g7

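Does someone have a Dell XPS 13 9310 and can post:

  • ThermalD logs
  • current_uuid (with and without thermald)
  • available_uuids
  • temperature reading under load
  • constraint_0_power_limit_uw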

Edit: There are also some posts out there that the XPS has the same throttling problems... So how did Phoronix do this?

Edit2: On the Ubuntu homepage (https://ubuntu.com/certified/202010-28320 for 7420) it says:

"Pre-installed in some regions with a custom Ubuntu image that takes advantage of the system’s hardware features and may include additional software. Standard images of Ubuntu may not work well, or at all."

Does someone know where to get these images and what the changes are? Perhaps there is the "secret" from Phoronix?

spamik commented 2 years ago

A new BIOS for the Latitude 7x20 (1.13.0) was released yesterday. The release notes only mention something about memory frequency, but I'll still try to update during the day and see if it changes anything.

Grtschnk commented 2 years ago

I tried the 1.13 BIOS and 5.15.7 kernel earlier. Still no luck.

spamik commented 2 years ago

Is there some unofficial BIOS repository for Dell? I check the official support site regularly, and yesterday it still showed version 1.12.2.

sebastianha commented 2 years ago

Here is 1.13.0: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=73xky&oscode=wt64a&productcode=latitude-13-7320-2-in-1-laptop

I will test it later.

ColinIanKing commented 2 years ago

I've had issues in the past where a BIOS upgrade needed one to do a "reset-to-defaults" in the BIOS setup to get configured correctly. Perhaps this is worth trying just in case there is a mismatch between the old config values and the ones for the new BIOS. I've seen this happen on two brands of laptops that I worked on while doing enablement work for Ubuntu.

spamik commented 2 years ago

Well, I updated it just now, did a reset to factory defaults and booted the system with 5.15.6. It's still not working like it should, but something changed: when I stress the system it stays at 1800MHz, but sometimes it scales up to 3GHz. That didn't happen previously. But it's still throttled, with temperatures around 50°C.

spamik commented 2 years ago

Another interesting thing now: I tried stopping thermald and running throttled. With throttled it runs at 3GHz almost all the time but sometimes drops to 400MHz (before the update it ran at 400MHz with throttled all the time). Well, something changed :-)

VitaliiSerdiuk commented 2 years ago

@spamik Probably because of the latest changes in the Linux kernel: Subbaraman Narayanamurthy (1): thermal: Fix NULL pointer dereferences in of_thermal_ functions

spamik commented 2 years ago

Yeah, but I had the same kernel version before the BIOS update, and with it the CPU was stuck at 1800MHz. But maybe the difference is in some BIOS settings after the factory defaults (now I have SpeedStep and Speed Shift enabled; previously I think I had disabled both).

VitaliiSerdiuk commented 2 years ago

@sebastianha

Perhaps a noticeable thing I found: Phoronix tested the Dell XPS 13 9310 with a i7 1185G7 and it performed well. So it cannot be a problem of the CPU.

https://www.phoronix.com/scan.php?page=article&item=intel-corei7-1185g7

Does someone have a Dell XPS 13 9310 and can post:

  • ThermalD logs
  • current_uuid (with and without thermald)
  • available_uuids
  • temperature reading under load
  • constraint_0_power_limit_uw

Edit: There are also some posts out there that the XPS has the same throttling problems... So how did Phoronix do this?

Edit2: On the Ubuntu homepage (https://ubuntu.com/certified/202010-28320 for 7420) it says:

"Pre-installed in some regions with a custom Ubuntu image that takes advantage of the system’s hardware features and may include additional software. Standard images of Ubuntu may not work well, or at all."

Does someone know where to get these images and what the changes are? Perhaps there is the "secret" from Phoronix?

Try to search http://dell.archive.canonical.com/updates/dists/ or some similar dist

saevarb commented 2 years ago

It looks like ubuntu is using a special OEM kernel, as mentioned here: https://ubuntu.com/certified/202010-28320 (some extra info here https://wiki.ubuntu.com/Kernel/OEMKernel)

It seems fairly likely that the fixes might be in there. The source can be found here: https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oem/+git/focal?h=oem-5.10

I haven't compiled a kernel from scratch in a good while, but I may take a shot at it this evening.

spamik commented 2 years ago

I read somewhere (not sure where now...) that this issue also affects Dell laptops which already come preinstalled with Ubuntu... so unless there is some new patch in the OEM kernel, I'm afraid that it will have the same issue :-/

saevarb commented 2 years ago

Well, if that is the case, then that is bad news and good news. The good news being that Dell is shipping these out to people with preinstalled Linux not having properly vetted that it works, which makes it easier for a consumer to argue for a refund.

sebastianha commented 2 years ago

But then the question is: How did Phoronix manage to do the benchmarks? Also: For the Latitude 7320 I could not find the Linux option for a preinstalled image on the homepage (in Germany).

sebastianha commented 2 years ago

Just tested the BIOS 1.13.0: Nothing changed.

  • Without thermald: 400MHz under load
  • With thermald: 1800MHz under load

Grtschnk commented 2 years ago
  • current_uuid (with and without thermald)
  • available_uuids

Asked a colleague with an XPS 13 9310, and his output for the UUIDs is the same. (UNKNOWN and 63BE270F-1C11-48FD-A6F7-3AF253FF3E2D)

saevarb commented 2 years ago

@Grtschnk Does he have the same or similar problems?

Grtschnk commented 2 years ago

With thermald it's all good, without it throttling/cycling happens.

sebastianha commented 2 years ago

Can he confirm temperatures above 50-60°C? When running a stress test: at what frequency does the cpu settle down?
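To make such reports comparable between machines, the numbers can be sampled from sysfs while the stress test runs. This is a minimal Python sketch using the cpufreq and thermal zone paths that appear in the fnotifystat output earlier in this thread; the function names are made up, and the unit conversions assume the usual kHz and millidegree Celsius conventions of those files.

```python
from pathlib import Path


def read_freqs_mhz(cpufreq_dir="/sys/devices/system/cpu/cpufreq"):
    """Current frequency per cpufreq policy, in MHz."""
    freqs = {}
    for f in sorted(Path(cpufreq_dir).glob("policy*/scaling_cur_freq")):
        freqs[f.parent.name] = int(f.read_text()) / 1000  # kHz -> MHz
    return freqs


def read_temps_c(thermal_dir="/sys/devices/virtual/thermal"):
    """Temperature per thermal zone, in degrees Celsius."""
    temps = {}
    for f in sorted(Path(thermal_dir).glob("thermal_zone*/temp")):
        try:
            temps[f.parent.name] = int(f.read_text()) / 1000  # millidegrees -> degrees
        except (OSError, ValueError):
            continue  # some zones refuse reads or report no data
    return temps


if __name__ == "__main__":
    print("MHz:", read_freqs_mhz())
    print("°C: ", read_temps_c())
```

Running this every few seconds for several minutes under load would show where frequency and temperature settle.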

Grtschnk commented 2 years ago

I don't have the details, but I am assuming it works "normally", i.e. running hot when necessary. I described my issue to him, and it sounded like his was not as grave and was solved by thermald.

0x501D commented 2 years ago

Product Name: Latitude 5420
BIOS Revision: 1.13
Kernel: 5.15.6-gentoo-x86_64
thermald: sys-power/thermald-2.4.6
intel_pstate: enabled

Under a stress test (stress -c 8): temp ~70°C, frequency 2300 MHz, test duration ~1 hour.

sebastianha commented 2 years ago

Better than my 1800MHz but I would expect 3GHz and 95°C as long term values.

Which CPU do you have?