Closed berglh closed 4 years ago
Tried again with temperature set to 85000 and the same thermal shutdown occurred. I logged the temperature of the CPU and GPU during this run.
temp-gpu.log
On Fri, 2018-08-10 at 16:09 -0700, Berg Lloyd-Haig wrote:
I'm reporting a thermal shutdown event that seems very similar to the following, and I've tried similar troubleshooting steps: #158.
Main Problems:
Incorrect performance throttling results in Thermal Shutdown Event (100C on CPU) as reported in BIOS log
System fan during system idle powers at full speed for 1 second every minute or so (could also be GPU fan)
Here the CPU doesn't seem to be the issue. Your laptop skin temperature is 56C, which may be contributed by the Nvidia GPU, which we can't control via the CPU package. What is the form factor? The thermal tables don't call for any action until you reach 100C, which can be junk.
I suggest to run https://github.com/intel/dptfxtract
and check the auto generated table and send that. It will also generate some tables in the current working folder; you can attach them as well.
Thanks, Srinivas
Environment:
Operating System: Ubuntu 18.04.1 LTS
Thermald Version: 1.7
Ubuntu Thermald Package: 1.7.0-5ubuntu1
Kernel: 4.17.14-041714-generic (also present with 4.15.0-29, 4.15.0-30, 4.17.11-041711)
Nvidia Driver: 396.51-0ubuntu0~gpu18.04.1
TLP Package: 1.1-2ubuntu1
Details:
Initially, running Unigine_Valley, I could run with Turbo Boost enabled and not experience any thermal shutdown events. Now, under full load with Turbo Boost enabled, the laptop experiences a thermal shutdown event triggered by the BIOS. I tried to set my limit to 90 degrees C as per #158 and still experienced the thermal shutdown event; here is the debug log: thermald-debug.log
Upon starting the daemon, I receive error messages regarding reading of the thermal zone trip points:
Jul 02 22:10:32 bxps systemd[1]: Starting Thermal Daemon Service...
Jul 02 22:10:32 bxps thermald[763]: 22 CPUID levels; family:model:stepping 0x6:9e:a (6:158:10)
Jul 02 22:10:32 bxps thermald[763]: Polling mode is enabled: 4
Jul 02 22:10:32 bxps thermald[763]: Using generated /var/run/thermald/thermal-conf.xml.auto
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed constraint_0_max_power_uw
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_0_hyst
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_1_hyst
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_2_hyst
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_3_hyst
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_4_hyst
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_5_hyst
Jul 02 22:10:32 bxps thermald[763]: sysfs read failed /sys/class/thermal/thermal_zone10/trip_point_6_hyst
Catting all the values of all the files with sudo in the thermal_zone folder shows issues with reading the aforementioned files: thermalzonefiles.txt
The auto generated XML is as follows; as mentioned, I tried to change the temperature to 90C (90000), as is evident in the debug log:
<?xml version="1.0"?>
_TRT export XPS 15 9570 QUIET
B0D4 B0D4 * passive SEQUENTIAL rapl_controller 200 1
TSKN TSKN * passive SEQUENTIAL rapl_controller 200 1
TMEM TMEM * passive SEQUENTIAL rapl_controller 200 1
NGFF NGFF * passive SEQUENTIAL rapl_controller 200 1
TVGA TVGA * passive SEQUENTIAL rapl_controller 200 1
I'd appreciate assistance with this issue. On initial OS installation I didn't have this problem; I'm guessing some firmware or package update has broken correct thermal throttling of the CPU.
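The "sysfs read failed" messages above can be reproduced outside thermald by trying to read every trip point file of a zone. A minimal sketch, assuming the standard `/sys/class/thermal/thermal_zone*/trip_point_*` layout; the function name `read_trip_points` is hypothetical and not part of thermald:

```python
import glob
import os


def read_trip_points(zone_dir):
    """Return {file_name: value or error text} for one thermal zone directory."""
    results = {}
    for path in sorted(glob.glob(os.path.join(zone_dir, "trip_point_*"))):
        name = os.path.basename(path)
        try:
            with open(path) as handle:
                results[name] = handle.read().strip()
        except OSError as err:
            # The file is listed but cannot be read, which is the situation
            # the "sysfs read failed" log lines describe.
            results[name] = f"read failed: {err.strerror}"
    return results


if __name__ == "__main__":
    # Walk every thermal zone and print each trip point file with its value.
    for zone in sorted(glob.glob("/sys/class/thermal/thermal_zone*")):
        print(zone)
        for name, value in read_trip_points(zone).items():
            print(f"  {name}: {value}")
```

Running it with and without sudo may show whether the failures are permission-related or whether the firmware simply does not back those files.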
Thank you @spandruvada.
This is a laptop; the outside case is aluminium. The power MOSFETs are passively cooled (no heatsink), and the GPU and CPU seem to share the same heatpipes.
Looking in the BIOS log, the error it gave was Critical Shutdown temperature, so it doesn't actually indicate which sensor caused the machine to power off, as you suggest. In some of the CPU monitoring logs I've produced using sensors, I have seen the CPU hit 100 degrees, but if the thermald debug logs are not showing anything related to CPU temperature at the point of shutdown, then you must be right that another sensor is getting too hot.
Here is the auto generated table by dptfxtract:
<?xml version="1.0"?>
<!-- BEGIN -->
<ThermalConfiguration>
<Platform>
<Name> Auto generated </Name>
<ProductName>XPS 15 9570</ProductName>
<Preference>QUIET</Preference>
<ThermalZones>
<ThermalZone>
<Type>auto_zone_0</Type>
<TripPoints>
<TripPoint>
<SensorType>NGFF</SensorType>
<Temperature>0</Temperature>
<Type>Passive</Type>
<CoolingDevice>
<Type>B0D4</Type>
<SamplingPeriod>0</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>auto_zone_1</Type>
<TripPoints>
<TripPoint>
<SensorType>TSKN</SensorType>
<Temperature>0</Temperature>
<Type>Passive</Type>
<CoolingDevice>
<Type>B0D4</Type>
<SamplingPeriod>0</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>auto_zone_2</Type>
<TripPoints>
<TripPoint>
<SensorType>TVGA</SensorType>
<Temperature>0</Temperature>
<Type>Passive</Type>
<CoolingDevice>
<Type>B0D4</Type>
<SamplingPeriod>0</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>auto_zone_3</Type>
<TripPoints>
<TripPoint>
<SensorType>TMEM</SensorType>
<Temperature>0</Temperature>
<Type>Passive</Type>
<CoolingDevice>
<Type>B0D4</Type>
<SamplingPeriod>0</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>auto_zone_4</Type>
<TripPoints>
<TripPoint>
<SensorType>B0D4</SensorType>
<Temperature>0</Temperature>
<Type>Passive</Type>
<CoolingDevice>
<Type>B0D4</Type>
<SamplingPeriod>0</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
</ThermalZones>
</Platform>
</ThermalConfiguration>
<!-- END -->
All files were generated using the ACPI tools in dptfxtract, as per your request: dptfxtract.zip
This whole table contains junk. Let me see if we have this laptop, or get help from Dell.
Thanks, Srinivas
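One visible symptom in the dptfxtract table above is that every trip point has a Temperature of 0. Whether that is exactly what "junk" refers to is an assumption, but a check for it is easy to sketch. The function name `find_suspect_trip_points` and the "greater than zero" rule are illustrative choices, not thermald behaviour:

```python
import xml.etree.ElementTree as ET


def find_suspect_trip_points(xml_text):
    """Return (sensor, raw_temperature) pairs whose temperature is missing,
    non-numeric (for example "*") or not greater than zero."""
    root = ET.fromstring(xml_text)
    suspects = []
    for trip in root.iter("TripPoint"):
        sensor = (trip.findtext("SensorType") or "").strip()
        raw = (trip.findtext("Temperature") or "").strip()
        try:
            usable = int(raw) > 0
        except ValueError:
            usable = False
        if not usable:
            suspects.append((sensor, raw))
    return suspects


# Two trip points in the shape used in this thread: one with the zero
# temperature from the generated table, one with a hand-edited value.
SAMPLE = """<ThermalConfiguration><Platform><ThermalZones>
<ThermalZone><Type>auto_zone_0</Type><TripPoints><TripPoint>
<SensorType>NGFF</SensorType><Temperature>0</Temperature><Type>Passive</Type>
</TripPoint></TripPoints></ThermalZone>
<ThermalZone><Type>B0D4</Type><TripPoints><TripPoint>
<SensorType>B0D4</SensorType><Temperature>90000</Temperature><type>passive</type>
</TripPoint></TripPoints></ThermalZone>
</ThermalZones></Platform></ThermalConfiguration>"""

if __name__ == "__main__":
    for sensor, raw in find_suspect_trip_points(SAMPLE):
        print(f"{sensor}: temperature {raw!r} is not a usable trip point")
```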
Until we find a better solution, try this:
@spandruvada :pray:
The system did thermal throttle correctly and didn't thermal shutdown. For your reference, I obviously didn't let the auto-config configure correctly, as this was still in QUIET mode. However, it did run and I didn't experience a thermal shutdown event. I missed configuring the TVGA value; however, this didn't appear in the ZONE DUMP BEGIN section of my previous debug log.
I will try this again with the original config and higher temperature thresholds. The clamping is very aggressive and creates considerable lag in the system as it tries to get the components below the aforementioned threshold.
To note, the 3D benchmark run earlier with Turbo Boost disabled gave the performance below. As I understand it, disabling Turbo Boost limited the clock speed of the CPU to 2900 MHz and also prevented any GPU overclocking:
Metric | Value |
---|---|
FPS: | 73.7 |
Score: | 3082 |
Min FPS: | 30.8 |
Max FPS: | 124.8 |
With the following config, it ran with this performance, so the throttling may be a bit on the aggressive side considering Turbo Boost is enabled. I would expect the performance envelope to be the same as, if not better than, the above results:
Metric | Value |
---|---|
FPS: | 57.7 |
Score: | 2415 |
Min FPS: | 10.7 |
Max FPS: | 116.3 |
<?xml version="1.0"?>
<ThermalConfiguration>
<Platform>
<Name>_TRT export</Name>
<ProductName>XPS 15 9570 </ProductName>
<Preference>QUIET</Preference>
<ThermalZones>
<ThermalZone>
<Type>B0D4</Type>
<TripPoints>
<TripPoint>
<SensorType>B0D4</SensorType>
<Temperature>90000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TSKN</Type>
<TripPoints>
<TripPoint>
<SensorType>TSKN</SensorType>
<Temperature>50000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TMEM</Type>
<TripPoints>
<TripPoint>
<SensorType>TMEM</SensorType>
<Temperature>50000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>NGFF</Type>
<TripPoints>
<TripPoint>
<SensorType>NGFF</SensorType>
<Temperature>50000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TVGA</Type>
<TripPoints>
<TripPoint>
<SensorType>TVGA</SensorType>
<Temperature>*</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
</ThermalZones>
</Platform>
</ThermalConfiguration>
Here is the debug log from thermald: thermald-debug.log
Here is the temperature and core MHz logging during the benchmark period: temp-gpu.log
Thank you for your help, I hope we can get to the bottom of this.
Looking through the temp-gpu.log, you'll see the CPU still got very hot during this run:
2018-08-15T09:43:49+10:00
Attribute 'GPUAdaptiveClockState' (bxps:0[gpu:0]): 1.
Attribute 'GPUCoreTemp' (bxps:0[gpu:0]): 65.
Attribute 'GPUCurrentClockFreqs' (bxps:0[gpu:0]): 1670,3504.
Attribute 'GPUCurrentClockFreqsString' (bxps:0[gpu:0]): nvclock=1670,
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +99.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +99.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +90.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +77.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +79.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +70.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +74.0°C (high = +100.0°C, crit = +100.0°C)
In the very next second, though, the temperatures are near 90C:
2018-08-15T09:43:50+10:00
Attribute 'GPUAdaptiveClockState' (bxps:0[gpu:0]): 1.
Attribute 'GPUCoreTemp' (bxps:0[gpu:0]): 65.
Attribute 'GPUCurrentClockFreqs' (bxps:0[gpu:0]): 1670,3504.
Attribute 'GPUCurrentClockFreqsString' (bxps:0[gpu:0]): nvclock=1670,
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +91.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +85.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +78.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +81.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +91.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +72.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +78.0°C (high = +100.0°C, crit = +100.0°C)
I don't think this is the cause of the large performance hit. A few seconds after the start of the benchmark, the GPU was running at 1683 MHz; after the package reached 55 degrees, the clock speed was clamped a lot. The nvidia driver reports a max clock of 1911 for the package. I think this might be the larger cause, if the clamping for this sensor was set at 50 degrees. Of course, I would need to know what the safe temperature is for the GPU to make a good judgement. Do you have any suggestions?
Perhaps I could disable turbo boost and use the default thermald config and then log the temperature and MHz of the GPU to see what maximum temperature it hits?
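Finding the peak readings in a log such as temp-gpu.log can be scripted. A minimal sketch, assuming the log interleaves nvidia-settings attribute lines and sensors output in the format quoted above; the function name `peak_temperatures` is hypothetical:

```python
import re

# Matches sensors lines such as "Core 0: +99.0°C (high = ...)".
CORE_RE = re.compile(r"^(Package id \d+|Core \d+):\s+\+([\d.]+)°C")
# Matches nvidia-settings lines such as "Attribute 'GPUCoreTemp' (...): 65."
GPU_RE = re.compile(r"^Attribute 'GPUCoreTemp' \(.*\): (\d+)\.")


def peak_temperatures(lines):
    """Return the highest temperature seen per label across all samples."""
    peaks = {}
    for line in lines:
        line = line.strip()
        match = CORE_RE.match(line)
        if match:
            label, value = match.group(1), float(match.group(2))
        else:
            match = GPU_RE.match(line)
            if not match:
                continue
            label, value = "GPUCoreTemp", float(match.group(1))
        peaks[label] = max(value, peaks.get(label, value))
    return peaks


# Lines copied from the two samples quoted in this thread.
SAMPLE = """Attribute 'GPUCoreTemp' (bxps:0[gpu:0]): 65.
Package id 0: +99.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +99.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +79.0°C (high = +100.0°C, crit = +100.0°C)
Attribute 'GPUCoreTemp' (bxps:0[gpu:0]): 65.
Package id 0: +91.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +85.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +91.0°C (high = +100.0°C, crit = +100.0°C)
""".splitlines()

if __name__ == "__main__":
    for label, value in sorted(peak_temperatures(SAMPLE).items()):
        print(f"{label}: {value:.1f}")
```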
I didn't see powerclamp getting used in your log. Make sure that you copy the file to /etc/thermald after edit and check in the log file that your config is getting used. I still see the old config.
Thanks, Srinivas
Sorry, I see the new config in the Dumping parsed XML Data section; is there another section in the log?
I edited the auto generated config and put in the intel_powerclamp settings and processor limits. Did I do this incorrectly? I'll try as you suggest.
No matter what I do, this build never seems to attempt to read the /etc/thermald/thermal-conf.xml file. Looking at the man page for this file, it also doesn't list intel_powerclamp as an option:
thermal-conf.xml(5) File Formats Manual thermal-conf.xml(5)
NAME
thermal-conf.xml - Configuration file for thermal daemon
SYNOPSIS
$(TDCONFDIR)/etc/thermald/thermal-conf.xml
...
A cooling device can be either active or passive. An example of an active device is a FAN, which will not reduce performance at the cost of consuming more power and noise. A passive device uses performance throttling to control temperature. In addition to cooling devices present in the thermal sysfs, the following cooling devices are built into the thermald, which can be used as valid cooling device type:
- rapl_controller
- intel_pstate
- cpufreq
- LCD
However, it was listed in /etc/thermald/thermal-cpu-cdev-order.xml. I've just updated the preference order now:
<!--
Specifies the order of compensation to cool CPU only.
There is a default already implemented in the code, but
this file can be used to change order
The Following cooling device can present
-->
<CoolingDeviceOrder>
<!-- Specify Cooling device order -->
<CoolingDevice>intel_powerclamp</CoolingDevice>
<CoolingDevice>rapl_controller</CoolingDevice>
<CoolingDevice>intel_pstate</CoolingDevice>
<!-- <CoolingDevice>intel_powerclamp</CoolingDevice> -->
<CoolingDevice>cpufreq</CoolingDevice>
<CoolingDevice>Processor</CoolingDevice>
</CoolingDeviceOrder>
I think you are working on the latest version from Ubuntu. You can copy the modified thermal-conf.xml.auto to /etc/thermald. It will print which file path it is using.
The contents can be.
<?xml version="1.0"?>
Looks like the file contents are distorted.
Hi @spandruvada, I am pretty certain there must be a bug with this build. Neither /etc/thermald/thermal-conf.xml.auto nor /etc/thermald/thermal-conf.xml is loaded by the daemon.
I can replace /var/run/thermald/thermal-conf.xml.auto and it will use the values I specified there.
After extracting the config you kindly supplied to /etc/thermald/thermal-conf.xml, this is the journalctl output. This is the same even if I remove the /var/run/thermald/thermal-conf.xml.auto file.
Aug 16 06:34:29 bxps systemd[1]: Starting Thermal Daemon Service...
Aug 16 06:34:29 bxps systemd[1]: Started Thermal Daemon Service.
Aug 16 06:34:29 bxps thermald[3509]: 22 CPUID levels; family:model:stepping 0x6:9e:a (6:158:10)
Aug 16 06:34:29 bxps thermald[3509]: Polling mode is enabled: 4
Aug 16 06:34:29 bxps thermald[3509]: Using generated /var/run/thermald/thermal-conf.xml.auto
Aug 16 06:34:29 bxps thermald[3509]: sysfs read failed constraint_0_max_power_uw
Aug 16 06:34:29 bxps thermald[3509]: XML zone: invalid sensor type TVGA
Aug 16 06:34:29 bxps thermald[3509]: Zone update failed: unable to bind
The binary running in debug mode lets us specify the path to the config file, so I tried debug mode with the config file option specified: sudo thermald --no-daemon --loglevel=debug --config-file=/etc/thermald/thermal-conf.xml.auto
thermald-debug-spandruvada.log
So, I ran the benchmark with thermald in debug mode using the config you supplied, and received similar performance results to the last time it completed successfully. There is a major drop in performance with the thermal throttling compared with running the processor without Turbo Boost enabled. I had no issues with Turbo Boost on the Dell XPS 9560. Of course, this is a much larger 6 core package with a higher TDP, so I don't find it surprising that thermal throttling would occur. I can only imagine the cooling solution is not enough to keep the TSKN and TMEM temperatures from reaching 50 degrees plus when the processor is running at 90~99C and the GPU at 75C for a short amount of time; it seems understandable. Of course, you're the expert, so I trust your analysis of the debug logs implicitly, and thank you again for your help.
Metric | Value |
---|---|
FPS: | 56.8 |
Score: | 2375 |
Min FPS: | 11.9 |
Max FPS: | 113.6 |
Here is the thermald debug log for this run: thermald-debug-spandruvada.log
Here is my temperature logging using sensors and nvidia-settings: temp-gpu.log
Looks like you have an older version, but you can always use --config-file to force the config. I see from the logs that the throttling action is happening. I suggest increasing the temperature limit of the CPU to 95000 and the others by a few degrees, using --poll-interval=1 on the command line, and checking again.
I'm running the latest repo version for Bionic, 1.7.0-5ubuntu1. I noticed the releases on this GitHub repo are at 1.7.2 though, and the Cosmic experimental build is only 1.7.0-8ubuntu1. Not sure if it's worth building the latest version off GitHub.
I'll try out your recommendations, although my suspicion is any time the CPU or GPU get that hot, is when the performance throttling kicks in and really stalls performance to get TSKN/TMEM back into an acceptable range.
I will try disabling Turbo Boost and seeing how hot all the components get during that run. Maybe I should work my way up from that baseline in terms of temperature throttling? It might be easier to keep it all under control if I'm throttling earlier and less aggressively.
Ok, some interesting results from disabling turbo boost and running debug mode on thermald and logging temperatures. I guess in general day-to-day loads I'd like to be running in turbo-boost mode but in sustained performance modes going back to balanced-performance without turboboost seems to make sense. I'm going to enable turbo-boost again and do testing at similar conservative max temperatures in thermald.
Benchmark
Metric | Value |
---|---|
FPS: | 72.1 |
Score: | 3017 |
Min FPS: | 29.1 |
Max FPS: | 123.8 |
Thermal Daemon Temps
Sensor | Max Value (millidegrees C) |
---|---|
NGFF | 36000 |
TSKN | 51000 |
TMEM | 49000 |
B0D4 | 73000 |
Sensors/Nvidia Temps
Device | Max Temp |
---|---|
Core | 76.0 |
GPU | 67.0 |
Frequency
Device | Max Frequency |
---|---|
CPU | 2903.125 |
GPU | 1683 |
GPU Mem | 3504 |
So I've run the benchmark using thermald debug with polling interval set at 1.
sudo thermald --no-daemon --loglevel=debug --poll-interval=1 --config-file=/etc/thermald/thermal-conf.xml.auto
The config is set to similar max temps experienced in no-turboboost mode:
<?xml version="1.0"?>
<ThermalConfiguration>
<Platform>
<Name>_TRT export</Name>
<ProductName>XPS 15 9570</ProductName>
<Preference>QUIET</Preference>
<ThermalZones>
<ThermalZone>
<Type>B0D4</Type>
<TripPoints>
<TripPoint>
<SensorType>B0D4</SensorType>
<Temperature>75000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>300</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TSKN</Type>
<TripPoints>
<TripPoint>
<SensorType>TSKN</SensorType>
<Temperature>52000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TMEM</Type>
<TripPoints>
<TripPoint>
<SensorType>TMEM</SensorType>
<Temperature>50000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>NGFF</Type>
<TripPoints>
<TripPoint>
<SensorType>NGFF</SensorType>
<Temperature>40000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_powerclamp</type>
<influence>100</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
</ThermalZones>
</Platform>
</ThermalConfiguration>
One problem with this is that the CPU still hits a maximum temperature of 96 degrees. When thermald throttles the CPU, it drops the clock to the lowest setting of 800 MHz:
cpu MHz : 4291.144
cpu MHz : 3543.074
cpu MHz : 3438.912
cpu MHz : 3595.453
cpu MHz : 3534.619
cpu MHz : 3593.113
cpu MHz : 3449.394
cpu MHz : 3430.216
cpu MHz : 3561.871
cpu MHz : 3385.513
cpu MHz : 3559.620
cpu MHz : 3453.993
cpu MHz : 3585.697
cpu MHz : 800.041
cpu MHz : 800.004
cpu MHz : 800.034
cpu MHz : 800.043
cpu MHz : 800.007
cpu MHz : 800.009
cpu MHz : 800.007
cpu MHz : 800.040
cpu MHz : 800.036
cpu MHz : 800.063
cpu MHz : 800.026
cpu MHz : 800.038
cpu MHz : 4344.477
cpu MHz : 4230.772
cpu MHz : 3914.378
cpu MHz : 4258.803
cpu MHz : 3749.746
cpu MHz : 3673.201
cpu MHz : 3982.862
There are lots of examples of this during the run, where the throttled clock lands on 800, 1200, 2200 or 2800 MHz. I'm only sampling this every second, so I might be missing other temporary reductions in clock. In the no-turboboost run, the clock speed can stay at 2900 MHz consistently during the benchmark. While the boost goes well over 4000 MHz during the benchmark, the constant throttling back to 2000 MHz or below really slows the average system performance down. Of course some of this will be the observer effect, but since I did the same debugging/logging in the no-turboboost run in my previous comment, the two runs can be compared one to one.
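How much of a run is spent throttled can be estimated from the `cpu MHz` samples. A minimal sketch, assuming one sample per line in the format shown above and using the 2900 MHz no-turboboost clock from this thread as the threshold; the function names are hypothetical:

```python
from statistics import mean


def parse_mhz(lines):
    """Extract the numeric value from lines of the form 'cpu MHz : 800.041'."""
    values = []
    for line in lines:
        key, _, value = line.partition(":")
        if key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def throttled_fraction(samples, threshold_mhz):
    """Fraction of samples below the threshold (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return sum(1 for mhz in samples if mhz < threshold_mhz) / len(samples)


# Four samples copied from the listing above.
SAMPLE = [
    "cpu MHz : 3585.697",
    "cpu MHz : 800.041",
    "cpu MHz : 800.004",
    "cpu MHz : 4344.477",
]

if __name__ == "__main__":
    samples = parse_mhz(SAMPLE)
    print(f"mean clock: {mean(samples):.1f} MHz")
    print(f"below 2900 MHz: {throttled_fraction(samples, 2900.0):.0%}")
```

A mean clock below 2900 MHz would be consistent with the turbo run scoring worse than the no-turboboost run, although one-second sampling can miss short dips.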
We know when the TSKN or TMEM temperature approaches 55 to 58 degrees, the machine shuts down for thermal protection. I'm not faulting the Intel or Nvidia hardware at all for their operation, as this doesn't appear to be the cause of the shutdown, but the active cooling in the Dell laptop is insufficient to support the thermal safeguard protection for the TSKN/TMEM sensors when operating at max performance mode.
I'm not sure this is the job of thermald to solve. My thought is that when a consistently high load is detected, rather than trying to reach the highest turbo boost every time the temperature drops back under the threshold, it should recognise that aggressive throttling hurts average system performance. The max frequency should be capped at an equilibrium point where the CPU temperature doesn't exceed the thermal limits so aggressively that the CPU has to clock back to 800 MHz just to return the package to a particular desired temperature.
Day-to-day computing is a lot more sporadic in nature, and this is where turbo boost is so nice: short bursts of computing performed at very high speed. There is no problem in this area. But given turbo boost under consistent load, and with Speed Shift allowing the software layer to manage it, I wonder if Dell are pushing the responsibility for thermal management back to the OS. I'm unsure if this should be fixed in firmware/BIOS or by an improvement in thermald.
I can go down the path of getting thermal pads for the MOSFETs, improving the thermal compound on the CPU/GPU packages and undervolting, which might result in TSKN and TMEM maintaining acceptable temperatures in full turbo-boost mode, but this feels like I'm trying to fix a problem with Dell's active cooling design rather than having them address the design choices in firmware.
I was turned away by Dell support because I'm running Ubuntu and not the operating system that was bought with the system. There are Windows users experiencing the same problem, and they seem to fall back to running the Dell Power Management Software, in which they set the system to run without Turbo Boost enabled during gaming loads.
Anyway, I'm not sure there is much more I can do. I can do as you suggest and keep raising the temperature limits, but we know eventually the TSKN/TMEM will result in thermal shutdown. And even in these scenarios, the aggressive changes in clock affect system performance so badly that it's not really an option to use turbo boost in sustained workloads, which as a customer of Dell high-end components is disappointing.
Metric | Value |
---|---|
FPS: | 49.0 |
Score: | 2051 |
Min FPS: | 13.0 |
Max FPS: | 115.9 |
Sensor | Max Value (millidegrees C) |
---|---|
NGFF | 39000 |
TSKN | 51000 |
TMEM | 49000 |
B0D4 | 96000 |
Thermald debug log: thermald-debug-turboboost-low-temps-poll-1.log
CPU/GPU sensor logs: temp-gpu.log
I'm currently experimenting with controlling the max_pct feature of the intel_pstate driver. By doing so I can limit the max turbo boost frequency, which should limit heat transfer to the other components. Keeping the CPU at a lower max temperature should hopefully maintain stability.
For instance, I ran max_pct at 64 without thermald running and that limited my CPU frequency to 3100 MHz. Doing so limited my CPU package temperature to 83 degrees.
I think the best I can do at this stage is to come up with a sustained performance profile and use thermald as a safety net to thermally throttle the system should the TSKN/TMEM get too hot. I'm going to experiment with max_pct and these sensor values and try to find something consistently maintainable.
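The relationship between the percentage cap and the resulting frequency can be sketched as below. This assumes that "max_pct" refers to the intel_pstate sysfs attribute `max_perf_pct`, and that the CPU's maximum turbo frequency is 4800 MHz; neither is stated in the thread, and the real cap is only approximately proportional. Writing the attribute needs root, so the write helper takes the path as a parameter:

```python
# Assumed location of the intel_pstate cap (the thread only says "max_pct").
INTEL_PSTATE_MAX_PERF_PCT = "/sys/devices/system/cpu/intel_pstate/max_perf_pct"


def estimated_cap_mhz(max_turbo_mhz, max_perf_pct):
    """Rough frequency ceiling for a given percentage of the maximum turbo clock."""
    return max_turbo_mhz * max_perf_pct / 100.0


def set_max_perf_pct(percent, path=INTEL_PSTATE_MAX_PERF_PCT):
    """Write the percentage cap to the given file (needs root for the real sysfs path)."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    with open(path, "w") as handle:
        handle.write(str(percent))


if __name__ == "__main__":
    # 4800 MHz is an assumed maximum turbo clock, used for illustration only.
    print(f"estimated cap at 64%: {estimated_cap_mhz(4800, 64):.0f} MHz")
```

Under those assumptions, 64 percent works out to roughly 3.1 GHz, which is in line with the 3100 MHz observed above.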
You can try to replace intel_powerclamp with intel_pstate in the xml config. This will also remove turbo dynamically.
Thanks, Srinivas
@spandruvada This seems to have done the trick.
A couple of side notes on environmental changes before today's test:
I will not have as much time today to eliminate some of these as causes, but I'll continue testing over the coming days with no additional cooling fan, reverting the kernel, reverting the undervolt and returning to the intel_powerclamp cooling device.
With intel_pstate as the cooling device, throttling operates in a much less aggressive fashion; during the benchmark it preferred going back to the base max clock or just slightly below:
cpu MHz : 3323.989
cpu MHz : 3686.501
cpu MHz : 3111.790
cpu MHz : 3226.374
cpu MHz : 4416.728
cpu MHz : 4129.172
cpu MHz : 4486.201
cpu MHz : 4459.295
cpu MHz : 4380.261
cpu MHz : 4379.540
cpu MHz : 4474.392
cpu MHz : 4313.231
cpu MHz : 4348.753
cpu MHz : 4494.028
cpu MHz : 4141.291
cpu MHz : 4479.671
cpu MHz : 2184.970
cpu MHz : 2342.178
cpu MHz : 3997.233
cpu MHz : 1649.265
cpu MHz : 3600.360
cpu MHz : 3220.999
cpu MHz : 4390.139
cpu MHz : 4448.410
cpu MHz : 2851.323
cpu MHz : 4490.509
cpu MHz : 4294.678
cpu MHz : 3259.228
cpu MHz : 4101.155
cpu MHz : 4100.041
The benchmark scores are actually 1% better than in no-turboboost mode, so this is definitely moving in the right direction. This is still about 1.5% less than when I limit the clock frequency to 3400 MHz and disable thermald altogether, but I'm happy with what thermald achieves here, and I can obviously tune this further without such aggressive power throttling.
Metric | Value |
---|---|
FPS: | 73.0 |
Score: | 3055 |
Min FPS: | 31.8 |
Max FPS: | 127.3 |
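The comparison between runs is simple arithmetic on the benchmark scores reported in this thread (3017 without turbo boost, 2051 with intel_powerclamp throttling, 3055 with intel_pstate throttling). A short sketch, with `percent_change` as a hypothetical helper name:

```python
def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Scores copied from the benchmark tables in this thread.
NO_TURBO_SCORE = 3017
POWERCLAMP_SCORE = 2051
PSTATE_SCORE = 3055

if __name__ == "__main__":
    print(f"intel_powerclamp vs no turbo: {percent_change(NO_TURBO_SCORE, POWERCLAMP_SCORE):+.1f}%")
    print(f"intel_pstate vs no turbo: {percent_change(NO_TURBO_SCORE, PSTATE_SCORE):+.1f}%")
```

The intel_pstate run comes out a little over 1% ahead of the no-turboboost baseline, while the intel_powerclamp run was roughly a third slower.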
The thermald config I used:
<?xml version="1.0"?>
<ThermalConfiguration>
<Platform>
<Name>_TRT export</Name>
<ProductName>XPS 15 9570</ProductName>
<Preference>QUIET</Preference>
<ThermalZones>
<ThermalZone>
<Type>B0D4</Type>
<TripPoints>
<TripPoint>
<SensorType>B0D4</SensorType>
<Temperature>85000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_pstate</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TSKN</Type>
<TripPoints>
<TripPoint>
<SensorType>TSKN</SensorType>
<Temperature>55000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_pstate</type>
<influence>300</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TMEM</Type>
<TripPoints>
<TripPoint>
<SensorType>TMEM</SensorType>
<Temperature>55000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_pstate</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>NGFF</Type>
<TripPoints>
<TripPoint>
<SensorType>NGFF</SensorType>
<Temperature>40000</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>intel_pstate</type>
<influence>100</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
</ThermalZones>
</Platform>
</ThermalConfiguration>
The thermald debug log: thermald-debug-turboboost-uv125-pstate-thermald85.log Temperature and clock logging: temp-gpu.log
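To double-check a config like the one above before restarting the daemon, something along these lines can list each trip point in degrees Celsius. This is only a sketch: it assumes the element names used in the config above and that `Temperature` is in millidegrees Celsius (85000 = 85C), and `list_trip_points` is a hypothetical helper name, not part of thermald.

```python
import xml.etree.ElementTree as ET


def list_trip_points(xml_text):
    """Return (zone, sensor, temp_c, cooling_device, influence) tuples.

    temp_c is None when the Temperature element is not a plain number.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for zone in root.iter("ThermalZone"):
        zone_type = (zone.findtext("Type") or "?").strip()
        for trip in zone.iter("TripPoint"):
            sensor = (trip.findtext("SensorType") or "?").strip()
            raw = (trip.findtext("Temperature") or "").strip()
            # Assumed unit: millidegrees Celsius.
            temp_c = int(raw) / 1000 if raw.isdigit() else None
            for dev in trip.iter("CoolingDevice"):
                dev_type = (dev.findtext("type") or "?").strip()
                influence = int((dev.findtext("influence") or "0").strip())
                rows.append((zone_type, sensor, temp_c, dev_type, influence))
    return rows
```

For the config above this should report B0D4 at 85.0, TSKN and TMEM at 55.0 and NGFF at 40.0, all using `intel_pstate`.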
@spandruvada You told me the ACPI information from the XPS was junk when I exported it for you last time. I was diagnosing a problem with the laptop GPU powering back on after suspend and found the following kernel options worked for me:
GRUB_CMDLINE_LINUX="acpi_osi=! acpi_osi='Windows 2009'"
This resulted in correct powering on. So I then started to wonder, maybe I should acpidump for you again, here is the result with the above kernel options: acpidump.txt
I let thermald recreate the auto-config with this kernel parameter, and the config still seems to be the same as the default from before. I'll continue using the above config for now while I keep testing the thermal stability.
<?xml version="1.0"?>
<ThermalConfiguration>
<Platform>
<Name>_TRT export</Name>
<ProductName>XPS 15 9570 </ProductName>
<Preference>QUIET</Preference>
<ThermalZones>
<ThermalZone>
<Type>B0D4</Type>
<TripPoints>
<TripPoint>
<SensorType>B0D4</SensorType>
<Temperature>*</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>rapl_controller</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TSKN</Type>
<TripPoints>
<TripPoint>
<SensorType>TSKN</SensorType>
<Temperature>*</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>rapl_controller</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TMEM</Type>
<TripPoints>
<TripPoint>
<SensorType>TMEM</SensorType>
<Temperature>*</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>rapl_controller</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>NGFF</Type>
<TripPoints>
<TripPoint>
<SensorType>NGFF</SensorType>
<Temperature>*</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>rapl_controller</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
<ThermalZone>
<Type>TVGA</Type>
<TripPoints>
<TripPoint>
<SensorType>TVGA</SensorType>
<Temperature>*</Temperature>
<type>passive</type>
<ControlType>SEQUENTIAL</ControlType>
<CoolingDevice>
<type>rapl_controller</type>
<influence>200</influence>
<SamplingPeriod>1</SamplingPeriod>
</CoolingDevice>
</TripPoint>
</TripPoints>
</ThermalZone>
</ThermalZones>
</Platform>
</ThermalConfiguration>
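A quick way to spot zones that carry no explicit limit in a generated file is to flag every trip point whose `Temperature` is not a number. The sketch below assumes that a non-numeric value such as `*` means the file itself sets no limit for that sensor; that reading is my assumption, and `zones_without_numeric_trip` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def zones_without_numeric_trip(xml_text):
    """Return (zone, raw_temperature) for trip points with no numeric limit."""
    root = ET.fromstring(xml_text)
    flagged = []
    for zone in root.iter("ThermalZone"):
        zone_type = (zone.findtext("Type") or "?").strip()
        for trip in zone.iter("TripPoint"):
            raw = (trip.findtext("Temperature") or "").strip()
            # Assumption: anything that is not a plain integer (e.g. "*")
            # is treated as "no explicit limit in this file".
            if not raw.isdigit():
                flagged.append((zone_type, raw))
    return flagged
```

Run against the auto-generated config above, this should flag all five zones (B0D4, TSKN, TMEM, NGFF and TVGA).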
We also have to check the data in sysfs by using the following command:
/sys/class/thermal
So after doing some more investigation, it seems that thermal throttling and shutdown events have affected other XPS owners as well. According to the hot spot analysis at the following URL, the MOSFETs on the mainboard reach very high temperatures:
https://www.ultrabookreview.com/14875-fix-throttling-xps-15/
In this image (courtesy of iunlock of Notebookreview), the MOSFETs and chokes have been identified with their temperatures under load. Throttling occurs around 78C.
So I went ahead and bought some thermal pads and added them onto the MOSFET chips. Even with thermald disabled, and turboboost enabled, I am no longer getting any thermal shutdowns.
Setting the kernel parameters to GRUB_CMDLINE_LINUX="acpi_osi=! acpi_osi='Windows 2009'"
resulted in loss of display brightness control and some trackpad issues, so I've reverted to the standard options here.
As mentioned I had previously undervolted the Intel CPU to further reduce thermal load and help increase battery life: undervolt --gpu -75 --core -125 --cache -125 --uncore -125 --analogio -125
using https://github.com/georgewhewell/undervolt
I just had a firmware update today to BIOS 1.4.1. I noticed there was an update to the EC firmware as part of this update, so here is the acpidump
for this version:
acpidump-v1.4.1.txt
Here is the thermal zone information using the grep command: sys-class-thermal-v1.4.1.txt
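If grep is not convenient, roughly the same dump can be produced with a short script. This is a sketch under my own assumptions: it only reads regular files one level below each entry in `/sys/class/thermal`, and it records read failures (such as the `trip_point_*_hyst` files thermald complained about) instead of aborting. The name `dump_thermal_sysfs` is hypothetical.

```python
import os


def dump_thermal_sysfs(root="/sys/class/thermal"):
    """Return {path: contents} for files one level below each entry in root.

    Files that cannot be read are recorded as "ERROR: <reason>".
    """
    results = {}
    for entry in sorted(os.listdir(root)):
        entry_path = os.path.join(root, entry)
        # Entries are usually symlinks to directories; isdir follows them.
        if not os.path.isdir(entry_path):
            continue
        for name in sorted(os.listdir(entry_path)):
            path = os.path.join(entry_path, name)
            if not os.path.isfile(path):
                continue
            try:
                with open(path, errors="replace") as f:
                    results[path] = f.read().strip()
            except OSError as exc:
                results[path] = f"ERROR: {exc.strerror}"
    return results
```

Printing the result with `for path, value in dump_thermal_sysfs().items(): print(path, value)` gives one line per attribute; some files may only be readable as root.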
When letting thermald
detect the configuration automatically, we're still getting the rapl_controller
type with no limits on the sensors. Basically the same configuration that we started with on this thread.
I'm at a point where I'm no longer experiencing the thermal shutdowns now, so I think I can call this case closed and that there is an inherent thermal hardware design issue with this laptop which can be improved with the thermal pads.
In terms of thermald, I'm still running the configuration we agreed on earlier, and it still operates correctly to keep system temperatures in check. Using the intel_pstate device type performed the best. It seems like Dell are using non-standard addressing on some of their sensors. This configuration works to prevent thermal shutdowns; however, the pstate driver does tend to limit the performance of the system, and the slowdown is noticeable in a gaming situation.
I cannot discover or see the RPM of the system fans using the lm-sensors package. My suspicion is that there is some Dell driver in Windows that enables the OS to see this, and we just don't have that available to us in Linux yet. Dell have supported Ubuntu on other XPS and Precision platforms in the past, and I'm unsure if they added this support to those models, so unless Dell get back to you about this, I'm unsure if we can ever get thermald detecting all sensors in the system correctly.
It would actually be great if there was some way to have fan control under Windows too... It can't be that my 9570 turns on the stupid fans at 45°C and then runs them for 2 mins while the temps are 41°C... I hate this laptop.
I just posted this on reddit a few hours ago, describes how I got my sensors working on a Dell XPS 9550 https://www.reddit.com/r/Dell/comments/9pdgid/configuring_the_xps_to_play_nice_with_linux/
@tonylambiris Cheers for the comment, great to get the sensor data coming in.
@berglh I was actually quite surprised to even get stats on the GPU fan!
Please, guys, I tried to upgrade my Dell XPS 9570 BIOS and it failed; the laptop shut down and never turned on again. Can anyone help me with a BIOS bin file so that I can reprogram the BIOS?
I'm reporting a thermal shutdown event that seems very similar to the following issue. I've tried similar troubleshooting steps provided by @spandruvada: https://github.com/intel/thermal_daemon/issues/158.
Main Problems:
Environment:
Dell BIOS: 1.3.0 (Also present in 1.2.0, 1.1.3)
CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
Operating Systems: Ubuntu 18.04.1 LTS
Thermald Version: 1.7
Ubuntu Thermald Package: 1.7.0-5ubuntu1
Kernel: 4.17.14-041714-generic (Also present with 4.15.0-29, 4.15.0-30, 4.17.11-041711)
Nvidia Driver: 396.51-0ubuntu0~gpu18.04.1
TLP Package: 1.1-2ubuntu1
Details:
Initially, running Unigine_Valley, I could run with Turbo Boost enabled and not experience any thermal shutdown events. Now, under full load with Turbo Boost enabled, the laptop experiences a thermal shutdown event triggered by the BIOS. I tried to set my limit to 90 degrees C as per #158 and still experienced the thermal shutdown event; here is the debug log: thermald-debug.log
Trying to set dbus parameters as per #158 gives me the following error:
Upon starting the daemon, I receive error message regarding reading of the thermal zone trip points:
Catting all the values of all the files with sudo in the thermal_zone folder shows issues with reading the aforementioned files: thermalzonefiles.txt
The auto-generated XML is as follows; as mentioned, I tried to change the temperature to 90C
90000
as evident in the debug log:
I'd appreciate assistance with this issue. On initial OS installation I didn't have this problem; I'm guessing some firmware or package update has broken correct thermal throttling of the CPU.