system76 / firmware-open

System76 Open Firmware

Oryp7 unhealthy jumpy thermals during gaming #224

Open Raikiri opened 2 years ago

Raikiri commented 2 years ago

I'm running Windows 10 with Libre Hardware Monitor to track temps of my CPU and GPU. I noticed that during gaming CPU thermals jump wildly from 60C to as high as 90C sometimes multiple times during 10 seconds.

Usually it happens when stuff is happening in a game that suddenly spikes temperature to 85C+ in a matter of seconds while the fans are practically idling. After a second fans start blasting like nobody's business until temp drops to below 60C in 3 seconds or so. Then they practically turn off again thinking that their job is done and the cycle obviously repeats.

I have two issues here: 1) Sometimes I see temperature jump in Libre Monitor to 85C and the fans are still idling, sometimes requiring more than a couple seconds to "react". Well first, how is it even possible that the temperature changes so quickly? But if it does, I think there should be no smoothing applied to the thermal curves and the fans should be blasting full speed before a meatbag like myself can even notice it in a 3rd party temp monitor program.

2) When I'm gaming, I want my lowest RPM to be at the minimal level that sustains acceptable temperature long term, not just momentarily. For example, if the temperature momentarily drops below 60C, it does not mean that the fans should be turned off if during the last half a minute they were spinning at 3k RPM and the temperature was 65C. What I'm saying is, it should consider some sort of moving average rather than reading the momentary temperature and then trying to smooth the result (which I believe it currently does not even attempt).

PS the issue mostly affects CPU temperature, because GPU temperature seems to be way less jumpy in comparison. It usually takes at least 5-10 seconds for the GPU to reach a high temperature and the cooling system has enough time to react, but CPU jumps to very high temperature very quickly.

jacobgkau commented 2 years ago

Have you tested with self-built firmware from this repository, or are you on the currently published firmware? There have already been improvements made to cooling behavior since the current firmware was published, including fan speed ramp-up/ramp-down (with a corresponding decrease in reaction time) and syncing the CPU and GPU fans together since the heatsinks are connected. If you don't want to build and flash firmware yourself, these improvements will be part of an upcoming regular firmware update that is currently being tested.
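For readers unfamiliar with the ramp-up/ramp-down idea mentioned above, here is a minimal sketch of a rate-limited fan duty. It is not the EC's actual implementation: the function names and the step sizes (`step_up`, `step_down`) are hypothetical and chosen only for illustration.

```python
def ramp_duty(current: int, target: int, step_up: int = 8, step_down: int = 2) -> int:
    """Move the fan duty (0-100 %) one tick toward `target`.

    step_up and step_down are made-up per-tick limits, not firmware values.
    """
    if target > current:
        return min(target, current + step_up)
    return max(target, current - step_down)


def simulate(targets, start=0):
    """Apply ramp_duty once per tick for a sequence of target duties."""
    duty = start
    trace = []
    for target in targets:
        duty = ramp_duty(duty, target)
        trace.append(duty)
    return trace


# Three ticks asking for full speed, then three ticks asking for fans off.
trace = simulate([100, 100, 100, 0, 0, 0], start=20)
print(trace)
```

With an asymmetric limit like this, the duty climbs faster than it falls, so a short dip in the requested duty does not immediately bring the fans back to idle.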

Raikiri commented 2 years ago

I have not flashed the firmware myself as I'm waiting for an official update, so I'm running the current "stable" version.

Yes, I'm well aware that you guys smoothed the cooling curves, but I'm not sure if it will fix this problem. Here's my reasoning:

1) Smoothing ramp up/ramp down curves can result in even slower ramp-up reaction than there is right now. And as I mentioned, sometimes I see the temperature reaching 90C in 3rd party monitor before the fans even start spinning. In this case, it's more of a delay issue rather than smoothing (so, the opposite). This delay can be up to 1-3 seconds which can pose a significant problem for a system under sudden stress load.

2) I think smoothing can produce less stable results than running average. For example, if there was a huge temp spike for a short time, smoothing it will produce a strong blast that will gradually decay to 0 after the temperature normalizes. But it also means that when the next spike hits, the fans will have potentially turned off at that point. In contrast, if it was using a running average (or running maximum) of the temperature for the last, say, 10 seconds window, then a spike would produce a steady high-rate rpm for the fan so that if the next spike hits during these 10 seconds, cooling system will be "prepared" for it already.

jacobgkau commented 2 years ago

> Smoothing ramp up/ramp down curves can result in even slower ramp-up reaction than there is right now.

This is why the reaction time was reduced, as I mentioned. The delay you currently see was to prevent short bursts of high fan speed. That's no longer as much of an issue with smoothing, so the fans will respond slightly quicker.

A 10-second "running maximum" would imply a 10-second delay in response for decreases in temperature. Using an average instead of the current temperature would also add more delay, as it would take longer for the average to rise/fall than the actual temperature.
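The lag of an average can be illustrated with a small sketch. The one-sample-per-second rate, the 5-sample window, and the step from 60C to 90C are all assumptions made for the example, not measurements from this thread.

```python
from collections import deque


def moving_average(samples, window):
    """Average of the last `window` samples (fewer while the buffer fills)."""
    buf = deque(maxlen=window)
    out = []
    for sample in samples:
        buf.append(sample)
        out.append(sum(buf) / len(buf))
    return out


# Hypothetical trace: the temperature steps from 60C to 90C at sample 5.
temps = [60] * 5 + [90] * 5
avg = moving_average(temps, window=5)
print(avg)
```

In this trace the raw reading is at 90C from sample 5 onward, while the 5-sample average only reaches 90C at sample 9, once the whole window has filled with the new value.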

The smoothing + slight remaining delay created something that felt like an average last time I tested it, so I'd recommend you wait and try it out. The upcoming firmware update is for all Open Firmware laptops and will still be in testing for a little while, so if you want to give it a try now, you can install Rust using the command on this website and then build/flash on your Oryx Pro using these commands:

git clone https://github.com/system76/firmware-open
cd firmware-open
./scripts/update.sh
./scripts/deps.sh
./scripts/build.sh oryp7
./scripts/flash.sh oryp7

The flashing script will power off your machine, so save any work you have open before running it. As long as you remain plugged into the charger through this entire process, it should be fairly low-risk. Once you're on the self-built firmware, your Oryx Pro will prompt you for a firmware "update," which you can install at any time through the GUI to go back to the regular published firmware.

If you find that the fan behavior is still not satisfactory, this issue should probably be transferred to the EC repository, since that is where most of the work on fan behavior happens. There's also some discussion about fan curves here: https://github.com/system76/ec/issues/180

Raikiri commented 2 years ago

> A 10-second "running maximum" would imply a 10-second delay in response for decreases in temperature

Maybe I didn't explain it clearly, but a 10-second running maximum keeps track of maximum temperature that was encountered during the last 10 seconds. So if you suddenly get a spike in temperature, then this running maximum will immediately assume this value and it will last for the next 10 seconds unless another spike occurs.
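A running maximum as described here could be sketched as follows. The window of 3 samples and the temperature trace are hypothetical, and nothing here reflects what the firmware actually does.

```python
from collections import deque


def running_max(samples, window):
    """Maximum of the last `window` samples (fewer while the buffer fills)."""
    buf = deque(maxlen=window)
    out = []
    for sample in samples:
        buf.append(sample)
        out.append(max(buf))
    return out


# Hypothetical trace with a single one-sample spike to 90C.
temps = [60, 60, 90, 60, 60, 60, 60, 60]
held = running_max(temps, window=3)
print(held)
```

The output follows the spike on the very sample where it occurs and then holds it until the spike leaves the window. So the rise is immediate, while a fall is delayed by up to one window length.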

I think I will after all give the master version of the firmware a try, thanks for the guide.

Raikiri commented 2 years ago

So I ran the new firmware. I did not expect it to take 3GB of download, half an hour of build time, and the need to switch to linux to build it, but that's beside the point. I definitely noticed that it has fan curves that the majority of users will see as a huge improvement. However, my problem still persists: the temps still do hit 90-95C in very sudden spikes and fans take 2-3 seconds to spin up when that happens.

But after thinking some more about it, I came to a conclusion that the problem seems to be more in thermal capacity of the CPU heat sink rather than software that controls the fans. I think temperature should not be able to spike from 60 to 95 in literally one second. I've never seen this problem in any other laptop, because typically heat dissipation is the main problem for most models when fans are just not strong enough and can't dissipate enough thermal energy. But it's definitely not the case here: when fans spin, they do keep the temp impressively low while producing impressively little noise. It seems to be specifically an issue of low thermal capacity and specifically of the CPU heatsink.
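The thermal-capacity argument can be made concrete with a single-node lumped model, C * dT/dt = P - (T - T_ambient) / R, integrated with simple Euler steps. This is only a rough sketch: every number below (80 W load, 1 K/W resistance, 2 J/K vs 20 J/K capacity) is a made-up illustration, not a measurement of the oryp7 heatsink, and a real CPU and heatsink are not a single thermal node.

```python
def simulate_temp(power_w, capacity_j_per_k, resistance_k_per_w,
                  t_start_c, t_ambient_c, duration_s=1.0, dt_s=0.01):
    """Euler integration of C * dT/dt = P - (T - T_ambient) / R."""
    temp = t_start_c
    for _ in range(int(round(duration_s / dt_s))):
        heat_out_w = (temp - t_ambient_c) / resistance_k_per_w
        temp += (power_w - heat_out_w) / capacity_j_per_k * dt_s
    return temp


# Same load and same cooling path, only the heat capacity differs (all hypothetical).
low_capacity = simulate_temp(80.0, 2.0, 1.0, t_start_c=60.0, t_ambient_c=25.0)
high_capacity = simulate_temp(80.0, 20.0, 1.0, t_start_c=60.0, t_ambient_c=25.0)
print(round(low_capacity, 1), round(high_capacity, 1))
```

Both cases head toward the same steady-state temperature, because that is set by the power and the thermal resistance. The heat capacity only decides how fast the temperature gets there, which is why a low-capacity heatsink shows sharp spikes under a sudden load.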

curiousercreative commented 2 years ago

FWIW, I observe similar spikes in CPU temperature on 2015 15" MacBook Pro, so it's far from unique to these Clevo laptops. Probably more of an Intel + lighter laptop thing (most laptops).

curiousercreative commented 2 years ago

Can this issue be closed?

Raikiri commented 2 years ago

So I ran the new firmware a fair bit. The problem still reproduces, but with different behaviour. Basically, the thermals during gaming stay around 80-85C most of the time, which I guess is acceptable for games that don't support framerate limiting. But what happens occasionally is that the temp very quickly and very unstably jumps to 90-95C and then drops back in literally a second.

I don't think it's reasonable to react to this by adjusting the fan speed when this happens, it just happens way too fast. Instead, I'd prefer my fans to just overcool my CPU when I'm gaming so that (roughly speaking) the temp jumps from 75 to 85 instead of jumping from 85 to 95. But there's no way currently to tell that my desired temp is actually 75 and I don't mind my coolers working harder when I'm gaming in anticipation of spikes rather than trying to react to those spikes.
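The "desired temperature" idea could look roughly like the sketch below: a proportional rule that raises the fan duty as a slow-moving average temperature exceeds a user-chosen target. No such setting exists in the firmware as far as this thread says. The function name, `gain`, and `min_duty` are hypothetical.

```python
def duty_for_target(avg_temp_c: float, target_c: float,
                    gain: float = 5.0, min_duty: float = 20.0) -> float:
    """Fan duty (percent) from how far the averaged temperature is above target.

    gain is percent of duty per degree above target; both defaults are made up.
    """
    error = avg_temp_c - target_c
    return max(min_duty, min(100.0, min_duty + gain * error))


# With a 75C target, the fans already work harder at 80C than a reactive curve would.
for temp in (70.0, 80.0, 95.0):
    print(temp, duty_for_target(temp, target_c=75.0))
```

Feeding this rule an averaged temperature rather than the instantaneous reading is what would keep the fans from chasing sub-second spikes.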

jackpot51 commented 2 years ago

@Raikiri I wonder if maybe the thermal paste was not applied correctly in the factory.

Raikiri commented 2 years ago

@jackpot51 Usually when the thermal paste is applied poorly, heat transfer is inefficient, the air blasted out of the fan is lukewarm, and the temperature of the core is very high. I actually had that happen on an MSI, and it was fixed after repasting. In my case with s76, the vented air is properly hot and the core temp drops quickly when the fans are doing their job (which does not happen with a poor thermal interface), but it feels like the thermal capacity of the heatsink is too low.

But maybe it's not the reason and I'm wrong. I can attempt to repaste my laptop, but I never tried it before, so will need to do plenty of research to make sure I don't mess anything up.

leviport commented 2 years ago

@Raikiri If you haven't found it already, the oryp7 tech-docs may be helpful: https://tech-docs.system76.com/models/oryp7/repairs.html#replacing-the-cooling-system

curiousercreative commented 2 years ago

@Raikiri not much to be done on the heatsink capacity, but you can easily modify your fan curve in firmware to hit 100% fans at 80C for example. Won't prevent temperature spikes of course, but should help if you don't mind the noise. This is the fan curve I run: https://github.com/curiousercreative/ec/blob/galp5/src/board/system76/galp5/board.mk#L44. Breakpoints are to match thermal targets set by system76-power profiles, which probably aren't available in Windows.
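As a rough illustration of what a fan curve with breakpoints means, here is a piecewise-linear lookup that reaches 100% at 80C. The breakpoints are invented for the example and are not the values from the linked `board.mk`, and whether the EC interpolates between points or steps between them is not something this sketch claims.

```python
def fan_duty(temp_c, curve):
    """Linearly interpolate duty (percent) from (temp_c, duty) points sorted by temp."""
    if temp_c <= curve[0][0]:
        return float(curve[0][1])
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
    return float(curve[-1][1])


# Hypothetical breakpoints: 25% up to 50C, 50% at 65C, 100% from 80C upward.
CURVE = [(50, 25), (65, 50), (80, 100)]

for temp in (40, 65, 72.5, 90):
    print(temp, fan_duty(temp, CURVE))
```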

Raikiri commented 2 years ago

Here's an example of a typical temp spike on the CPU (x axis is time in seconds, y axis is temp in C):

[image]

I marked the point where I start compilation of some project; the CPU starting temp is 65C. You can see how the temp spikes from 65C to 90C in about 0.2s, then the fan kicks in after another 0.2s and then the temp stabilizes at 75C.

So everything happens very quickly, there's no time for the fans to react to that, and in the same manner CPU can sometimes briefly reach 95C easily.

Localacct21 commented 2 years ago

I followed the build instructions above, and these are the errors my Oryp8 spits out.

log.txt

I would really like to understand why I cannot build the firmware. I was able to build it just fine last month. Different install but same PC.

Localacct21 commented 2 years ago

still having issues trying to build the firmware @ahoneybun please help

make: Entering directory '/home/slt/firmware-open/apps/firmware-setup'
mkdir -p build/x86_64-unknown-uefi-drv
cargo rustc \
    -Z build-std=core,alloc \
    -Z build-std-features=compiler-builtins-mem \
    --target x86_64-unknown-uefi-drv \
    --release \
    -- \
    -C soft-float \
    --emit link=build/x86_64-unknown-uefi-drv/boot.efi
make: cargo: No such file or directory
make: *** [Makefile:45: build/x86_64-unknown-uefi-drv/boot.efi] Error 127
make: Leaving directory '/home/slt/firmware-open/apps/firmware-setup'

curiousercreative commented 2 years ago

@Localacct21 please try using a multi-line code block for that. To do that, triple back tick (`) to open and again to close, so: ``` blah blah some errors blah blah some errors blah blah some errors blah blah some errors ```

When I attempt to follow these instructions, I receive a different error. This may be resolved by running apt upgrade, but I'm holding off on that kernel upgrade as I run ZFS on this system:

git clone https://github.com/system76/firmware-open
cd firmware-open
./scripts/update.sh
./scripts/deps.sh

Installing system build dependencies
[sudo] password for curiouser: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
bison is already the newest version (2:3.7.5+dfsg-1).
build-essential is already the newest version (12.8ubuntu3).
ccache is already the newest version (4.2-1build1).
cmake is already the newest version (3.18.4-2ubuntu1).
dosfstools is already the newest version (4.2-1build1).
flex is already the newest version (2.6.4-8).
libncurses-dev is already the newest version (6.2+20201114-2build1).
msr-tools is already the newest version (1.3-3).
mtools is already the newest version (4.0.26-1).
parted is already the newest version (3.4-1).
uuid-dev is already the newest version (2.36.1-7ubuntu2).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu6).
avr-libc is already the newest version (1:2.0.0+Atmel3.6.2-1.1).
avrdude is already the newest version (6.3-20171130+svn1429-2).
devmem2 is already the newest version (0.0-0ubuntu2).
flashrom is already the newest version (1.2-5).
gcc-avr is already the newest version (1:5.4.0+Atmel3.6.2-1).
git-lfs is already the newest version (2.13.2-1).
gnat is already the newest version (10ubuntu1).
nasm is already the newest version (2.15.05-1).
python2 is already the newest version (2.7.18-2).
python2 set to manually installed.
sdcc is already the newest version (4.0.0+dfsg-2).
curl is already the newest version (7.74.0-1ubuntu2.3).
python3-distutils is already the newest version (3.9.5-0ubuntu3~21.04).
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 udev : Breaks: systemd (< 247.3-3ubuntu3.7pop1)
        Breaks: systemd:i386 (< 247.3-3ubuntu3.7pop1)
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
Failed to install dependencies!

jacobgkau commented 2 years ago

@curiousercreative If you're intentionally not updating your system (which is not generally recommended), you can try just updating udev with sudo apt install systemd udev/whatever specific packages need to be updated.

We have recently started packaging ZFS and I know that it works with the kernel that we're currently shipping, at least for a basic partitioning setup; it's one of the things currently preventing us from releasing kernel 5.16.

curiousercreative commented 2 years ago

@jacobgkau oh hey, thanks for sharing the ZFS update. I wasn't planning to hold back long, thought it'd be a few days. I was keeping an eye on this issue which can probably be updated and closed: https://github.com/pop-os/pop/issues/2032

MilesBHuff commented 2 years ago

@Raikiri I had exactly the same issue. Two things:

  1. Add thermal pads to the top of your CPU heatsink, so that as much of it as possible is in thermal contact with the bottom chassis, which itself is aluminium and actively cooled by the fans. This will expand your thermal capacity by quite a lot. I went from idling between 50, to around 40 -- and that's with the fans off.
  2. Try my fan curve, which is smoothly interpolated and hits 100% duty before Tjunction, so there should be no throttling. https://github.com/system76/ec/pull/179

As for paste, I did see improvements by repasting; but that'd be true for most laptops. My intuition is very much that your analysis of the CPU heatsink just not having enough capacity is correct.

universebreaker commented 2 years ago

glad that I found this github issue as I'm trying to solve the same cooling problem on my Oryp6 (which has almost the same internal structure and cooling system as Oryp7)

I recently re-pasted it with Arctic Silver 5. With single-digit CPU load (web gaming using the 2080), the CPU temp can still go up to ~85 (cores 1-8 range from 64 to 85, with the 2080 at ~64), and it easily jumps to ~92 with ~10% CPU load.

I'm not sure if it's more about the heatsink's capacity or putting not enough paste on it (buttered toast, thin but enough to cover the die), but I'm planning to try Kryonaut and mod the cooling system (adding water cooling pipe onto it) at the same time. Wonder if anyone can provide the measure of the gap thickness between the pipes and the bottom chassis, then I can start modding it like the water cooling system on Eluktronics notebook.

Raikiri commented 1 year ago

Ok I got an update on this. Finally all my repasting tools arrived so I repasted my laptop with kryonaut. When I removed the heatsink, the previous layer of paste was a little dry but I don't think it was critically bad. The layer was uniform enough as far as I could tell.

But what really struck me is that there was basically no mounting pressure on the heatsink clamps. There are supposed to be springy levers that the screws push in, but these mounts were sitting flat against the screw holes with no tension in them at all. On this image:

[image]

screws 1-6 were all under basically zero pressure. I'm not sure if this is what caused my problem, but it's definitely not great, so I bent all these springs upwards a little bit so that they push the heatsink down at least slightly when I screw them back. After applying a fresh layer of Kryonaut and reassembling the thing, it seems to work alright so far.

I will need to run more tests before I can conclude whether the problem is now fixed or not. Definitely has not gotten any worse, but I don't have the same compilation test setup as I had in the OP.

UPD: after more tests under high load, it looks like the situation has not improved by much. Thermals are pretty much exactly the same as they were before reapplying the thermal paste, at least within my measurement error.

MilesBHuff commented 1 year ago

@Raikiri Try replacing the stock thermal pads with copper shims. Also try thermal-taping additional copper to the top of the heatsink. Then put a thermal pad on-top of that, so that it touches the aluminium chassis. That will add a shitload of extra thermal capacity, and even increase active cooling (since the chassis is air-cooled by the fans).

Also install thermald.

Fact of the matter is: there is no universe in which that tiny amount of heatsink (as shown in the picture) is going to be adequate to cool an i7. The only solution is to add more heatsink. Repasting, as you did, only increases the rate of thermal transfer from the CPU; that only helps if there's actually somewhere for that thermal energy to go.

You should also look into flashing your own custom fan curve to the EC; the default one is terrible, and doesn't reach 100% until 90ºC, long after the CPU has already started thermal-throttling.