pop-os / system76-power

Power profile management for Linux
GNU General Public License v3.0
GALP5 - Broken power curve #219

Open ZeddieXX opened 3 years ago

ZeddieXX commented 3 years ago

Distribution (run cat /etc/os-release):
NAME="Pop!_OS"
VERSION="20.10"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 20.10"
VERSION_ID="20.10"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=groovy
UBUNTU_CODENAME=groovy
LOGO=distributor-logo-pop-os

Related Application and/or Package Version (run apt policy $PACKAGE_NAME):
system76-power:
Installed: 1.1.14~1612296011~20.10~876e3c7
Candidate: 1.1.14~1612296011~20.10~876e3c7
Version table:
*** 1.1.14~1612296011~20.10~876e3c7 1001
1001 http://ppa.launchpad.net/system76/pop/ubuntu groovy/main amd64 Packages
100 /var/lib/dpkg/status

Issue/Bug Description: Before the update, in battery mode, I could get 2.4 GHz max out of the CPU, and when idle it would fall but never dip into the MHz range. Now it starts at 2.4 GHz but falls to 218 MHz, even when there's full load on the CPU! This makes everything unusable. The Activities screen comes up laggy and choppy a few seconds after pushing the Super key.

Before the update, Balanced and Performance modes were pretty much the same, letting me go as high as 3.6 GHz (falling back to 3.4 GHz with sustained full CPU load). After the update, Balanced peaks at 3.8 GHz and settles at 3.6 GHz under sustained CPU load, while Performance peaks at 4 GHz and settles at 3.9 GHz under sustained CPU load. As a result, the CPU fan spins up higher and stays on.

Did you guys change something in PopOS?

Steps to reproduce (if you know): I used cpuminer-opt to do CPU load tests (cryptominer), but it doesn't matter what you use. CPU freq will fall too low to a frustrating usability state. I used cpu-x to monitor CPU frequencies.
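For anyone who wants to log the same frequencies without a GUI tool, here is a minimal sketch that parses the per-core "cpu MHz" lines the Linux kernel exposes in /proc/cpuinfo on x86. The helper names parse_cpu_mhz and throttled_cores are hypothetical, and the 400 MHz default floor is simply the 4x figure cited in this report, not something system76-power defines.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core 'cpu MHz' values found in /proc/cpuinfo-style text."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Lines look like "cpu MHz\t\t: 218.000"; skip everything else.
        if sep and key.strip() == "cpu MHz":
            freqs.append(float(value))
    return freqs


def throttled_cores(freqs_mhz, floor_mhz=400.0):
    """Return the indices of cores currently running below floor_mhz."""
    return [i for i, mhz in enumerate(freqs_mhz) if mhz < floor_mhz]
```

On a live system, pass the contents of /proc/cpuinfo to parse_cpu_mhz once a second while the load test runs, and feed the result to throttled_cores to spot cores that have dropped below the floor.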

(Disclaimer - I don't mine with my GALP5 - it was only to test load the CPU, and it serves as my CPU benchmark to compare with other CPUs that I do mine with).

Expected behavior: CPU should not fall to 218 MHz with sustained load. This should only happen in idle mode. There should be a minimum speed of 400 MHz (4x), not 2.2x.

Also the other modes now make GALP5 louder for longer, although it gives better performance at the cost of heat and fan noise.

Other Notes:

jacobgkau commented 3 years ago

This is probably the result of https://github.com/pop-os/system76-power/pull/212, a PR submitted by a community member to make the power profiles have a greater effect. We may be able to balance it out a bit, although what we're running into here would be different users' preferences being different from one another.

The limits for the power profiles (or at least the limits that were changed in that PR) are set in power consumption (measured in watts), not frequency, so simply setting a "minimum speed of XYZ MHz" might not be that simple.
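To see what watt-based limits look like on a running system, one option is the kernel's generic powercap (RAPL) sysfs interface. The sketch below is an illustration only: it assumes an Intel system that exposes a zone directory such as /sys/class/powercap/intel-rapl:0, and it is not necessarily the mechanism system76-power itself uses. read_power_limits is a hypothetical helper name.

```python
from pathlib import Path


def read_power_limits(zone_dir):
    """Map each constraint name (e.g. 'long_term') to its limit in watts.

    zone_dir is assumed to be a powercap zone directory containing
    constraint_N_name and constraint_N_power_limit_uw files.
    """
    zone = Path(zone_dir)
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        index = name_file.name.split("_")[1]
        limit_file = zone / f"constraint_{index}_power_limit_uw"
        if limit_file.exists():
            microwatts = int(limit_file.read_text().strip())
            # sysfs reports microwatts; convert to watts for readability.
            limits[name_file.read_text().strip()] = microwatts / 1_000_000
    return limits
```

Comparing the output before and after switching profiles shows whether a profile changes the power budget rather than any frequency floor.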

> The Action screen comes up laggy and choppy after a few seconds of pushing the Action key.

I'm assuming you're talking about the Activities menu that comes up when pushing the Super key. I'm not seeing this while pinning the CPU with stress or stress-ng in integrated mode, although it isn't much of a surprise to hear. I generally wouldn't recommend running CPU-heavy tasks such as crypto mining (as in your example) in battery mode, or on battery power; while this is obviously a laptop and is portable, stressful tasks like that will either run slower or will drain the battery quickly enough that it's not very useful anyway.

jackpot51 commented 3 years ago

@ZeddieXX do you have the variant with an NVIDIA GPU?

I will go through and tweak the settings a bit. We want battery mode to be fanless, balanced mode should be set up the same way before and after that update, and performance mode should max out the thermals.

ZeddieXX commented 3 years ago

I have the non-GPU variant.

I don't run crypto miners for any significant amount of time. It's what I use to stress test CPUs since I'm familiar with the hash rate as a comparison of performance with other CPUs I've used.

Even with preferences out of the way, a loaded CPU shouldn't drop CPU multiplier to 2.2x (218 MHz according to CPU-X).

My suggestion is more in line with battery consumption and fan noise, and follows Intel's specs of a 4x to 47x multiplier (1165G7).

Battery Mode: Minimum 4x - but only when near idle. Maximum 22x - Whatever the fastest the CPU can run without kicking off the fan.

Balanced Mode: Minimum 6x Maximum 28x - Whatever the fastest the CPU can run with the fan maxed at 3000 RPMs (still quiet).

Performance Mode: Minimum 8x Maximum 47x - Max out the multiplier and fans, let thermal throttle the CPU as needed.
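For readers converting between the multipliers above and the frequencies quoted elsewhere in this thread, here is a small sketch. It assumes a 100 MHz base clock (BCLK), which is what makes 4x equal 400 MHz and 47x equal 4.7 GHz in the figures above. The function names are hypothetical.

```python
BCLK_MHZ = 100.0  # assumed base clock; multiplier * BCLK = core frequency


def multiplier_to_mhz(multiplier, bclk_mhz=BCLK_MHZ):
    """Convert a CPU multiplier (e.g. 4 for 4x) to a frequency in MHz."""
    return round(multiplier * bclk_mhz, 3)


def mhz_to_multiplier(mhz, bclk_mhz=BCLK_MHZ):
    """Convert a measured frequency in MHz back to the implied multiplier."""
    return round(mhz / bclk_mhz, 2)
```

Under that assumption, the 218 MHz reading reported by CPU-X corresponds to a multiplier of about 2.2x, well below the 4x minimum.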

UPDATE: The original figures were guesses. After testing with more granular control over CPU freq, my new suggestions for power profile BASED ON stock fan curves for non-dGPU (CPU-only fan). https://github.com/pop-os/system76-power/issues/219#issuecomment-774236777

I'm more about the noise level. If we can sync the GPU and CPU fans per @curiousercreative's suggestion, Balanced Mode can probably go back to 35x with both fans at 2400 RPMs, leaving Performance Mode as a pure maxed out mode (max CPU multiplier, let temps and fan noise fall where they may).

jackpot51 commented 3 years ago

CPU power limits don't work on frequency or boost. They work on power output. The best way to measure power output is to run the power.sh script in the system76/ec repository for a while, use CTRL+C to exit, and view the power.csv file. It records the CPU power usage, power limits, temperature, temperature limit, and frequency for every core. If you can do that and provide the power.csv file to me, it would help in finding better settings.

jackpot51 commented 3 years ago

I was able to replicate these conditions after stress testing for about 10 minutes. However, after killing the stress test I immediately got back to good frequencies of about 2.1 GHz. Battery mode is not really intended for continuous full CPU load, so I don't believe this is an issue.

ZeddieXX commented 3 years ago

> I was able to replicate these conditions after stress testing for about 10 minutes. However, after killing the stress test I immediately got back to good frequencies of about 2.1 GHz. Battery mode is not really intended for continuous full CPU load, so I don't believe this is an issue.

Even if that is the case, why is it throttling down to 218 MHz on continuous full CPU load? Previous behavior would just keep it around 2.4 GHz (or whatever the power limit is set for Battery Mode).

My use case of using it for heavy work loads in Battery Mode to limit heat and noise will no longer be feasible. Going forward, I will not be able to use the GALP5. What can I do? Is it too late to return for full refund?

jackpot51 commented 3 years ago

It is throttling because it hits 68C, which is the thermal limit on battery mode to allow for passive cooling. I was not aware of users running heavy workloads in Battery mode, and right now it is not an intended use case. We would like to have a mode that is always fanless on supported laptops.

It seems like you need more granular control over the frequency than what we can provide. I would recommend setting the system up in Balance mode, and then using a more advanced set of settings such as https://extensions.gnome.org/extension/1082/cpufreq/ to precisely control CPU frequency limits to get your desired heat output.

ZeddieXX commented 3 years ago

That makes more sense now. I didn't realize there was a thermal limit of 68c for Battery Mode.

The Gnome extension looks interesting, but the latest version of Gnome they offer it for is 3.34. Pop!_OS 20.10 is on Gnome 3.38.

Edit: Never mind. I just installed the Firefox extension and clicked on the "On" switch, and it installed. Thanks! I am now able to limit it to 2.8 GHz on Balanced Mode, and fan noise isn't so bad at 3700 RPMs. This is with the GPU fan still off. Hopefully once we get both fans sync'd, RPMs can be lower for both to maintain temps at 75C.

jackpot51 commented 3 years ago

This is an alternative that might work out: https://extensions.gnome.org/extension/945/cpu-power-manager/

curiousercreative commented 3 years ago

@ZeddieXX another one that might be helpful, search "CPU frequency settings" in pop shop.

ZeddieXX commented 3 years ago

The gnome extension you gave me ( https://extensions.gnome.org/extension/1082/cpufreq/ ) works perfectly for me. It allows me to fine-tune the CPU freq to not go over the temp thresholds for each level of the fan ramp-up speeds.

I was able to quickly find out:

2.2 GHz - 2400 RPMs - 73C - Still very quiet and unnoticeable
2.8 GHz - 3700 RPMs - 77C - Fan noticeable, but reasonable (note: this is the base freq per Intel)
3.2 GHz - 4600 RPMs - 83C - Fan very noticeable, still bearable if performance is needed
3.6 GHz - 5700 RPMs - 87C - Fan is uncomfortably loud
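The data points above can be turned into a simple lookup for "fastest frequency cap that stays under a given fan speed". This is only a sketch built from the measurements in this comment (stock fan curve, non-dGPU GALP5); the names FAN_STEPS and max_freq_for_rpm are hypothetical.

```python
# (frequency cap in GHz, fan speed in RPM, temperature in C),
# copied from the measurements reported in this comment.
FAN_STEPS = [
    (2.2, 2400, 73),
    (2.8, 3700, 77),
    (3.2, 4600, 83),
    (3.6, 5700, 87),
]


def max_freq_for_rpm(rpm_cap, steps=FAN_STEPS):
    """Return the highest measured frequency cap whose fan speed is <= rpm_cap.

    Returns None when even the slowest measured step exceeds rpm_cap.
    """
    allowed = [ghz for ghz, rpm, _temp in steps if rpm <= rpm_cap]
    return max(allowed) if allowed else None
```

For example, a noise budget of 4000 RPM selects the 2.8 GHz cap, and any budget below 2400 RPM returns None because no measured step fits.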

Beyond 3.6 GHz, I figure a user probably just doesn't care about fan noise (at 100%, the fan is at 6300 RPMs) and will just want all-out performance, so I'll probably have a "Full Power" profile without any limitations except thermal. (Max CPU multiplier until you hit 91C; the fans will do everything in their power to keep it at 91C.) In my experience, the CPU will be at 37x (3.7 GHz) at 94C with a single fan at 100%. I think we basically hit the thermal limit. With both fans on 100% (Fn+1), the temps get to 77C and I am able to sit at 40x (4 GHz). That's a good number for all-core full load. I suspect lightly threaded loads will hit the max 4.7 GHz.

Basically I wanted the maximum performance per fan stepping, and this utility helped me find those points. It may be helpful for fine-tuning @curiousercreative's new sync'd CPU and GPU fan curve.

And as a new note, I now know that 68C is the cutoff for fanless, but if that's the case, I'd implore you to see if you can run a CPU on full load at 4x (400 MHz), which is the minimum speed per Intel, and see if you can keep it at 68C at the slowest maintainable fan speed. Even though it's not technically "fanless", as you said, full CPU load in the "fanless" mode is not the usual use case, but sometimes users don't have that choice (background processes, updates, etc). If the laptop has to slow to 2.2x (218 MHz in my case) to stay at 68C fanless, it's practically unusable. I'd rather the fan kick in a bit to keep it at 400 MHz (or whatever is usable) so the user can find out what's causing the CPU utilization and stop it.

Here's my current CPU profiles using that extension. I call it my "Fan Noise Profile", lol.

[Image: Fan Noise Profiles]

I wish I could reorder the custom profiles. It's kinda bugging me, lol.

curiousercreative commented 3 years ago

I'm unable to reproduce such low frequencies. After 15 minutes of full load on the battery profile, my steady state is 1.4-1.7 GHz, 65-67C, 1200-1500 RPM on both fans. I've not updated the power package and am just running the one I built for the PR, so I'm probably not reproducing the frequency bottoming out because of package differences, my custom fan curve (+ fan syncing), or my choice of torture test (prime95).

jackpot51 commented 3 years ago

@ZeddieXX I'll think about it. Sounds like raising the thermal throttle limit to 73C and letting it run at the lowest fan speed if it is being stressed might work out. It would still go back to fanless when it isn't under a heavy load.
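One way to read that proposal is as three temperature bands. The sketch below is only an interpretation of the idea as described in this comment, not code from system76-power or the EC firmware. The function name, the returned labels, and the assumption that "being stressed" simply means "at or above 68C" are all hypothetical.

```python
def battery_profile_action(temp_c, fanless_limit_c=68, throttle_limit_c=73):
    """Pick an action for the battery profile from the CPU temperature.

    Below fanless_limit_c: stay fanless.
    From fanless_limit_c up to throttle_limit_c: run the fan at its lowest speed.
    At or above throttle_limit_c: apply the thermal throttle.
    """
    if temp_c >= throttle_limit_c:
        return "throttle"
    if temp_c >= fanless_limit_c:
        return "lowest_fan"
    return "fanless"
```

With these defaults, a light workload that stays under 68C never starts the fan, while a stuck process would start the fan at 68C before any throttling at 73C.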

jackpot51 commented 3 years ago

@curiousercreative you have to get up to 68C before it starts reducing the frequencies. For me it was 10 minutes of this command:

stress -c 4 -i 4 -m 4 -d 4

curiousercreative commented 3 years ago

@jackpot51 yeah, I bet my lowest fan point of 65C keeps me from reproducing

jackpot51 commented 3 years ago

@curiousercreative that makes sense to me. You have to be fanless to reproduce the issue.

ZeddieXX commented 3 years ago

@curiousercreative

I was not able to follow your steps to download and compile your custom EC.

The topics of CPU frequency, temps, and fan curves are all related. Right now all of my data points use the stock fan curves (the stair-stepping ones). I'm sure that if @curiousercreative's sync'd fans and fan curves are used, the data points for CPU temps, frequencies, and fan speeds will be different, and the power modes will have to be adjusted to accommodate them (though I suspect it would give a better power/performance ratio and user experience overall).

curiousercreative commented 2 years ago

@ZeddieXX is this still a problem for you? I've encountered this throttling in the past year, but as this comment makes clear I don't think it's a supported use case for battery profile. Battery power profile to me says "we'll attempt to extend your battery life, stay cool and silent, all at the cost of performance". If we're being throttled on battery profile, switch to balanced profile.

It's worth mentioning though that only lemp9 and galp5 have these thermal throttles configured to make battery profile truly fanless. I could be wrong of course.

ZeddieXX commented 2 years ago

> @ZeddieXX is this still a problem for you? I've encountered this throttling in the past year, but as this comment makes clear I don't think it's a supported use case for battery profile. Battery power profile to me says "we'll attempt to extend your battery life, stay cool and silent, all at the cost of performance". If we're being throttled on battery profile, switch to balanced profile.
>
> It's worth mentioning though that only lemp9 and galp5 have these thermal throttles configured to make battery profile truly fanless. I could be wrong of course.

I haven't really used the laptop much since the issues with the boot entries going missing. Been waiting on the new firmware to release. I just recently (a few days ago) compiled and flashed the current main branch but I haven't had much time with the laptop. I did a fresh install of Arch, then Fedora, and now back to a fresh install of PopOS 20.10. I haven't really used it in any real capacity.

curiousercreative commented 2 years ago

@ZeddieXX After I asked whether this still affects you, it affected me! While nobody should plan to run heavy and extended workloads on the battery power profile, a hung process or something else unexpected can quickly hit the thermal throttle and leave the DE almost totally unresponsive, which makes it very difficult to switch the power profile away from battery. I suppose we can use Fn+1 to spin the fans at full speed and release the heavy thermal throttle... doesn't really solve for a machine that's unattended though :/ I may open a PR implementing this idea and see how often a light workload is silent vs just quiet.