pop-os / system76-power

Power profile management for Linux
GNU General Public License v3.0
594 stars 74 forks source link

CPU usage increases over time and eventually causes audio to stutter #360

Open leviport opened 2 years ago

leviport commented 2 years ago

Distribution (run cat /etc/os-release): Pop!_OS 22.04

Related Application and/or Package Version (run apt policy $PACKAGE NAME): Presumably system76-power, version unknown

Issue/Bug Description: 2022-09-14_15-38 https://twitter.com/emptythevoid/status/1570150727300923393

Steps to reproduce (if you know): Not entirely clear, but presumably the machine just has to be left running with music playing/looping in the background for a while? The time frame to reproduce the bug isn't clear to me. It looks like it happens pretty quickly in that video, but prior uptime is unknown.

Expected behavior: system76-power should not hog CPU, and audio playback should not stutter

Other Notes: I assume this is on a kudu6, judging by the core count in powertop in the video.

emptythevoid commented 2 years ago

Hey, this is Greg. Thank you for creating the issue in github. I'll clarify what I can.

Steps to reproduce: Leave PopOS running for some time. I've experienced this problem becoming noticeable in different timeframes. After a restart, it usually takes at least a full day to become noticeable, although I've also experienced it becoming a problem only after two days (although rare). I'm running in nvidia-only mode when I notice this, but I don't know if it happens when it's running integrated-only.

This is, indeed, a Kudu 6. This issue has been present since I performed a fresh install of PopOS 22.04 (I originally installed Ubuntu Mate as soon as my laptop arrived and did not experience this behavior. It was only after PopOS was installed). That's been about four months now.

I didn't make the connection to the system76-power until a couple days ago. Since it was audio-related, I've tried all manner of tweaks and adjustments to pipewire, but these didn't seem to have any noticeable impact. I've even disabled the system76-scheduler to see if that helped. I just happened to notice that the system76-power was relatively busy one night when the audio was degraded, so I thought I'd kill it and it instantly improved. This behavior also occurs when using a USB headset, regular wired headphones, or the speakers. I do lock my computer often throughout the day, and systemctl status system76-power shows a DBUS Received GetGraphics() method and GetProfile() method each time this happens, if that means anything useful.

Other notes: My laptop usually stays tethered to a Dell WD19S dock with two monitors, but the video I posted to twitter was done without the dock while running on battery. Still, the computer was on for the full day today, and the stuttering only started this evening. I haven't been observing system76-power long, but the highest I've ever seen it go was about 23% CPU according to htop. If there is something I can do to get more information about this process when it happens, I'll be very happy to test.

emptythevoid commented 2 years ago

Here's a screenshot of an strace on system76-power just after a reboot. It's using between 0% and 0.7% CPU currently.
Screenshot from 2022-09-14 20-33-08

emptythevoid commented 2 years ago

My computer has been left on overnight, locked. Just before I went to bed, system76-power was between 0.7% and 1.3%. This morning, it's bumped to between 2.0% and 4.6%. Systemd reports that's about 12 full hours.

mmstick commented 2 years ago

I'm looking for a possible cause. Any difference in strace output after that day?

emptythevoid commented 2 years ago

I have no idea what I'm looking for, so I don't know what's different. Here's a screenshot of what it looks like today.
Screenshot from 2022-09-15 11-24-42

Guide me on how I can get you more information.

Edit: system76-power is currently between 2.6% and 5.9% as of this strace screenshot, Current uptime is about 15 hours. systemd reports 18 minutes of CPU time for system76-power.

mmstick commented 2 years ago

Maybe you've noticed it outputting something more frequently than before

mmstick commented 2 years ago

I think it's safe to assume that it's not the fan daemon or hotplug detection since these are only executed once per second. I don't think strace will be useful. May need to run this through a proper profiling tool.

emptythevoid commented 2 years ago

It looks like it's doing openat(AT_FDCWD for /sys/class/hwmon/hwmon 1-5, does some lseek and reads, a couple of clock_gettime, one write, an epoll_wait(3 , then a couple more clock_gettime. This seems to look the same as when I started it, but I'm not sure anything would stand out to me unless it looked significantly different. htop is polling the strace every second, and it looks like the all the stuff I listed above also happens every second.

mmstick commented 2 years ago

Yeah it might be more helpful to get output from a flamegraph using a tool like this. It's most likely some application-level code.

emptythevoid commented 2 years ago

Is this of any use?
flamegraph

mmstick commented 2 years ago

No, doesn't contain any function calls

emptythevoid commented 2 years ago

Here's one I ran for a couple seconds. Does this help, or still useless flamegraph

mmstick commented 2 years ago

No, no functions from system76-power itself

emptythevoid commented 2 years ago

I'm running the command as sudo ~/.cargo/bin/flamegraph --bin=system76-power daemon

Should I be doing something different?

mmstick commented 2 years ago

cargo flamegraph --root daemon from the project directory with the system76-power daemon stopped in advance.

emptythevoid commented 2 years ago

Is this heading in the right direction? This was generated from in the .cargo/bin directory with the command: flamegraph --root -- /usr/bin/system76-power daemon

flamegraph

mmstick commented 2 years ago

No, it requires building system76-power from source code to have all the debugging symbols and function call information.

emptythevoid commented 2 years ago

Oooooh, okay. Do you want me to try to do that? If so, what compilation command and flags should I use?

emptythevoid commented 2 years ago

I'm really sorry. I'm still trying to figure out what I'm doing.

See if this is any better. I edited the makefile to enable debugging (I think) and compiled it. The binary went into the debug directory, so I'm assuming that got it.

I navigated to the debug directory and ran: flamegraph --root -- ./system76-power daemon

And got this graph flamegraph

mmstick commented 2 years ago

This is good. Let it run for several hours and compare the results tomorrow to see if something stands out more as time goes on.

emptythevoid commented 2 years ago

You got it. Thank you. So sorry for the confusion

emptythevoid commented 2 years ago

Left my computer running over night. system76-power was using about 5.6% at the time I stopped it. flamegraph collected about 17GB of perf data and took about an hour-and-a-half to generate the graph. Here it is. flamegraph

mmstick commented 2 years ago

Seems that it's entirely hotplug detection. Unusually high CPU demand from that codepath. Will investigate optimizations.

emptythevoid commented 2 years ago

Okay, so let me add to that point. I'm using a WD19S USB-C dock with two monitors (one is connected by USB-C to the dock, the other is by display port). Now, I don't notice any issue with this set up, and the systemd status for system76-power doesn't show a hotplug event unless I unplug the dock and plug it back in. However, the very few times I use Discord, it will occasionally and randomly mention that a new Nvidia HDMI audio source was detected, and I never use that for audio. This may be a red herring, but it's the only anomaly I can think of that might be related at all. Please let me know if there is anything else I can do to assist.

mmstick commented 2 years ago

Every second the service checks for display hotplug events. I'll try to reduce the amount of reads that it's performing.

mmstick commented 2 years ago

Try running the perf branch and see if that improves CPU usage.

emptythevoid commented 2 years ago

This was from system76-power (perf branch) running over night. By the time I stopped flamegraph, it collected about 45GB of data. Edit: the system76-power process was using just about 6% CPU at the time I stopped it

flamegraph

emptythevoid commented 9 months ago

Any update on this? I'm still experiencing this on my Kudu. I have to manually stop system76-power after booting up, and then because that's not running, I have to manually change to integrated/nvidia. It's not terrible, but it's very annoying to have to keep doing this, and I'm really hoping for a fix.