tuxedocomputers / tuxedo-control-center

A tool to help you control performance, energy, fan and comfort settings on TUXEDO laptops.
GNU General Public License v3.0

TCC prevents dGPU from going to D3cold state #360

Closed salihmarangoz closed 7 months ago

salihmarangoz commented 9 months ago

PCI-Express Runtime D3 (RTD3) Power Management is a very important feature for achieving longer battery life (driver readme). D0 is the highest power state, and the device may be switched to lower-power states to save energy. With RTD3, it is possible to power off the dGPU entirely by putting it into the D3cold state. Print the current state via this command:

$ cat /sys/bus/pci/devices/0000:01:00.0/power_state
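The same check can be scripted. The following is a minimal sketch, assuming the dGPU sits at PCI address 0000:01:00.0 as in the command above (the address may differ on other machines); the function name `read_power_state` and the `sysfs_root` parameter are illustrative and not part of TCC.

```python
from pathlib import Path
from typing import Optional


def read_power_state(
    pci_address: str = "0000:01:00.0",
    sysfs_root: str = "/sys/bus/pci/devices",
) -> Optional[str]:
    """Return the PCI power state string (e.g. "D0" or "D3cold").

    Returns None if the sysfs attribute cannot be read, for example
    because the device address is wrong or the kernel does not expose it.
    """
    path = Path(sysfs_root) / pci_address / "power_state"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print(read_power_state())
```

The `sysfs_root` parameter only exists so the function can be pointed at a test directory instead of the real sysfs tree.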

The problem is that TCC keeps the dGPU awake all the time. Exiting TCC is not enough. I had to run sudo service tccd stop to let the dGPU switch to the D3cold state. After I started the tccd service again, it was still in D3cold, but after opening TCC again it switched to the D0 state.

This is frustrating because I lower CPU freq using TCC for battery saving, but it wastes energy by keeping the dGPU awake.

Maybe related to #341

tuxedoxt commented 9 months ago

Hello,

leaving the TCC dashboard should be enough to stop polling dGPU for info and let it go to d3cold. Going to the dashboard alone should also not wake the dGPU if at the time it's in d3cold.

Does this fit? Any other behaviour I would characterize as a bug.

silles79 commented 9 months ago

For me, opening the dashboard does change the state from d3cold to d0 and keeps it there until I close the dashboard, after which it goes back to d3cold.

salihmarangoz commented 9 months ago

I used forkstat to catch newly created processes. I see this process being spawned routinely: /bin/sh -c nvidia-smi --query-gpu=power.draw,power.max_limit,enforced.power.limit,clocks.gr,clocks.max.gr --format=csv,noheader
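For illustration, one way a poller could avoid waking a sleeping dGPU is to check the sysfs power state before spawning nvidia-smi at all. This is a hypothetical sketch, not how TCC is implemented: the names `query_dgpu_if_awake`, `read_state` and `run_query` are made up for the example, and only the nvidia-smi arguments are taken from the process seen above.

```python
import subprocess
from typing import Callable, Optional

# Same query as the process observed with forkstat.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=power.draw,power.max_limit,enforced.power.limit,clocks.gr,clocks.max.gr",
    "--format=csv,noheader",
]


def run_nvidia_smi() -> str:
    """Run nvidia-smi once and return its CSV output line(s)."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def query_dgpu_if_awake(
    read_state: Callable[[], Optional[str]],
    run_query: Callable[[], str] = run_nvidia_smi,
) -> Optional[str]:
    """Query the dGPU only when it already reports D0.

    read_state is any callable returning the PCI power state string
    (assumption: something that reads the sysfs power_state file).
    Returns None without touching the GPU in any other state.
    """
    if read_state() != "D0":
        return None
    return run_query()
```

Passing the state reader and the query runner in as callables keeps the guard logic testable on a machine without an NVIDIA GPU.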

And yes, the TCC GUI is running in the background. But it doesn't happen all the time. I think I found a behavior that may be important to discover the bug. Maybe you can also reproduce it:

  1. Open TCC.
  2. Wait for CPU and iGPU values to refresh.
  3. Switch to the dGPU tab.
  4. Wait for a few seconds and then close the window.
  5. Sometimes it keeps calling nvidia-smi in the background.

Screen record: https://www.youtube.com/watch?v=rfK6HyMgoCM

salihmarangoz commented 9 months ago

Maybe implementing a heartbeat technique for checking that hardware information subscribers are alive would be a solution. For example, if there is no heartbeat from the GUI for 10 seconds, the daemon can assume that the GUI is killed/crashed or unable to respond.
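A minimal sketch of that heartbeat idea, assuming the daemon records a timestamp each time the GUI checks in and stops collecting once the last check-in is older than 10 seconds. The class and method names are hypothetical and do not correspond to anything in tccd.

```python
import time
from typing import Callable, Optional


class HeartbeatWatchdog:
    """Tracks whether a subscriber (e.g. the GUI) has checked in recently."""

    def __init__(
        self,
        timeout_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so tests need not sleep
        self._last_beat: Optional[float] = None

    def beat(self) -> None:
        """Called whenever the subscriber sends a heartbeat."""
        self._last_beat = self._clock()

    def subscriber_alive(self) -> bool:
        """True only if a heartbeat arrived within the timeout window."""
        if self._last_beat is None:
            return False
        return (self._clock() - self._last_beat) < self._timeout_s


def should_collect_dgpu_metrics(watchdog: HeartbeatWatchdog) -> bool:
    """The daemon's polling loop would ask this before touching the dGPU."""
    return watchdog.subscriber_alive()
```

With this shape, a GUI that is killed or crashes simply stops calling `beat()`, and the daemon stops polling on its own after the timeout without needing an explicit "off" message.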

I checked the code, but I really don't know Electron. However, after seeing many async functions, this came to my mind as a solution.

tuxedoder commented 9 months ago

The expected behavior is that during initial startup the tcc wakes up the dGPU, and metrics are then collected because the dGPU is awake. The initial wakeup of the dGPU seems to be caused by Electron and not by the code itself. Once open, minimizing the window or leaving the dashboard for another part of the tcc should disable collection of metrics. Reopening the minimized application will then not wake up the dGPU in the dashboard if it is in d3. Here are some videos with 22.04.4 LTS via our FAI.

https://github.com/tuxedocomputers/tuxedo-control-center/assets/160256398/cca680fa-f8e4-465e-a007-e1da2597539b

It is odd that it collects data once the tcc is closed and that is indeed a bug. It is a bit hard to reproduce, but I can replicate it after several attempts. I need more time to analyze and think about it.

https://github.com/tuxedocomputers/tuxedo-control-center/assets/160256398/dac478b6-7509-421d-8b8b-6a31e0cb0c28

silles79 commented 9 months ago

Did some testing and it looks like the UI wakes up the dGPU, but only "closes" it when u switch back to iGPU tab and close it.

exiting the app while dGPU tab is open doesn't stop polling the dGPU. Either u have to kill the tccd daemon or start the UI and switch back to iGPU and exit UI

so I'd say this is a bug

tuxedoder commented 9 months ago

> but only "closes" it when u switch back to iGPU tab and close it.

The dashboard does not differentiate between which tab is visible. Data is collected for all gauges while the dashboard component is visible to ensure a seamless transition between tabs.

> u have to kill the tccd daemon

That resets tccd to default values, which is off for dGPU data collection.

> It is odd that it collects data once the tcc is closed and that is indeed a bug. It is a bit hard to reproduce, but I can replicate it after several attempts. I need more time to analyze and think about it.

As a small update, debugging was not easy because I could not consistently reproduce it. Adding more verbose debug logging seemingly fixed the issue, which made it harder for me to analyze the code. The dashboard component is calling the required functions and turns off the data collection. However, sometimes the dbus does not show that a value was actually set. I think tcc sometimes terminates before the signal is sent to tccd, so collection is not always turned off. It appears to be a race condition.

I have considered various solutions, and a fix should arrive soon. To summarize the current idea, I plan to wait in Electron's close event for a tccd value to ensure the data collection status is set correctly in normal operation. Additionally, I plan to add a timeout in tccd that automatically turns off data collection if the GPU dbus functions are not called, so the status stays correct if tcc crashes or closes unexpectedly.
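The "wait in the close event" part of this plan can be sketched as follows. This only illustrates the pattern (send the "off" request, then wait with a bound until the daemon reports the new value before exiting) and is not the actual fix shipped in TCC. The `set_enabled` and `get_enabled` callables are hypothetical stand-ins for the dbus calls, and the timeout values are arbitrary.

```python
import asyncio
from typing import Awaitable, Callable


async def disable_collection_before_exit(
    set_enabled: Callable[[bool], Awaitable[None]],
    get_enabled: Callable[[], Awaitable[bool]],
    timeout_s: float = 2.0,
    poll_s: float = 0.05,
) -> bool:
    """Ask the daemon to stop collecting and wait until it confirms.

    Returns True once the daemon reports collection as off, or False if
    no confirmation arrived within timeout_s. The caller should only
    let the application exit after this coroutine has returned.
    """
    await set_enabled(False)

    async def wait_until_off() -> None:
        # Poll the daemon's view of the flag instead of trusting that
        # the request above was delivered.
        while await get_enabled():
            await asyncio.sleep(poll_s)

    try:
        await asyncio.wait_for(wait_until_off(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False
```

The bounded wait means a hung daemon cannot block the window from closing forever; the daemon-side timeout described above would then cover the case where confirmation never arrives.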

tuxedoder commented 8 months ago

As a small update, I tried to put various things into the next release and it got a bit delayed. Maybe this week if things go well.

tuxedoder commented 7 months ago

Should be fixed in 2.1.8.