sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0
390 stars 43 forks source link

Auto card discovery #60

Closed husam212 closed 2 months ago

husam212 commented 3 months ago

This is a feature request, not a bug.

The card index cardX is expected to change sometimes with kernel updates, there's only one card in the system in most use cases. It would be nice if it automatically detects the card and apply a non-indexed config file /etc/default/amdgpu-custom-states.

To prevent applying incorrect config, maybe this detection should only work if there's a single card and a non-indexed config, otherwise it falls back to manual configuration.

sibradzic commented 3 months ago

Hi @husam212, thanks for your feedback.

I totally understand, it is pretty frustrating when the same card pops up with a different identifier, especially with prevalence of iGPUs inside CPU IO die. I've personally been hit with the issue of my dGPU cardX identifier being randomly assigned on each reboot.

However, something has changed with kernel 6.7 (or was it 6.8?), and now my iGPU always gets assigned with card0 and dGPU always ends up assigned with card1. Heck, even when iGPU is completely disabled in BIOS, my dGPU gets card1. I was too lazy to dig through the kernel commits to check exactly why is this the case, but seemingly it just works now, cards are no more randomly assigned... Could it be that kernel finally stabilized its API in this regard, like they once did to get rid of ethX enumeration mess?

So, in order to prevent the complexity creeping up into this super simple project, before I consider some possible workarounds, let me ask you few things first...

  1. How exactly does the kernel enumerate your card right now?
  2. Which distro & kernel versions have you tried?
  3. Which kernel versions changed the cardX enumeration behavior for you?
sibradzic commented 2 months ago

@husam212 ping?

husam212 commented 2 months ago

Hi @sibradzic sorry I missed your reply notification.

I have one dGPU only, the kernel used to assign it card0 and after some update it started assigning it card1. My use case is undervolting and downclocking to lower temps, at some point I figured the temps are higher than usual, that lead me to discover the GPU is no longer card0, I don't really know at which version update it flipped to card1. The kernel I'm using now is 6.8.7.

For now I added a symlink for card1 in case the kernel decided to assign card0 for some reason.

sibradzic commented 2 months ago

Yeah, same thing here, dGPU only (iGPU disabled in BIOS), was card0 and now is consistently card1. Let's see if anything changes with kernel 6.9 and address this accordingly.

In the meantime, I've worked out this patch that allows specifying custom state file with a card's PCI address suffix in its name, for example: /etc/default/amdgpu-custom-state.pci:0000:03:00.1. This way the custom state should always point to correct cardX, no matter how the card gets enumerated by kernel on boot. Feel free to test it out:

diff --git a/amdgpu-clocks b/amdgpu-clocks
index b26fa1f..b2e145e 100755
--- a/amdgpu-clocks
+++ b/amdgpu-clocks
@@ -217,10 +217,15 @@ function clear_vars() {

 check_ppfeaturemask

-for USER_STATE_FILE in "${USER_STATES_PATH}"*.card* ; do
+#TODO: update README with example about /etc/default/amdgpu-custom-state.pci:0000:03:00.1
+for USER_STATE_FILE in "${USER_STATES_PATH}"*.card* "${USER_STATES_PATH}"*.pci* ; do
   BACKUP_STATE_FILE=/tmp/${USER_STATE_FILE##*/}.initial

-  sys_device=/sys/class/drm/${USER_STATE_FILE##*.}/device
+  if [[ "${USER_STATE_FILE}" == *"pci"* ]]; then
+    sys_device=$(dirname /sys/class/drm/card*/device/consumer:${USER_STATE_FILE#*.})
+  else
+    sys_device=/sys/class/drm/${USER_STATE_FILE##*.}/device
+  fi
   sys_hwmons=( "${sys_device}"/hwmon/hwmon[0-9] )

   SYS_PP_OD_CLK=${sys_device}/pp_od_clk_voltage
husam212 commented 2 months ago

I checked my laptop which has a single Intel iGPU, it's assigned card1 too, so I think the kernel is not consistent in indexing GPUs.

The patch sounds like a good simple fix, I will give it a try.

The patch would still break if the GPU is moved to another PCI slot, right? I was thinking maybe the combination of device/vendor and device/device would be globally unique identifier for the GPU. That's probably a more complicated fix though.

sibradzic commented 2 months ago

The patch would still break if the GPU is moved to another PCI slot

Of course, but then how often would one move a card to different slot in reality? Also, it would almost never happen for an iGPU.

I was thinking maybe the combination of device/vendor and device/device would be globally unique identifier for the GPU

I don't think that would be the case. Imagine that you have two or more cards with identical PCI IDs, which is a typical scenario on multi-GPU servers, there would be no way to tell which card is which, solely based on device/vendor and device/device...

husam212 commented 2 months ago

I've been using the patch for a while now, it's working great. I guess the PCI address method solves the issue in the majority of cases. Thanks.

Will this patch get merged into master?

sibradzic commented 2 months ago

@husam212 merged. It turned out to be a bit more complex than the patch you tried earlier. Please try the latest master :bow: .