sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0

card0 isn't a valid path, sometimes GPU is mounted as card1 #48

Closed: Vixtron closed this issue 1 year ago

Vixtron commented 1 year ago

[screenshot]

Expected behavior: GPU clocks are applied to card0 from /etc/default/amdgpu-custom-state.card0.

Actual behavior:

Nov 01 14:34:27 WS.local amdgpu-clocks[13908]: ls: cannot access '/sys/class/drm/card0/device/hwmon': No such file or directory
Nov 01 14:34:27 WS.local amdgpu-clocks[13902]: WARNING: /sys/class/drm/card0/device/pp_od_clk_voltage does not exist, skipping!

The solution could be to use the card's actual PCI path instead of the dynamically assigned cardX path.
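For illustration, here is a minimal sketch of what addressing the card by its PCI path could look like: it looks up which cardN name the kernel gave a fixed PCI device on the current boot. The function name card_for_pci, the SYSFS_ROOT override and the address 0000:03:00.0 are hypothetical placeholders (the real address can be read from lspci -D). The sketch relies only on the standard sysfs layout, where /sys/bus/pci/devices/<address>/drm/ holds the device's cardN node. On udev-based systems the same mapping is typically also exposed as /dev/dri/by-path/pci-<address>-card symlinks.

```shell
#!/usr/bin/env bash
# Sketch: map a PCI address to whatever cardN name the kernel assigned
# on this boot. SYSFS_ROOT is a hypothetical override (defaults to /sys)
# so the function can be exercised against a fake directory tree.

# card_for_pci ADDRESS
# Prints e.g. "card1" for ADDRESS "0000:03:00.0"; fails if the device
# exposes no DRM card node.
card_for_pci() {
    local addr=$1
    local root=${SYSFS_ROOT:-/sys}
    local node
    for node in "$root/bus/pci/devices/$addr/drm"/card[0-9]*; do
        # If nothing matches, the glob stays literal, so check existence.
        if [ -e "$node" ]; then
            echo "${node##*/}"
            return 0
        fi
    done
    echo "no DRM card found for PCI device $addr" >&2
    return 1
}
```

For example, card=$(card_for_pci 0000:03:00.0) would yield card0 on one boot and card1 on another, and that name could then select the matching amdgpu-custom-state file. Nothing like this is part of amdgpu-clocks; it only illustrates that the PCI address, unlike cardN, normally stays the same across reboots as long as the hardware configuration does not change.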

sibradzic commented 1 year ago

hi @Vixtron

Thanks for reporting, but this particular issue has nothing to do with this project: amdgpu-clocks does not assign cardX numbers by itself; the Linux kernel and its driver modules do that. Speaking of which, what cards do you have in your system, and which kernel and drivers are you using for those cards? Do you perhaps use an external Thunderbolt GPU enclosure, or some kind of notebook with a combination of iGPU and dGPU, or similar?

As a potential workaround, verify which of your cards you are changing clocks for, double-check that it is the same card that keeps toggling between card0 and card1, and then create /etc/default/amdgpu-custom-state.card1 as a symlink to /etc/default/amdgpu-custom-state.card0. That would ensure the same settings are applied to your card regardless of whether the kernel sees it as card0 or card1.
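The workaround amounts to a single ln -s call. A minimal sketch, assuming the state file lives in /etc/default as described above; link_state_file and the STATE_DIR override are hypothetical names used only for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the suggested workaround: expose one amdgpu-custom-state file
# under a second cardN name, so the settings are found whichever number
# the kernel picks. STATE_DIR is a hypothetical override for testing;
# the real location is /etc/default, which needs root to write.

# link_state_file [FROM] [TO]   (defaults: card0 -> card1)
link_state_file() {
    local dir=${STATE_DIR:-/etc/default}
    local src="amdgpu-custom-state.${1:-card0}"
    local dst="amdgpu-custom-state.${2:-card1}"
    if [ ! -f "$dir/$src" ]; then
        echo "$dir/$src does not exist" >&2
        return 1
    fi
    # Refuse to replace the source file with a link to itself.
    if [ "$src" = "$dst" ]; then
        echo "source and target are the same file" >&2
        return 1
    fi
    # Relative target, so the link resolves inside $dir; -f replaces any
    # existing file or link at the destination.
    ln -sf "$src" "$dir/$dst"
}
```

Run as root, link_state_file card0 card1 leaves amdgpu-custom-state.card1 pointing at amdgpu-custom-state.card0, so editing the card0 file changes both. A plain cp would also work, at the cost of keeping two files in sync by hand.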

Vixtron commented 1 year ago

I only have 1 dedicated card - RX580 and I'm using the amdgpu driver. Since I updated to kernel 6.0.5, I noticed after rebooting that my card was mounted as card1 and my clocks were not being applied.

sibradzic commented 1 year ago

> I only have 1 dedicated card - RX580

Your screenshot suggests otherwise. What does ls -alh /sys/class/drm say? And lspci?

> and I'm using the amdgpu driver

Yes, of course, but which amdgpu driver? Mainline kernel, distro-specific, PRO, something else? What does modinfo amdgpu say?

Tried the workaround?
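To answer the ls -alh /sys/class/drm and lspci questions in one go, a sketch like the following prints each cardN next to the PCI address and kernel driver behind it, so the address can be matched against lspci -D output and the RX 580 identified regardless of its number on that boot. It uses the standard sysfs symlinks cardN/device and device/driver; list_drm_cards and SYSFS_ROOT are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: print "cardN PCI-address driver" for every DRM card node,
# combining what ls -alh /sys/class/drm and lspci show separately.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) for testing.
list_drm_cards() {
    local root=${SYSFS_ROOT:-/sys}
    local card name dev drv
    for card in "$root/class/drm"/card[0-9]*; do
        name=${card##*/}
        # Skip connector entries such as card1-DP-1.
        case $name in *-*) continue ;; esac
        # Skip an unmatched glob and nodes without a backing device.
        [ -e "$card/device" ] || continue
        dev=$(readlink -f "$card/device")
        drv=none
        if [ -e "$card/device/driver" ]; then
            drv=$(readlink -f "$card/device/driver")
            drv=${drv##*/}
        fi
        printf '%s %s %s\n' "$name" "${dev##*/}" "$drv"
    done
}
```

On a notebook with an iGPU and a dGPU this would print one line per GPU, each with its own driver name; a system with a single AMD card should print a single amdgpu line.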

Vixtron commented 1 year ago

[screenshot]

lspci output:

[screenshot]

I'm running the open-source in-kernel amdgpu driver, of course: /lib/modules/6.0.5-200.fc36.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz

[screenshot]

Now you can see my GPU is mounted as card0 after I rebooted the PC; next time I reboot it will be card1 for some reason. No, I haven't tried a workaround, but someone told me to try symlinking the card1 to card0 in case it mounts itself wrong but I don't see that as a good idea. As for the PCI path, I would assume it would be the same issue.

sibradzic commented 1 year ago

> next time I reboot it will be card1 for some reason

When that happens, what do ls -alh /sys/class/drm and lspci say?

> someone told me to try symlinking the card1 to card0 in case it mounts itself wrong but I don't see that as a good idea

Someone told you what exactly? What about the potential workaround I told you about? So far it is the only idea that can help in your case; please try it.

Vixtron commented 1 year ago

Today I rebooted and it shows this: [screenshot] [screenshot]

I don't think your symlink solution will work, because the symlink will be overridden by card0 or card1 each time the PC reboots. Maybe if I could symlink the directories card1 -> card0 and card0 -> card1 it would work, but I don't know if that is possible.

sibradzic commented 1 year ago

If you bother to read it properly, you'll come to understand that I ain't suggesting symlinking any /sys/class/drm directories at all; that wouldn't make any sense...

What I am suggesting is to make a symlink (or just a plain good old copy) of an amdgpu-custom-state file, so that amdgpu-clocks tries to apply identical custom settings to both card0 and card1 every time it runs. Obviously, only one of them would succeed, depending on which identifier the driver has currently assigned to the card (it would just throw an error about the other, missing card identifier), but it should give you the result you want.
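The behaviour described above can be sketched as a dry run: for every amdgpu-custom-state.cardN file, check whether cardN currently exposes pp_od_clk_voltage, then either report the file that would be applied or print a skip warning. This is not the actual amdgpu-clocks code; plan_custom_states, STATE_DIR and SYSFS_ROOT are hypothetical names, and only the warning text is modelled on the log at the top of this issue.

```shell
#!/usr/bin/env bash
# Dry-run sketch (NOT the real amdgpu-clocks implementation): for each
# amdgpu-custom-state.cardN file, report whether cardN exists right now.
# STATE_DIR and SYSFS_ROOT are hypothetical overrides for testing.
plan_custom_states() {
    local states=${STATE_DIR:-/etc/default}
    local root=${SYSFS_ROOT:-/sys}
    local file card od
    for file in "$states"/amdgpu-custom-state.card[0-9]*; do
        # Skip an unmatched glob (and dangling symlinks).
        [ -e "$file" ] || continue
        # Suffix after the last dot, e.g. "card1".
        card=${file##*.}
        od="$root/class/drm/$card/device/pp_od_clk_voltage"
        if [ -e "$od" ]; then
            echo "apply ${file##*/} -> $od"
        else
            echo "WARNING: $od does not exist, skipping!" >&2
        fi
    done
}
```

With card1 symlinked to card0 and the GPU enumerated as card1, the card1 file is reported as applicable and card0 gets a warning; on a boot where the GPU is card0 the roles swap. That is why the error about the missing identifier is harmless here.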

walmartshopper commented 1 year ago

Using a symlink works. I have an AMD card plus the Intel iGPU, and the card numbers seem to be random on each boot. I went into /etc/default and ran ln -s amdgpu-custom-state.card0 amdgpu-custom-state.card1, and the settings get applied no matter which card number gets assigned. The downside is that it tries to apply the settings to the Intel card, fails, and then applies them to the AMD card.
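If the failed attempt on the Intel iGPU is a concern, a wrapper script could check the PCI vendor ID first: AMD devices report 0x1002 in /sys/class/drm/cardN/device/vendor, Intel devices 0x8086. A minimal sketch under that assumption; amd_cards and SYSFS_ROOT are hypothetical names, and none of this is part of amdgpu-clocks:

```shell
#!/usr/bin/env bash
# Sketch: print the cardN names that belong to AMD GPUs.
# 0x1002 is the PCI vendor ID for AMD/ATI; Intel iGPUs report 0x8086.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) for testing.
amd_cards() {
    local root=${SYSFS_ROOT:-/sys}
    local card vendor
    for card in "$root/class/drm"/card[0-9]*; do
        # Skip connector entries such as card1-DP-1.
        case ${card##*/} in *-*) continue ;; esac
        [ -r "$card/device/vendor" ] || continue
        read -r vendor < "$card/device/vendor"
        if [ "$vendor" = "0x1002" ]; then
            echo "${card##*/}"
        fi
    done
    return 0
}
```

Its output (card1 on one boot, card0 on another) could, for instance, decide which single amdgpu-custom-state.cardN name a boot-time wrapper creates, so only the AMD card is ever targeted.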