mcgillij / amdfan

Updated AMD fan control utility, forked from amdgpu-fan.
https://mcgillij.dev/pages/amdfan.html
GNU General Public License v2.0

no compatible cards found, exiting amdfan.py:199 #6

Closed (ghost closed 2 years ago)

ghost commented 3 years ago

Hello Jason, when booting, the service often fails to start via systemd (sometimes it works), with this error:

Mär 15 22:13:29 happy systemd[1]: Started amdfan controller.
Mär 15 22:13:29 happy amdfan[470]: [22:13:29] ERROR no compatible cards found, exiting amdfan.py:199
Mär 15 22:13:29 happy systemd[1]: amdfan.service: Main process exited, code=exited, status=1/FAILURE
Mär 15 22:13:29 happy systemd[1]: amdfan.service: Failed with result 'exit-code'.
Mär 15 22:13:30 happy systemd[1]: amdfan.service: Scheduled restart job, restart counter is at 1.
Mär 15 22:13:30 happy systemd[1]: Stopped amdfan controller.
Mär 15 22:13:30 happy systemd[1]: Started amdfan controller.

--- systemd tries 5 times

Mär 15 22:13:31 happy systemd[1]: amdfan.service: Scheduled restart job, restart counter is at 5.
Mär 15 22:13:31 happy systemd[1]: Stopped amdfan controller.
Mär 15 22:13:31 happy systemd[1]: amdfan.service: Start request repeated too quickly.
Mär 15 22:13:31 happy systemd[1]: amdfan.service: Failed with result 'exit-code'.
Mär 15 22:13:31 happy systemd[1]: Failed to start amdfan controller.

Any idea?

mcgillij commented 3 years ago

It could be that the 'amdgpu' driver isn't in place yet when your systemd tries to start the process. Basically it checks for the existence of /sys/class/drm/card*/device/hwmon/* (which would be created by the amdgpu driver being loaded).
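The check described above can be sketched roughly as follows. This is a minimal illustration, not amdfan's actual code: the function name `find_hwmon_dirs` and the constant `HWMON_GLOB` are hypothetical, and only the glob pattern comes from the comment above.

```python
import glob
from typing import List

# Glob pattern quoted in the comment above; these entries only exist
# once the amdgpu driver has finished initialising the card.
HWMON_GLOB = "/sys/class/drm/card*/device/hwmon/*"


def find_hwmon_dirs(pattern: str = HWMON_GLOB) -> List[str]:
    """Return all hwmon directories matching the pattern (hypothetical helper).

    An empty list corresponds to the "no compatible cards found" case.
    """
    return sorted(glob.glob(pattern))
```

Running `find_hwmon_dirs()` on the affected machine shortly after boot and again a few seconds later should show whether the directories simply appear late.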

On some systems the amdgpu driver is blacklisted at boot, so it can be loaded later. Maybe you can check your /etc/modprobe.d/blacklist.conf

ghost commented 3 years ago

Hi Jason,

I tried adding the service to graphical.target.wants, but with no success. I also think it's because card0 and hwmon are not available yet, which causes the startup failure. Sometimes it works. My startup time is pretty fast; maybe that is the problem?

systemd-analyze:
Startup finished in 14.244s (firmware) + 62ms (loader) + 1.765s (kernel) + 1.510s (userspace) = 17.582s
graphical.target reached after 1.421s in userspace

mcgillij commented 3 years ago

I don't think it's so much the speed as it is the order of things. I think you're on the right track in looking for the correct place for it to load in your systemd config, though. What distro are you using? Maybe we can figure out the order of things for it.

Either move amdfan so it starts after amdgpu is loaded, or make amdgpu load sooner (it may only be getting loaded by X or Wayland), depending on what you have going on there.

ghost commented 3 years ago

Yeah, I guess both processes are running at the same time; sometimes the amdgpu driver loads faster, and then everything is OK. I think the amdgpu driver is loaded by the kernel but only used later by X. The system is Arch with Linux 5.11.6. See this excerpt from journalctl -b:

Mär 17 08:56:06 happy systemd[1]: Queued start job for default target Graphical Interface.

[...]
[first start attempt for amdfan]
Mär 17 08:56:06 happy systemd[1]: Started amdfan controller.
[...]
[first appearance of amdgpu]
Mär 17 08:56:06 happy kernel: [drm] amdgpu kernel modesetting enabled.
[amdgpu initialisation stuff]

[amdgpu not ready yet]
Mär 17 08:56:06 happy amdfan[469]: [08:56:06] ERROR    no compatible cards found, exiting             amdfan.py:199
Mär 17 08:56:06 happy systemd[1]: amdfan.service: Main process exited, code=exited, status=1/FAILURE
Mär 17 08:56:06 happy systemd[1]: amdfan.service: Failed with result 'exit-code'.

Mär 17 08:56:07 happy systemd[1]: amdfan.service: Scheduled restart job, restart counter is at 1.
Mär 17 08:56:07 happy systemd[1]: Stopped amdfan controller.
Mär 17 08:56:07 happy systemd[1]: Started amdfan controller.
Mär 17 08:56:07 happy amdfan[575]: [08:56:07] ERROR    no compatible cards found, exiting             amdfan.py:199

Mär 17 08:56:07 happy systemd[1]: amdfan.service: Main process exited, code=exited, status=1/FAILURE
Mär 17 08:56:07 happy systemd[1]: amdfan.service: Failed with result 'exit-code'.
Mär 17 08:56:07 happy systemd[1]: amdfan.service: Scheduled restart job, restart counter is at 2.
Mär 17 08:56:07 happy systemd[1]: Stopped amdfan controller.
Mär 17 08:56:07 happy systemd[1]: Started amdfan controller.
Mär 17 08:56:07 happy amdfan[633]: [08:56:07] ERROR    no compatible cards found, exiting             amdfan.py:199

Mär 17 08:56:07 happy systemd[1]: amdfan.service: Main process exited, code=exited, status=1/FAILURE
Mär 17 08:56:07 happy systemd[1]: amdfan.service: Failed with result 'exit-code'.
Mär 17 08:56:07 happy systemd[1]: amdfan.service: Scheduled restart job, restart counter is at 3.
Mär 17 08:56:07 happy systemd[1]: Stopped amdfan controller.
Mär 17 08:56:07 happy systemd[1]: Started amdfan controller.
Mär 17 08:56:07 happy amdfan[662]: [08:56:07] ERROR    no compatible cards found, exiting             amdfan.py:199
Mär 17 08:56:07 happy systemd[1]: amdfan.service: Main process exited, code=exited, status=1/FAILURE
Mär 17 08:56:07 happy systemd[1]: amdfan.service: Failed with result 'exit-code'.

[now it seems to be ready]
Mär 17 08:56:08 happy kernel: [drm] Initialized amdgpu 3.40.0 20150101 for 0000:09:00.0 on minor 0

[amdfan succeeds!]
Mär 17 08:56:08 happy systemd[1]: amdfan.service: Scheduled restart job, restart counter is at 4.
Mär 17 08:56:08 happy systemd[1]: Stopped amdfan controller.
Mär 17 08:56:08 happy systemd[1]: Started amdfan controller.
Mär 17 08:56:08 happy amdfan[690]: [08:56:08] INFO     Starting amdfan                                amdfan.py:208

I guess the best solution would be not to exit amdfan when no card is detected, but instead to look for a card every second or so until the card is ready?

By the way, thank you for this AUR package, it's great!
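A rough sketch of what that suggestion could look like is below. It is not part of amdfan; the function name `wait_for_cards`, its parameters, and the 30-second default timeout are assumptions made for illustration, and only the sysfs pattern is taken from earlier in this thread.

```python
import glob
import time
from typing import Callable, List

# sysfs pattern mentioned earlier in this thread
HWMON_GLOB = "/sys/class/drm/card*/device/hwmon/*"


def wait_for_cards(
    pattern: str = HWMON_GLOB,
    timeout: float = 30.0,   # assumed upper bound; give up after this long
    interval: float = 1.0,   # "every 1 second or so"
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll for hwmon directories until one appears or the timeout expires.

    Returns the matching paths, or an empty list if none showed up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = sorted(glob.glob(pattern))
        if found:
            return found
        if time.monotonic() >= deadline:
            return []
        sleep(interval)
```

A caller could still exit with the existing error when the returned list is empty, so the "exit if no cards are found" behaviour would be kept and only delayed.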

mcgillij commented 3 years ago

Yeah, I'm running the same kernel and distro, and I've never had it fail to start with either my Vega 64 or 6800 XT.

A couple of things from the Arch wiki. Regarding GRUB/the bootloader: make sure you do not have nomodeset or vga= as a kernel parameter, since amdgpu requires KMS.

https://wiki.archlinux.org/index.php/Kernel_mode_setting#Early_KMS_start Since we know that the problem is the kernel module not being loaded in time, we can load it early.

ghost commented 3 years ago

I'm using EFISTUB and neither of the options nomodeset or vga. I just tried it with early KMS, and amdgpu now seems to be loaded before amdfan:

[amdgpu ready]
Mär 17 13:18:35 happy kernel: [drm] Initialized amdgpu 3.40.0 20150101 for 0000:09:00.0 on minor 0
[...]
Mär 17 13:18:35 happy systemd[1]: Started amdfan controller.
Mär 17 13:18:36 happy amdfan[491]: [13:18:36] INFO Starting amdfan amdfan.py:208
[...]

As we can see, this loads amdgpu early enough for it to be fully available via hwmon when amdfan is started by systemd. Thank you for your support.

I still suggest the change mentioned above, with a check every second, since others may run into the same issue. With it, amdfan would always run out of the box and there would be no need to edit mkinitcpio.conf.

mcgillij commented 3 years ago

Yeah I'll consider it.

I'll see if maybe I can get systemd to retry for longer, maybe set a longer check interval.

I do think the behavior of the application is correct: it should exit if no cards are found. It's just a matter of ironing out how to get it to load after the kernel module. Thanks for reporting the issue, though; I appreciate it.

ghuser0123 commented 2 years ago

I had the same issue.

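# amdgpu-fan.service
[Unit]
Description=amdgpu fan controller
# Allow at most 10 start attempts within 225 seconds before giving up
StartLimitInterval=225
StartLimitBurst=10

[Service]
ExecStart=/usr/bin/amdgpu-fan
Restart=always
# Wait 2 seconds before each restart
RestartSec=2

[Install]
WantedBy=multi-user.target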

This seemed to fix it: systemd now retries a few more times, with a delay before each retry.
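To see why a longer delay helps, here is a back-of-the-envelope estimate of how long systemd keeps retrying before it reports "Start request repeated too quickly". It is a simplified model, not systemd's actual accounting: the function `retry_window` is hypothetical and ignores the time the process itself runs. It also assumes the original amdfan unit relied on systemd's documented defaults (RestartSec=100ms, StartLimitBurst=5, StartLimitIntervalSec=10s), which the thread does not confirm.

```python
import math


def retry_window(restart_sec: float, burst: int, interval_sec: float) -> float:
    """Approximate seconds between the first and the last allowed start attempt.

    Simplified model: attempts are spaced exactly restart_sec apart.
    Returns math.inf when the attempts are spread out enough that the
    start limit (burst starts per interval_sec) is never reached.
    """
    if burst * restart_sec >= interval_sec:
        return math.inf
    return (burst - 1) * restart_sec


# Assumed systemd defaults: under half a second of retrying in this model.
default_window = retry_window(restart_sec=0.1, burst=5, interval_sec=10)

# Values from the unit file above: roughly 18 seconds of retrying.
tuned_window = retry_window(restart_sec=2, burst=10, interval_sec=225)
```

In the logs earlier in the thread, amdgpu needed about two seconds after amdfan's first attempt to finish initialising, which is longer than the default window in this model but well within the tuned one.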