pcengines / apu2-documentation

Documentation and scripts for building and adjusting PC Engines APU2 firmware
https://pcengines.github.io/apu2-documentation/
208 stars 46 forks source link

[APU2] Kernel error: gpio-keys-polled gpio-keys-polled: unable to claim gpio 0, err=-517 #204

Open silentcreek opened 3 years ago

silentcreek commented 3 years ago

Hi,

after upgrading my APU2 from Debian 10 (Kernel 4.19) to Debian 11 (Kernel 5.10) I'm getting a couple of these error messages at boot time: kernel: gpio-keys-polled gpio-keys-polled: unable to claim gpio 0, err=-517

I'm also noticing that the first LED stays off which was not the case before.

I found a forum discussion [1] that suggests that this issue should be fixed in Linux 5.3 and that this is likely a module ordering issue. The former can't be true since my kernel is newer than 5.3, obviously. And as for the latter: I tried explicitely loading the mentioned modules leds-gpio and gpio_amd_fch in different orders by adding a cofiguration file in /etc/modules-load.d/ but that didn't work.

I also tried upgrading to the latest coreboot version v4.14.0.4 (was on some 4.12.x release before), but that didn't help either.

Does anyone know how to fix this? Thanks!

Timo

[1] http://pcengines.spruz.com/forums/?page=post&id=F1D75D99-D369-4310-B098-2F222A3CDBAC&fid=1A77794F-FF7D-44CA-AF64-CAA2588102ED

miczyg1 commented 3 years ago

HI, we have just recently noticed the issue. I doubt it is a module ordering issue, because the ACPI GPIOs definitions used by gpio-keys-polled come from the GPIO controller which binds to gpio_amd_fch, so one shouldn't work without the other. THe real problem lies within the Linux kernel itself... There is a separate driver added to the kernel for apus: https://github.com/torvalds/linux/blob/master/drivers/platform/x86/pcengines-apuv2.c So basically the error indicates that these GPIOs have been claimed by the driver already. I think we have no other choice but to remove the ACPI bindings for the GPIOs

silentcreek commented 3 years ago

Okay, thanks. I think I understand now... So, we have the modules pcengines_apuv2 and leds_gpio (or gpio_keys_polled?) fighting over the gpio control of these leds. That means as a workaround, we could simply blacklist pcengines_apuv2 until a better fix lands either in the kernel or firmware. I actually just tried that (blacklisting pcengins_apuv2) and the error messages are gone now, while the leds are still operational (and the first led is now on again, once the APU2 is booted). So, that seems to work. Is there any crucial functionality missing if I blacklist this module? I'm guessing no, since the module wasn't available on my older Debian 10 kernel anyway and looking at the code you linked, the only other thing that seems interesting to me is the front button, but I have never used it before either, so I guess I could sacrifice it for a temporary workaround.

miczyg1 commented 3 years ago

There is no difference between ACPI definitions of the buttons and leds and the kernel driver. Bot support the leds and front button in the same manner. However:

There is no good-for-all solution in this case. If you use the latest firmware, then you are probably better blacklisting the kernel driver, because ACPI will handle it anyway.

silentcreek commented 3 years ago

There is no difference between ACPI definitions of the buttons and leds and the kernel driver. [...] If you use the latest firmware, then you are probably better blacklisting the kernel driver, because ACPI will handle it anyway.

Just to clarify: By kernel driver you mean the module pcengines_apuv2?

miczyg1 commented 3 years ago

Just to clarify: By kernel driver you mean the module pcengines_apuv2?

Indeed

silentcreek commented 3 years ago

@miczyg1 Is the newly released firmware v4.14.0.5 supposed to fix this issue? Going by the release notes, I assumed so, but apparently, it does not.

I installed v4.14.0.5 today. After rebooting, the LEDs stay turned off and there are no sysfs entries for them anymore – which is to be expected, because I had blacklisted the pcengines_apuv2 module as a workaround before. If I undo this blacklisting configuration, the LEDs are functional again, but the kernel error messages from gpio-key-polled show up again. So maybe the ACPI GPIO entries are not the root cause for these error messages?

miczyg1 commented 3 years ago

@silentcreek we will be testing the pcengines_apuv2 again to let you know. @mkopec did the errors occur during the removal of ACPI bindings?

bsdice commented 3 years ago

Hey there, leaving this here for posterity and people finding it via search.

Just received one of the last APU 4D4 right before the Great Shortage. Flashed last available apu4_v4.14.0.5.rom to it. Since I prefer stability and need some patches, I compiled long-term supported kernel 5.4.156 on Arch. With this version LED control is unavailable because of the ACPI GPIO change in apu4_v4.14.0.5.rom vs apu4_v4.14.0.4.rom. Took me a fair bit of digging to figure this out. I recommend on page https://pcengines.github.io/index.html in known issues a new item:

Support for 5.4 will end only in December 2025 according to https://en.wikipedia.org/wiki/Linux_kernel_version_history so a fair bit of life is left in this one. LEDs will show up in /sys/class/leds/ as per usual. I added ledtrig-heartbeat and ledtrig-netdev to /etc/modules-load.d/apu.conf and removed all blacklisted modules in /etc/modprobe.d, they are not needed.

Example:

echo 1 > "/sys/class/leds/apu4:green:led1/brightness" echo heartbeat > "/sys/class/leds/apu4:green:led2/trigger" echo disk-activity > "/sys/class/leds/apu4:green:led3/trigger"

Happy blinking. ;-)

silentcreek commented 3 years ago

@bsdice Have you tried loading the module pcengines_apuv2 or is your custom kernel not compiled with CONFIG_PCENGINES_APU2?

bsdice commented 3 years ago

@bsdice Have you tried loading the module pcengines_apuv2 or is your custom kernel not compiled with CONFIG_PCENGINES_APU2?

@silentcreek I have CONFIG_PCENGINES_APU2=m but it is of no use because if you look at https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/platform/x86/pcengines-apuv2.c?h=linux-5.4.y there is no support for "apu4" in 5.4. There is an even older driver, https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/leds/leds-apu.c?h=linux-5.4.y which supports only APU1. This is also why on kernel 5.4 I get no conflicts when using a APU 4D4 board, simply not supported by the specialized drivers.

For newer kernels indeed it would appear that on BIOS 4.14.0.4 and below one could blacklist pcengines-apuv2 module and rely on ACPI support of leds-gpio. Whereas on BIOS 4.14.0.5, ACPI is no longer supported and you need a more recent kernel >= 5.5 and the updated pcengines-apuv2 module with APU4 support, see https://github.com/torvalds/linux/commit/3d00da1de3ea36ba44f4a7ba76c8c8b16f98204b#diff-c49c6019d36475a9c027f716107d819089d60cb74455e851cad9bf62f4d1e643

S1 seems available through /proc/interrupts on my system and raw GPIO access to switch over SIM slots etc. is also there by using /sys/class/gpio/gpiochip320/ngpio et al.

silentcreek commented 3 years ago

@bsdice Ah, sorry, I completly disregarded the fact that you have a different model, namely the APU4. Then I guess that change between v4.14.0.4 and v4.14.0.5 really did more harm than good and should be reverted after all...

mkopec commented 3 years ago

I have done some testing with the the pcengines-apuv2 and came to the conclusion that the driver does not in fact conflict with the ACPI bindings, and the the error is caused by the gpio-keys-polled driver not implementing the interface that the pcengines-apuv2 driver uses to export the GPIOs. The leds-gpio driver implements this interface and there are no errors from the leds-gpio driver in firmware v4.14.0.4, there are just duplicate entries in /sys/class/leds.

This issue seems to be fixed in the pcengines-apuv2 driver author's personal Linux fork: https://github.com/metux/linux/commits/apu2/defconf-5.5.0/drivers (see the last 3 commits), unfortunately not all of their fixes have been accepted or submitted into the kernel yet.

This, combined with the fact that the driver is only present in recent kernel versions, makes me think that the ACPI bindings should be restored.

@miczyg1 what do you think?

miczyg1 commented 3 years ago

@mkopec in such case we need to revert the changes which removed the ACPI bindings and explain in the documentation that the kernel driver is buggy and shall be blacklisted. Later we may try to update the status of these commits and let people know if these 3 commits work or not.

macpijan commented 2 years ago

The documentation was updated here: https://pcengines.github.io/apu2-documentation/linux_issues/

bsdice commented 2 years ago

Just leaving this here for folks finding this through a search engine...

People running Arch may want to check out my Linux kernel 5.4 build script here: https://seitics.de/files/linux-5.4/

Features:

Contact info is at the top of the domain. I've uploaded a test kernel+include package for Arch so it is easy to try. No warranties. I use it on Arch on APU4D4.

metux commented 1 year ago

Hi,

speaking as the official kernel maintainer of the apu board family driver.

Sorry for replying so late, but i just found this ticket only by accident - obviously nobody here ever considered reporting a Linux kernel issue to the official maintainer.

after upgrading my APU2 from Debian 10 (Kernel 4.19) to Debian 11 (Kernel 5.10) I'm getting a couple of these error messages at boot time: kernel: gpio-keys-polled gpio-keys-polled: unable to claim gpio 0, err=-517

Due lack of exact information on your Kernel configuration, I can only guess:

a) you still got the old (meanwhile removed) apu1 driver running, and these two are conflicting (not sure whether Debian ever enabled the old one). Solution: kernel upgrade or delete the old driver

b) you catched the intermediate BROKEN firmware versions, which added some arbitrary, unthoughtful and incomplete "gpio support" to the ACPI tables. This one highjacks the driver or a completely different device, just similar enough (both have some common ancestor) to make some things work by accident. But this one is DANGEROUS: it hands out the whole address space of what looks like a gpio bank as gpio devices, even though many of the registers are something entirely different, some gpios are used internally in the SoC (chip designers forgot to unwire the registers). Do NOT try to use these gpios, unless you know exactly what you're doing - you may break your hardware ! (that actually happened to me, that's why I wrote the new driver and mapped only the few documented and successfully tested gpios).

I'm also noticing that the first LED stays off which was not the case before.

Works as designed. The kernel should never mess w/ gpios or leds on its, unless there's a special need for that. The LEDs don't have any hardcoded function, the decision is entirely user/application specific.

I found a forum discussion [1] that suggests that this issue should be fixed in Linux 5.3 and that this is likely a module ordering issue. The former can't be true since my kernel is newer than 5.3, obviously.

Yes, there was a problem that leds_gpio or gpio_amd_fch might not have been loaded, when pcengines-apv2 tries to initialize them. Only happens if these are compiled as modules. You can test this by loading them in the correct order manually (command line):

  1. gpio_amd_fch
  2. leds-gpio
  3. pcengines-apuv2

And you should blacklist (or not compile) amd-pinctrl - this is the wrong driver, which might be triggered by broken ACPI tables.

And as for the latter: I tried explicitely loading the mentioned modules leds-gpio and gpio_amd_fch in different orders by adding a cofiguration file in /etc/modules-load.d/ but that didn't work.

Do some gpio or led devices appear w/o loading the apu board driver ? Is there some ACPI device "AMD0030" ?

In this case, you're one of the unlucky ones who got a broken firmware (see above). You should certainly blacklist pinctrl-amd.

--mtx

metux commented 1 year ago

HI, we have just recently noticed the issue. I doubt it is a module ordering issue, because the ACPI GPIOs definitions used by gpio-keys-polled come from the GPIO controller which binds to gpio_amd_fch, so one shouldn't work without the other.

The APU board driver doesn't use APCI at all - and there shouldn't be any gpio ACPI bindings at all, since APCI just can't describe the whole setup in a sane way.

And certainly: DO NOT USE AMD0030 (!!!) - it's the WRONG device, and risk hardware damage.

> I think we have no other choice but to remove the ACPI bindings for the GPIOs The APCI bindings (which were introduced much later than the pcengines board driver - before you that, you folks actually suggest playing dangerous games with raw register access from userland!) have been broken from the beginning and should have never been introduced w/o at least speaking to the kernel maintainer. You broke it, not us.
metux commented 1 year ago

Okay, thanks. I think I understand now... So, we have the modules pcengines_apuv2 and leds_gpio (or gpio_keys_polled?) fighting over the gpio control of these leds.

You have the (stable) apu board driver fighting with broken newer firmware. Downgrade the firmware.

Is there any crucial functionality missing if I blacklist this module?

Depending on firmware version: several gpios (eg. reset lines), different and now suddenly model specific naming of leds, different keycode, ... or maybe loosing it all.

And by the way: this is isn't the correct place for discussing kernel issues - just for reporting broken firmware.

--mtx

bsdice commented 1 year ago

@metux Thanks for the insight. I perceive an aggressive tone though. Why that? I realize this is sometimes needed on LKML and Torvalds has been doing it for decades, but this isn't LKML... ;-) I think you just made people delete their posts because I see comments in my inbox which are no longer here.

Anyhow. APU2 seems EOL because of chip shortages. There hasn't been a Coreboot update since half a year, so I presume a contract with Pascal ran out. Which probably means this ACPI issue will never be fixed in Coreboot PCengines branch. I am inclined to patch it locally and compile one last final ROM for APU4D4 myself. Because userland obviously should never be able to damage hardware.

I also checked kernel 6.2 and somewhere in last months CONFIG_PINCTRL_AMD became bool, so can't make it a module anymore and blacklist it. Would need to disable it entirely. I realize in the field out there basically no-one will be running 6.2.

metux commented 1 year ago

There is no difference between ACPI definitions of the buttons and leds and the kernel driver.

There's a lot. If you had talked to me (or maybe asked Pascal + friends - I had explained it detail), before making such wild hacks, you would have known, that this statement is wrong:

The apu board driver adds only those gpios that are known to be safe, the LED names are not model specific. But the APCI definitions make the LEDs model specific, change the key code, drop the gpio naming, instantiate the wrong driver and expose the whole register bank - including entirely different things - as gpios.

  • the ACPI is better because it will expose the leds and buttons specific to the platform, so the kernel does not need to know which GPIOs on which apu variant is used, e.g. apu5 can be also supported in such way

As said, it names the LEDs in HW specific ways, thus userland needs to change again for every single board model. userland now also needs to be changed for different key code, and we loose other gpio like reset lines.

  • kernel driver will not support apu5 because it lacks GPIO definitions for it,

Support is under way. But since hw/firmware vendor is so uncooperative, not even attempts to talk to the official kernel maintainer, I didn't have this on high priority.

There is no good-for-all solution in this case.

No, that's a very bad solution. As said: existing userland - in the field (or even on high seas) for many years - now needs to be changed. And it's using the WRONG driver.

And no: ACPI just cannot describe the full setup in a sane way.

metux commented 1 year ago

I have done some testing with the the pcengines-apuv2 and came to the conclusion that the driver does not in fact conflict with the ACPI bindings, and the the error is caused by the gpio-keys-polled driver not implementing the interface that the pcengines-apuv2 driver uses to export the GPIOs. The leds-gpio driver implements this interface and there are no errors from the leds-gpio driver in firmware v4.14.0.4, there are just duplicate entries in /sys/class/leds.

Whohoh! You're binding two drivers to the same register space ? Good luck that.

As the mess w/ broken firmware and uncorpative vendor continues, I'm really considering APCI blacklisting.

This, combined with the fact that the driver is only present in recent kernel versions, makes me think that the ACPI bindings should be restored.

This breaks stable distros and make industrial customers unhappy.

Why should anybody still spend so much money for such an unreliable platform instead of some cheap china stuff from Wish or Aliexpress ?

--mtx

metux commented 1 year ago

@metux Please see https://pcengines.github.io/apu2-documentation/linux_issues/ for the rationale as to why the ACPI code for LEDs and GPIO was restored. The error message still appeared for users after we removed the ACPI bindings.

Your "fix" made it worse. Loosing existing functionality and breaking existing userland. And the commits this page mentions are quite unrelated to this.

Why didn't you just talk to me first ? Such things don't work well by a one man show. We have good reasons for our extensive open discussions and review process.

metux commented 1 year ago

@metux Thanks for the insight. I perceive an aggressive tone though.

Actually, I'm quite dissappointed and pissed. You probably don't know the long story behind this, so I give you a short overview:

Back when writing the apu board driver, I already was in contact w/ PCengines (they knew me and my role). There hadn't been any ACPI bindings back then. PCengines still adviced users to play dangerous games with raw register access from userland - very unstable and dangerous. (there only was some experimental, incomplete support in the apuv1 kernel driver, which practically no distro actually enabled - but these two are totally different device, so we split that out).

At that time there was no intent to change anything at PCengies side. The APU board driver (that doesn't use any acpi stuff) went into mainline and was shipped by Distros. So people now got mainline and Distro support for LEDs, front keys, sim switch and pcie/baseband reset. (assuming their hardware was correctly wired as documented in the schematics).

Some time later (too lazy to scan the git log again, when exactly that was), they suddenly added some pieces of ACPI bindings, in a quick shot (w/o even telling us first), changing the LED naming (now model specific) and key code and dropping all the other stuff. And the worst of all: highjacking an ACPI ID for some very different devices, that just happens to share some common ancestor with the actual one (creating potential for HW damage by uncautious users) Creating a situation where - depending on firmware version (!) - where users loose functionality and changing userland ABI - thus breaking existing userland.

And who's gonna have to deal with pissed users (or lots of broken installations in the field) ? Guess who !

I'm really fed up w/ cleaning up the mess of the firmware vendor again and again. Yes, that's not unusual in the x86 / acpi world, since board/firmware vendors like to do fire and forget.

I hoped that PCengines would be different, being the one x86 vendor doing some decent work.

But I'm DEEPLY DISAPPOINTED.

The whole mess could have been prevented several years ago, if somebody there had spent a few mins to pick up the phone and call me (or at least drop a mail). With a bit of cooperation we could have full mainline support for the whole APU line, all models, including the fresh ones.

Which probably means this ACPI issue will never be fixed in Coreboot PCengines branch. I am inclined to patch it locally and compile one last final ROM for APU4D4 myself.

Please do so, and submit a pull request. (sadly I'm too busy for also taking care of FW)

I actually thought about the same, but just have limited range of board models right now, so testing will be tricky. No idea why PCengines didn't just give the official maintainer some examples.

Because userland obviously should never be able to damage hardware.

Indeed. This was exactly the main reason for writing a new gpio driver.

I also checked kernel 6.2 and somewhere in last months CONFIG_PINCTRL_AMD became bool, so can't make it a module anymore and blacklist it.

Yes, that's because it's just meant for some very low level board/soc specific stuff - it doesn't make any sense to load it later (eg. after PCI setup).

Sorry for being noisy, but I have to repeat it again: this is about an entirely different device. Pcengines highjacked the device id of a different device -- thus broke ACPI rules.

Would need to disable it entirely. I realize in the field out there basically no-one will be running 6.2.

For the time being, I don't see much more practical chance except recompiling w/ CONFIG_PINCTRL_AMD.

Maybe (if I'm bored enought), I could add specific blacklisting in APCI core. But that will be last resort, we already have too much of such nasty quirks in the kernel. And at certain point it's really questionable whether we kernel folks should clean up the mess that HW vendors produce, just for free. At certain point we have to say: "sorry folks, not our problem, get some decent HW instead".

bsdice commented 1 year ago

@metux If I may ask, how did you stumble on this issue? Did you wreck a board? Wreck how much? Or did some commercial user wreck one? How is that your problem even?

Comparing an APU2..4 to one of those Made in China boards is too strong for me. Basically nobody does thermal management or signal integrity right. Pascal does, knows his stuff. And no-one has ECC RAM which works properly in an embedded AMD64 platform. Everything made in Asia will just crash instead.

I just built a release 4.19.0.1 for my APU4 using pce-fw-builder on coffee break. Now trying to figure out what I want to get rid of. https://github.com/pcengines/coreboot/commit/bca61d7cc6986d1b1158a45b3513b8af30edf77c and maybe something else.

bsdice commented 1 year ago

The file gpio.asl i.e. AMD0030 as mentioned by you was last touched in summer of 2019 (!) https://github.com/pcengines/coreboot/commits/release/src/mainboard/pcengines/apu2/acpi/gpio.asl

Are you certain that leds.asl and buttons.asl aren't just a minor nuissance which prevent pcengines_apuv2 from loading?

So delete

gpio.asl leds.asl buttons.asl

and also from dsdt.asl. Quite the lobotomization.

metux commented 1 year ago

@metux If I may ask, how did you stumble on this issue? Did you wreck a board? Wreck how much? Or did some commercial user wreck one? How is that your problem even?

After accidentially writing to some regs in the suppsed gpio bank, the board behaved erratic. Don't recall the exact details after all these years, but it was very unstable, funny hangups (guess never seen a running kernel on this one anymore). No idea what really happened, must have hit something deeply internal. (most likely the board went into trash)

What I do recall is the words "undefined" and "reserved" in some AMD FCH specs. Now we know that "reserved" really means "never ever touch it, or you break things".

Comparing an APU2..4 to one of those Made in China boards is too strong for me. Basically nobody does thermal management or signal integrity right. Pascal does, knows his stuff. And no-one has ECC RAM which works properly in an embedded AMD64 platform. Everything made in Asia will just crash instead.

Yes, Asian boards are usually a great deal of fun. If I wasn't under NDA, I could tell funny stories from various industrial places.

I just built a release 4.19.0.1 for my APU4 using pce-fw-builder on coffee break. Now trying to figure out what I want to get rid of. pcengines/coreboot@bca61d7 and maybe something else.

Look for the gpio ASL files.

metux commented 1 year ago

The file gpio.asl i.e. AMD0030 as mentioned by you was last touched in summer of 2019 (!) https://github.com/pcengines/coreboot/commits/release/src/mainboard/pcengines/apu2/acpi/gpio.asl

Are we looking at the same branch ? I've been looking in the latest release, and there was a relatively recent commit that reverted a removal of some gpio stuff from few years ago.

Are you certain that leds.asl and buttons.asl aren't just a minor nuissance which prevent pcengines_apuv2 from loading?

"minor" depends on whether you actually need this stuff and have existing applications written for the mainline kernel.

So delete

gpio.asl leds.asl buttons.asl

and also from dsdt.asl. Quite the lobotomization.

Yep, just drop that ASL and rebuild, should do the trick.

silentcreek commented 1 year ago

Due lack of exact information on your Kernel configuration, I can only guess:

I'm using the default Debian kernel configuration which enables the APU1 and APU2 drivers as modules: CONFIG_LEDS_APU=m CONFIG_PCENGINES_APU2=m That is for Debian 11 with kernel 5.10. On Debian 10 with kernel 4.19, the APU2 driver didn't exists, obviously.

a) you still got the old (meanwhile removed) apu1 driver running, and these two are conflicting (not sure whether Debian ever enabled the old one). Solution: kernel upgrade or delete the old driver

No. The APU1 module is not loaded by default on my system.

b) you catched the intermediate BROKEN firmware versions, which added some arbitrary, unthoughtful and incomplete "gpio support" to the ACPI tables. [...]

Which versions do you mean? Looking at the history of src/mainboard/pcengines/apu2/acpi/, the GPIO bindings were added in 2019, then they were removed in October 2021 and then the removal was reverted in November 2021. Do you mean all releases between July 2019 and today, except for then one between October and November 2021 are broken, intermediate releases?

In any case, yes, I was/am using those firmware releases.

I found a forum discussion [1] that suggests that this issue should be fixed in Linux 5.3 and that this is likely a module ordering issue. The former can't be true since my kernel is newer than 5.3, obviously.

Yes, there was a problem that leds_gpio or gpio_amd_fch might not have been loaded, when pcengines-apv2 tries to initialize them. Only happens if these are compiled as modules. You can test this by loading them in the correct order manually (command line):

1. gpio_amd_fch

2. leds-gpio

3. pcengines-apuv2

Doing this leads to the original kernel error message reported here. But that is, per your description, to due the APCI definitions in the firmware.

And you should blacklist (or not compile) amd-pinctrl - this is the wrong driver, which might be triggered by broken ACPI tables.

The module pinctrl_amd is not loaded on my system by default. So, it doesn't seem to be triggered by these ACPI tables (and I haven't blacklisted it).

Edit: Oh, Debian seems to compile-in the pinctrl_amd driver, rather than building it as a module (CONFIG_PINCTRL_AMD=y).

Do some gpio or led devices appear w/o loading the apu board driver ?

Yes, since I blacklisted the APU2 driver after this discussion here, the leds appear in sysfs. I didn't check other GPIO devices, because I was only interested in controlling the LEDs and that works just fine.

Is there some ACPI device "AMD0030" ?

I don't know the proper way to look for ACPI devices, but dmesg | grep -i acpi doesn't show any AMD0030 device.

Edit: I did find something under /sys/class/gpio: $ cat /sys/class/gpio/gpiochip320/device/modalias acpi:AMD0030:AMD0030:

silentcreek commented 1 year ago

Okay, thanks. I think I understand now... So, we have the modules pcengines_apuv2 and leds_gpio (or gpio_keys_polled?) fighting over the gpio control of these leds.

You have the (stable) apu board driver fighting with broken newer firmware. Downgrade the firmware.

From a user perspective, I think this is a bit shortsighted. The error manifested itself after a kernel upgrade (4.19 to 5.10) which brought a new driver. The firmware or hardware haven't changed and at that point, so it isn't too far-fetched to think it's a kernel regression - again, just as a mere user.

Now, I'm not trying to suggest that you're wrong (especially since I can't make any technical judgements here). I'm just trying to say, I wouldn't paint this as black and white, especially considering, that the first mainline release including your driver and the first firmware release with these ACPI bindings are just 3 months apart.

And by the way: this is isn't the correct place for discussing kernel issues - just for reporting broken firmware.

True, indeed. My bad. But to be fair, I think 3mdeb was in general really helpful with all sorts of issues, and unfortunately, I can't say I've always gotten replies to bug reports sent to LKML or Kernel Bugzilla.

silentcreek commented 1 year ago

So delete gpio.asl leds.asl buttons.asl and also from dsdt.asl. Quite the lobotomization.

Yep, just drop that ASL and rebuild, should do the trick.

So, I built a firmware with these three asl files removed, and their respective include statements removed from dsdt.asl as well. Now, when I boot my system (Debian 11 with Linux 5.10), I get the original kernel error messages again!

I checked that the APU1 driver is not loaded and I can also see that the gpiochip with the alias "acpi:AMD0030:AMD0030:" is gone (there is a different gpiochip now with the label gpio_amd_fch). In addition, the leds show up with different names now in sysfs and there are no duplicate entries, leading me to believe that removing the ACPI bindings worked.

So, I checked to see if the modules are loaded in the correct order, but that doesn't make a difference either.

You can test this by loading them in the correct order manually (command line):

1. gpio_amd_fch

2. leds-gpio

3. pcengines-apuv2

Even if I remove the modules and load them manually in this order, the error message shows up.

Any ideas how to resolve this?

silentcreek commented 1 year ago

@metux The more I think back to the history of this issue, the more I believe we're missing something here.

Due lack of exact information on your Kernel configuration, I can only guess: [...] b) you catched the intermediate BROKEN firmware versions, which added some arbitrary, unthoughtful and incomplete "gpio support" to the ACPI tables.

This is actually not causing the kernel errors I reported. It does not matter which firmware you use. The error still happens. After reading through this issue here again, I noticed that I actually did try a firmware release where 3mdeb removed the ACPI GPIO definitions (v4.14.0.5) back in October 2021. I still got the kernel errors. So, it's no surprise that my newly custom built firmware without the APCI bindings doesn't solve the problem either.

We're still missing the root cause for the kernel errors, it seems. (Now, again, I'm not trying to suggest, that your statement is wrong, that the ACPI bindings shouldn't be in the firmware, just, that it's off the point with regards to solving the issue at hand.)

bsdice commented 1 year ago

I tried to fix the error as well (Linux kernel 6.2.1, latest firmware lobotomized) but failed and ran out of time. Switched back to v4.16.0.1 03/07/2022 on APU4D4, which I think is one of the most stable for me. Updated to kernel 5.4.233 with 1004-gpio_keys_polled.patch and 1004-gpio_keys_polled.patch patches as mentioned, also with option CONFIG_PINCTRL_AMD=y to eliminate the driver race. Nothing blacklisted. Not using any GPIOs and I also don't care if LED names are platform-agnostic. ;-)

iam-TJ commented 1 month ago

Possibly late to the party but I fixed this the "easy" way by simply loading a modified DSDT at boot-time. Debian and many other distro kernel configs enable CONFIG_ACPI_TABLE_UPGRADE=y (see https://www.kernel.org/doc/html/latest/admin-guide/acpi/initrd_table_override.html )

The process is pretty straight-forward:

# copy the loaded DSDT
sudo cat /sys/firmware/acpi/tables/DSDT | dd of=dsdt.dat
# decompile
iasl -d dsdt.dat
# remove the GPIO, LED, and Button devices (and increment revision)
gawk '/DefinitionBlock/ {revision=strtonum($7) + 1; $7=sprintf( "0x%08x)", revision)} S==1 {brace=gensub(/{/,"}",1); S=2} /Scope \(_SB\.PCI0\)/{S=1} /Scope \(PCI0\)/{S=1} S==0 {print} brace==$0 {S=0}' dsdt.dsl >dsdt.modified.asl
# review the change
diff -u dsdt.dsl dsdt.modified.asl | less
# compile
iasl -sa dsdt.modified.asl

On Debian (using initramfs-tools) NOTE Debian functionality is currently broken and does NOT work as advertised. I've published a merge request to fix it but is unlikely to back-port to stable. My patch branch can be found at https://salsa.debian.org/Iam_Tj/initramfs-tools/-/tree/fix-ACPI-DSDT

sudo cp ./dsdt.modified.aml /etc/initramfs-tools/DSDT.aml
sudo update-initramfs -u -v

Confirmation appears at the end of the verbose (-v) report:

Adding DSDT /etc/initramfs-tools/DSDT.aml

Only when NOT using patched Debian initramfs-tools integrated DSDT load support:

# create the required initrd layout
mkdir --parents kernel/firmware/acpi
cp dsdt.modified.aml kernel/firmware/acpi/dsdt.aml
# create the ACPI over-ride initrd
find kernel -type d -o -type f -name '*.aml' | cpio -H newc --create > initrd.acpi.img
# capture name of **EXISTING** initrd.img (naming varies across distros - this is Debian 12 with a symbolic link to the default)
i=$(realpath -e /boot/initrd.img); echo $i
# **move** to a back-up
sudo mv "${i}" "${i}.bak"
# concatenate default onto ACPI over-ride and replace default initrd
sudo cat ./initrd.acpi.img "${i}.bak" | sudo dd of="${i}"

If all goes well on reboot the kernel should report something similar to:

apu2 kernel: ACPI: Table Upgrade: override [DSDT-COREv4-COREBOOT]
apu2 kernel: ACPI: DSDT 0x00000000CFE9C280 Physical table override, new table: 0x00000000CFE87000
apu2 kernel: ACPI: DSDT 0x00000000CFE87000 00173B (v02 COREv4 COREBOOT 00010002 INTL 20200925)
...
apu2 kernel: ACPI: DSDT ACPI table found in initrd [kernel/firmware/acpi/dsdt.aml][0x173b]

And now the generic naming of the items ("apu" not "apu2") is in use so any scripts that manipulate them will need correcting:

$ ls /sys/class/leds/
apu:green:1  apu:green:2  apu:green:3  ath10k-phy0  mmc0::
pietrushnic commented 3 weeks ago

Thanks. Let me discuss with the team the correct approach here. Then, we can extend the validation suite for Dasharo (coreboot+SeaBIOS) to check if this method works on every release. It will likely not happen in the upcoming one, but it will happen in the next one.