pcengines / apu2-documentation

Documentation and scripts for building and adjusting PC Engines APU2 firmware
https://pcengines.github.io/apu2-documentation/

2 x APU2E4 unstable with CPB enabled. #251

Open MyGithubUser01 opened 3 years ago

MyGithubUser01 commented 3 years ago

Hi all,

I'm having serious stability issues with an APU2E4 when CPB is enabled, on BIOS 4.13.0.1 and 4.13.0.5. This is brand new hardware; the first unit was believed to be "DOA", but the replacement I got has the exact same issue. After disabling CPB the system appears to be stable and has reached a record uptime of 4 days and counting.

Operating systems tested: OPNsense 21.1 and 21.1.5. I've tried booting from mSATA, SD card and USB, with the same result each time. I've also tried multiple power adapters. The CPU temperature is typically in the range 54-56 °C, and the system isn't even connected to any network, just the console cable.

The system has been very unstable, core dumping every 4-12 hours. That is with BIOS 4.13.0.1, but I saw similar issues when testing 4.13.0.5.

From the logs/console I see the following:

FreeBSD/amd64 (OPNsense.localdomain) (ttyu0)
login: MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0x282060
[HBSD SEGVGUARD] [/usr/local/bin/python3 (5880)] Suspension expired.
-> pid: 5880 ppid: 1302 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>

And:

root@OPNsense:/var/db/rrd # MCA: Bank 1, Status 0xd400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR OVER ICACHE L1 IRD error
MCA: Address 0xffff80d1ff60

Let me know if additional details are required. Broken hardware, bios bug, OPNsense/HardenedBSD compatibility issues?

miczyg1 commented 3 years ago

Enabling CPB additionally enables the core/package C6 states. I have recently discovered some bugs in coreboot around C6 and its save state area in DRAM, which may be causing problems when CPB is enabled. The patches to coreboot have already been sent, so I can test whether they resolve your issue. If I understood correctly, there is no need to stress the firewall device to trigger this?

MyGithubUser01 commented 3 years ago

Thank you very much for the feedback. That is correct: there is no need to run anything on the firewall. The most stressful thing I've been running is "top".

I don't even have network cables attached to the firewall.

miczyg1 commented 3 years ago

@MyGithubUser01 I left the OPNsense 21.1 installer running on an apu2 overnight yesterday with CPB enabled (20 hours have elapsed since I left the machine idling). There was not a single MCA error on the serial console with apu2 v4.14.0.1, which contains the fixes I mentioned in the previous comment. Could you please give v4.14.0.1 a try? Let me know if it helps in your case.

MyGithubUser01 commented 3 years ago

Hi,

Thank you very much for the update. I've now updated to 4.14.0.1 and made sure CPB is enabled (it looks like it was enabled after flashing). I started the firewall about 24h ago with the serial console and WAN connected, but this morning I only had 5h of uptime and found the below in the console/log. This means it happened after 16-20h.

It looks like it doesn't happen as often as before, but I'm not sure.

WARNING: attempt to domain_add(netgraph) after domainfinalize()
pid 26616 (python3.7), jid 0, uid 0: exited on signal 11 (core dumped)
[HBSD SEGVGUARD] [/usr/local/bin/python3 (88561)] Suspension expired.
-> pid: 88561 ppid: 82917 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0xffff811f7c40
pid 95751 (python3.7), jid 0, uid 0: exited on signal 10 (core dumped)
[HBSD SEGVGUARD] [/usr/local/bin/python3 (74746)] Suspension expired.
-> pid: 74746 ppid: 65084 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0x63b4b76aae0
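For anyone comparing captures like the one above, here is a minimal Python sketch that pulls the (bank, status, address) triples out of FreeBSD-style "MCA:" console output. The function name `mca_events` and the `SAMPLE` text are illustrative only; the regular expressions assume the exact line wording seen in this thread and may need adjusting for other kernels.

```python
import re

# "MCA: Bank N, Status 0x..." opens each event; the "Global Cap ..., Status"
# line does not match because it lacks the "Bank N," prefix.
STATUS_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")
ADDRESS_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def mca_events(text):
    """Return a list of (bank, status, address) tuples found in console text.

    address is None when no "MCA: Address" line follows the status line
    before the next event starts.
    """
    events = []
    starts = list(STATUS_RE.finditer(text))
    for i, match in enumerate(starts):
        # Only look for the address between this event and the next one.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        addr = ADDRESS_RE.search(text, match.end(), end)
        events.append((
            int(match.group(1)),
            int(match.group(2), 16),
            int(addr.group(1), 16) if addr else None,
        ))
    return events


# Two events copied from the console output in this comment.
SAMPLE = """\
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0xffff811f7c40
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0x63b4b76aae0
"""

if __name__ == "__main__":
    for bank, status, address in mca_events(SAMPLE):
        print(f"bank={bank} status={status:#018x} address={address:#x}")
```

Because the search works on the raw text rather than line by line, it also copes with logs that were pasted as a single wrapped line.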

miczyg1 commented 3 years ago

Somehow I cannot reproduce it, and we have never run into MCA errors before.

The warning WARNING: attempt to domain_add(netgraph) after domainfinalize() looks suspicious. I have found a similar issue here: https://forum.opnsense.org/index.php?topic=17417.0 Maybe following this thread could help you a bit?

v1k4 commented 3 years ago

Brand new APU4D4 here. Crashing multiple times a day when CPB enabled. FW versions v4.14.0.1 and v4.14.0.2.

Currently running Proxmox on Buster, and I get many errors of this kind before the APU eventually hangs or crashes.

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: Error Addr: 0x0000ffff86b0b9e0
[Hardware Error]: MC1 Error: Data/tag array parity error for a tag hit.
[Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Disabling CPB seems to make it stable for now.

MyGithubUser01 commented 3 years ago

This sounds very similar to what I'm experiencing with CPB enabled, thanks for chiming in. The same address is mentioned: 0x9400000000000151

Are these all related?

https://forum.netgate.com/topic/156761/page-fault-while-in-kernel-mode-on-apu2-after-bios-coreboot-upgrade/4
https://forum.netgate.com/topic/156830/could-you-help-me-analyze-these-crashdumps/5

miczyg1 commented 3 years ago

0x9400000000000151 is not the address but the actual content of the 64-bit MC1_STATUS register. It simply means that the same error was triggered, but the address may still be different. The address where it was triggered is given in the line [Hardware Error]: Error Addr: 0x0000ffff86b0b9e0, and the error code is decoded in the lines that follow it.

Still, what is written in the forum is not exactly true. CPU Boost does not raise the memory clock frequency; it cannot, because that would require retraining the memory at the new frequency (only the BIOS can train the memory, and once that is done the memory frequency is fixed).
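To make the status value less opaque, here is a rough Python sketch that splits a machine-check status word into its flag bits and, for cache (memory hierarchy) errors, the access type, transaction type and cache level. It assumes the generic x86 MCA layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 to 57, and the `0000 0001 RRRR TTLL` compound error code in the low 16 bits); the function name `decode_mc_status` is made up for illustration, and model-specific bits are ignored, so treat the AMD BKDG for this CPU family as the authority.

```python
# Flag bits in the upper part of the status register (generic x86 MCA layout).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error (0 means corrected)
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# Fields of the memory-hierarchy error code: 0000 0001 RRRR TTLL.
ACCESS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
          5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "INSN", 1: "DATA", 2: "GEN"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_mc_status(status):
    """Decode flag bits and, where applicable, the cache error fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    decoded["error_code"] = code
    if code >> 8 == 0x01:  # memory-hierarchy (cache) error format
        decoded["access"] = ACCESS.get((code >> 4) & 0xF, "reserved")
        decoded["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        decoded["level"] = LEVEL[code & 0x3]
    return decoded


if __name__ == "__main__":
    for value in (0x9400000000000151, 0xD400000000000151):
        print(f"{value:#018x}", decode_mc_status(value))
```

Under these assumptions, both status values from this thread decode to a corrected (UC clear) L1 instruction-fetch error with a valid address; the 0xd4... variant only adds the OVER flag, which matches the "COR OVER ICACHE L1 IRD" wording in the FreeBSD output.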

CPB is not an overclocking feature! It simply raises the CPU clock frequency to the limits allowed by the CPU specification. Overclocking would be to go higher than what CPB provides (i.e. higher than 1400MHz).
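If you want to check the boost state from a running Linux system rather than from the firmware menu, the cpufreq subsystem exposes a global switch at /sys/devices/system/cpu/cpufreq/boost when the active driver supports it. The sketch below only reads that file. Whether the file is present on an APU2, and whether it tracks the firmware's CPB setting, is an assumption that has not been verified in this thread; the helper name `read_boost_state` is hypothetical.

```python
from pathlib import Path

# Global cpufreq boost switch; only present when the active driver supports it.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


def read_boost_state(path=BOOST_PATH):
    """Return True if boost is on, False if off, None if it cannot be determined."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable: the driver does not expose the switch.
        return None
    if value == "1":
        return True
    if value == "0":
        return False
    return None  # unexpected content


if __name__ == "__main__":
    state = read_boost_state()
    if state is None:
        print("boost state not exposed by this kernel/driver")
    else:
        print("boost enabled" if state else "boost disabled")
```

A result of None says nothing about the firmware setting; in that case the firmware setup menu remains the place to check.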

toredash commented 2 years ago

Wanted to chime in on this. I have had the same stability issues described here with my APU2E4 for many years. It would work for weeks at a time, then have stability issues daily over a period of several weeks, then go back to a few weeks between each failure.

I have disabled CPB now, and so far it looks good. But it has only been a week, so I will have to wait a few months to really be sure.

I've found others pointing to the same thing, that CPB causes issues:

https://forum.netgate.com/topic/156761/page-fault-while-in-kernel-mode-on-apu2-after-bios-coreboot-upgrade/38
https://www.reddit.com/r/homelab/comments/lokgyg/solution_to_pc_engines_apu2e4_having_constant/

edit: 37d uptime and no issues encountered after CPB was disabled

toredash commented 2 years ago

After a power outage, my APU2E4 started to misbehave again, randomly locking up.

I had to check whether CPB had for some reason been re-enabled, and sure enough, it was enabled again.

I'll report back if it is still stable.

mkopec commented 2 years ago

Hi @toredash, did you experience any more lock-ups since disabling CPB?

ghost commented 2 years ago

Apologies for a "me too" comment but unfortunately for me, the described symptoms in this issue are also somewhat occurring with my APU2E4. Alas, I'm long past the warranty period as I've purchased mine in the summer of 2020. I have not tested with Linux since I am mainly running pfSense on this unit. This occurs with 2.6.0 and a few previous versions.

I've only experienced it randomly crashing and restarting a few times, but since this unit is operating as a firewall for my home connection, I'd rather turn off CPB to make the unit stable again than deal with the instability and MCA errors. If at all possible, I would gladly appreciate a fix to this issue as I could use the extra performance to handle bursty traffic flows since I have a gigabit internet connection. I am willing to turn CPB on again and offer my help in debugging the problem.

As mentioned just above, I also saw the MCA errors with CPB enabled, but unfortunately I don't have the logs anymore because I have since done a fresh install (I stumbled across this issue randomly when reading the documentation for something unrelated). They appeared very similar, and the error seemed to be associated with CPU2 in my case. With CPB off, I never see them appear in the syslog.

I do, however, still see errors in the pfSense logs that are seemingly related to the firmware and the first i210 NIC, and I don't know if they are related.

May 19 22:43:04     kernel      igb2: netmap queues/slots: TX 4/1024, RX 4/1024
May 19 22:43:04     kernel      igb2: Ethernet address: <REDACTED>
May 19 22:43:04     kernel      igb2: Using MSI-X interrupts with 5 vectors
May 19 22:43:04     kernel      igb2: Using 4 RX queues 4 TX queues
May 19 22:43:04     kernel      igb2: Using 1024 TX descriptors and 1024 RX descriptors
May 19 22:43:04     kernel      igb2: NVM V0.6 imgtype5
May 19 22:43:04     kernel      igb2: <Intel(R) I210 Flashless (Copper)> port 0x3000-0x301f mem 0xd0200000-0xd021ffff,0xd0220000-0xd0223fff irq 36 at device 0.0 on pci3
May 19 22:43:04     kernel      pci3: <ACPI PCI bus> on pcib3
May 19 22:43:04     kernel      pcib3: <ACPI PCI-PCI bridge> irq 27 at device 2.4 on pci0
May 19 22:43:04     kernel      igb1: netmap queues/slots: TX 4/1024, RX 4/1024
May 19 22:43:04     kernel      igb1: Ethernet address: <REDACTED>
May 19 22:43:04     kernel      igb1: Using MSI-X interrupts with 5 vectors
May 19 22:43:04     kernel      igb1: Using 4 RX queues 4 TX queues
May 19 22:43:04     kernel      igb1: Using 1024 TX descriptors and 1024 RX descriptors
May 19 22:43:04     kernel      igb1: NVM V0.6 imgtype5
May 19 22:43:04     kernel      igb1: <Intel(R) I210 Flashless (Copper)> port 0x2000-0x201f mem 0xd0100000-0xd011ffff,0xd0120000-0xd0123fff irq 32 at device 0.0 on pci2
May 19 22:43:04     kernel      pci2: <ACPI PCI bus> on pcib2
May 19 22:43:04     kernel      pcib2: <ACPI PCI-PCI bridge> irq 26 at device 2.3 on pci0
May 19 22:43:04     kernel      igb0: netmap queues/slots: TX 4/1024, RX 4/1024
May 19 22:43:04     kernel      igb0: Ethernet address: <REDACTED>
May 19 22:43:04     kernel      igb0: Using MSI-X interrupts with 5 vectors
May 19 22:43:04     kernel      igb0: Using 4 RX queues 4 TX queues
May 19 22:43:04     kernel      igb0: Using 1024 TX descriptors and 1024 RX descriptors
May 19 22:43:04     kernel      igb0: NVM V0.6 imgtype5
May 19 22:43:04     kernel      igb0: <Intel(R) I210 Flashless (Copper)> mem 0xd0000000-0xd001ffff,0xd0020000-0xd0023fff irq 28 at device 0.0 on pci1
May 19 22:43:04     kernel      pci1: <ACPI PCI bus> on pcib1
May 19 22:43:04     kernel      pcib1: failed to allocate initial I/O port window: 0x1000-0x1fff
May 19 22:43:04     kernel      pcib1: <ACPI PCI-PCI bridge> irq 25 at device 2.2 on pci0
May 19 22:43:04     kernel      pci0: <base peripheral, IOMMU> at device 0.2 (no driver attached)
May 19 22:43:04     kernel      pci0: <ACPI PCI bus> on pcib0
May 19 22:43:04     kernel      pcib0: could not evaluate _ADR - AE_NOT_FOUND
May 19 22:43:04     kernel      pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0 

The relevant lines/errors:

May 19 22:43:04 kernel pcib1: failed to allocate initial I/O port window: 0x1000-0x1fff

and

May 19 22:43:04 kernel pcib0: could not evaluate _ADR - AE_NOT_FOUND

Again, I don't know if the above lines/errors are relevant to the instability issue with CPB at hand.

toredash commented 2 years ago

> Hi @toredash, did you experience any more lock-ups since disabling CPB?

No, my device has been stable since CPB was disabled.

daduke commented 2 years ago

FTR I've also had bad stability issues with an apu6 that seem to be solved by disabling CPB. Maybe CPB shouldn't be enabled by default?

toredash commented 2 years ago

Update: My device is still stable after several months.

toredash commented 1 year ago

v4.14.0.6

On Tue, 11 Oct 2022 at 09:46, kurtselbach wrote:

Perfect, thanks for the update. Got two unused APU2E4 in a drawer...

Which BIOS version are you running now?


mdickers47 commented 1 year ago

FWIW, I have a very old APU2 that became extremely unstable with CPB and Linux 6.2.5. It generates a lot of different "null pointer dereference," "unable to handle page fault," and "soft lockup" panics. It lasts no more than a few hours per reboot, and sometimes only a few minutes. Disabling CPB in the BIOS seems to have solved it.

Here is one of the common panics:

[ 4238.591613] BUG: kernel NULL pointer dereference, address: 0000000000000003
[ 4238.598622] #PF: supervisor write access in kernel mode
[ 4238.603856] #PF: error_code(0x0002) - not-present page
[ 4238.609046] PGD 106842067 P4D 106842067 PUD 103a51067 PMD 0 
[ 4238.614744] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 4238.619132] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.5-arch1-1 #1 fcf70e9d97e045884ea945a3d5b5ff73b06f7a27
[ 4238.629245] Hardware name: PC Engines apu2/apu2, BIOS v4.19.0.1 01/31/2023
[ 4238.636132] RIP: 0010:psi_group_change+0x2f/0x400
[ 4238.640906] Code: 41 57 48 63 c6 49 89 ff 41 56 41 55 41 54 41 89 cc 55 53 48 83 ec 20 48 8b 5f 30 48 03 1c c5 c0 da eb ab 4c 89 04 24 83 03 01 <44> 89 4c 24 10 48 89 44 24 08 f6 c2 10 0f 85 ea 02 00 00 f6 c1 10
[ 4238.659677] RSP: 0018:ffff9b4e800d3dc0 EFLAGS: 00010002
[ 4238.664929] RAX: 0000000000000003 RBX: ffffbb4e7fd81dc0 RCX: 0000000000000010
[ 4238.672104] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8c0a40d61800
[ 4238.679258] RBP: 0000000000000003 R08: 000003dadfbf0c5d R09: 0000000000000001
[ 4238.686442] R10: 0000000000000001 R11: 0000000000000100 R12: 0000000000000010
[ 4238.693593] R13: 0000000000000003 R14: ffff8c0a46408000 R15: ffff8c0a40d61800
[ 4238.700744] FS:  0000000000000000(0000) GS:ffff8c0a6ad80000(0000) knlGS:0000000000000000
[ 4238.708852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4238.714618] CR2: 0000000000000003 CR3: 0000000106eb0000 CR4: 00000000000406e0