pcengines / apu2-documentation

Documentation and scripts for building and adjusting PC Engines APU2 firmware
https://pcengines.github.io/apu2-documentation/

CPU clock stuck at low frequency (APU2C4, v4.10.0.0, v4.11.0.6, v4.13.0.2) #248

Open thowe-switzerland opened 3 years ago

thowe-switzerland commented 3 years ago

Basically my APU2C4 still seems to suffer from this (closed) problem: https://github.com/pcengines/coreboot/issues/196

Documentation: https://github.com/pcengines/apu2-documentation/blob/master/docs/debug/cpu_frequency.md

After a reboot, the CPU works normally for a few hours and clocks up to nearly 1000 MHz. After some time, however, the maximum frequency is stuck at 600 MHz again until the next reboot.

Turbostat

pkg add http://pkg0.isc.freebsd.org/FreeBSD:12:amd64/latest/All/turbostat-4.17_2.txz
rehash
kldload cpuctl
turbostat --interval 3

turbostat version 17.06.23 - Len Brown lenb@kernel.org
CPUID(0): AuthenticAMD 13 CPUID levels; family:model:stepping 0xf:30:1 (15:48:1)
CPUID(1): SSE3 MONITOR - - - TSC MSR - -
CPUID(6): APERF, No-TURBO, No-DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
CPUID(7): No-SGX
NSFOD /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ
-     -    208      34.75  599      998      815
0     0    219      36.55  598      997      85
1     1    207      34.52  599      998      581
2     2    203      33.91  599      998      77
3     3    204      34.03  599      998      72
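For anyone who wants to log this over a longer period instead of watching the terminal, here is a minimal Python sketch that parses the per-CPU rows of turbostat output like the one above and flags the "stuck" pattern: cores that are clearly busy but whose Bzy_MHz never leaves the lowest step. The function names and the two thresholds (650 MHz ceiling, 30 % busy) are assumptions picked for illustration, not anything turbostat or the firmware defines.

```python
from typing import Dict, List

# Column layout of the turbostat table shown in this report.
HEADER = ["Core", "CPU", "Avg_MHz", "Busy%", "Bzy_MHz", "TSC_MHz", "IRQ"]


def parse_turbostat_rows(text: str) -> List[Dict[str, str]]:
    """Return every table row that follows a turbostat header line."""
    rows = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields == HEADER:
            in_table = True  # header repeats once per interval
            continue
        if in_table and len(fields) == len(HEADER):
            rows.append(dict(zip(HEADER, fields)))
    return rows


def looks_stuck(rows: List[Dict[str, str]],
                ceiling_mhz: float = 650.0,    # assumed threshold
                min_busy_pct: float = 30.0) -> bool:  # assumed threshold
    """True if every busy core reports a Bzy_MHz at or below the ceiling."""
    per_cpu = [r for r in rows if r["CPU"] != "-"]  # skip the summary row
    busy = [r for r in per_cpu if float(r["Busy%"]) >= min_busy_pct]
    return bool(busy) and all(float(r["Bzy_MHz"]) <= ceiling_mhz for r in busy)


SAMPLE = """\
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ
- - 208 34.75 599 998 815
0 0 219 36.55 598 997 85
1 1 207 34.52 599 998 581
2 2 203 33.91 599 998 77
3 3 204 34.03 599 998 72
"""

print(looks_stuck(parse_turbostat_rows(SAMPLE)))  # prints True for this sample
```

Feeding it the captured output of `turbostat --interval 3` from a cron job or a small wrapper would at least give a timestamp for when the board drops into the 600 MHz state.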

Stress test also results in low bogo ops:

# stress-ng --cpu 4 --cpu-method matrixprod --timeout 30 --metrics
stress-ng: info:  [55320] dispatching hogs: 4 cpu
stress-ng: info:  [55320] successful run completed in 30.09s
stress-ng: info:  [55320] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [55320]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [55320] cpu                 988     30.06     75.13      0.05        32.87          13.14

Hardware: APU2C4

Verified and affected BIOS versions (basically all versions I tried): v4.10.0.0, v4.11.0.6, v4.13.0.2

More aspects:

Ideally, the problem can be solved with a BIOS fix. (My hope)

Workaround: Are there ways to prevent the CPU from going down to 600 MHz? Preferably without having to turn off the Power Boost?

Or do I have to switch the APU board?

miczyg1 commented 3 years ago

@krystian-hebel any insights about it?

krystian-hebel commented 3 years ago

@thowe-switzerland 1) Is this APU in official case? I assume it is, but asking just to confirm.

2) Does this bug occur with CPU Boost disabled? Are there any other unexpected hangs, problems with reboot etc.? We had some reports of CPB not working reliably on some platforms while being perfectly fine on others. This is why we introduced an option to turn it off; for a couple of releases we had it fixed as enabled. This is most likely a hardware issue specific to individual units, not to the model as a whole. Even with straight overclocking of PC CPUs you can find "better" and "worse" units of the same CPU model, you just have to get lucky. Unfortunately, none of our platforms in the lab have these issues, so we are forced to rely on users' reports only...

3) Do you have any unusual peripheral devices, storage, multiple WiFi adapters?

4) Have you tested if a non-BSD OS behaves the same way?

thowe-switzerland commented 3 years ago

@krystian-hebel

Thanks for having a look at it.

  1. It is in the original case by PC Engines (a red one ;-) ). Core temperatures all stay below 63 degrees Celsius, which is normal.

  2. I did not test this. Maybe I will give it a try.

  3. No special hardware. Only the standard 16GB SSD original by PC Engines.

  4. I did not test with another OS as the hardware is a firewall in production.

I still believe in the APU. So today I ordered a new APU2E4 system (with blue case and 30GB SSD) from PC Engines. As soon as this is operational and replaces the APU2C4 I can do tests (CPB disabled, other OS).

thowe-switzerland commented 3 years ago

The new APU2E4 arrived today. It is assembled, the configuration has been transferred, and the firewall is already in production. I will keep an eye on it regarding the low frequency. So far it is operating normally and feels snappy.

This also means that the affected older APU2C4 is currently a spare device and ready for the tests requested by @krystian-hebel. I will get back with answers to the questions as soon as I have them.

thowe-switzerland commented 3 years ago

Observations so far:

So the effect seems not to be related to that one specific APU2C4 board but a general issue - at least in this specific software/load configuration.

I am still trying to provoke the issue on the old (now isolated) APU2C4 in order to understand which conditions have to be met to get into this 600 MHz state. Without active NICs (no IRQs on the CPUs) I have not been able to provoke it so far.

thowe-switzerland commented 3 years ago

Current Turbostat of the new APU2E4 (v4.13.0.2, CPU Boost enabled, OPNsense):

root@router:/home/weth # turbostat --interval 3
turbostat version 17.06.23 - Len Brown lenb@kernel.org
CPUID(0): AuthenticAMD 13 CPUID levels; family:model:stepping 0xf:30:1 (15:48:1)
CPUID(1): SSE3 MONITOR - - - TSC MSR - -
CPUID(6): APERF, No-TURBO, No-DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
CPUID(7): No-SGX
NSFOD /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ
-     -    325      54.32  599      998      354
0     0    225      37.59  599      998      95
1     1    331      55.26  599      998      114
2     2    418      69.77  599      998      88
3     3    327      54.67  599      998      57
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ
-     -    455      75.99  599      998      322
0     0    406      67.87  599      998      98
1     1    506      84.42  599      998      118
2     2    496      82.79  599      998      71
3     3    413      68.89  599      998      35

thowe-switzerland commented 3 years ago

Interim Report

My plan is to provoke the issue on the spare system because it is not well received by users if I constantly experiment on the production firewall. I'll try to put a simulated network load on it.

krystian-hebel commented 3 years ago

@thowe-switzerland thanks for the reports. The fact that the issue doesn't happen without NIC activity may explain why we weren't able to reproduce this on our side. I'll ask our validation team if we can do something about it, but I can't promise we will be able to start testing before the weekend, as the platforms will soon be needed for testing the new release.

thowe-switzerland commented 3 years ago

I tested with Core Performance Boost disabled. And it seems to make no difference: Initially CPU went up to 1GHz. After some hours it was stuck at 600MHz. I did not expect this. So I re-checked BIOS settings, and yes CPU Boost really is disabled.

thowe-switzerland commented 3 years ago

In the OPNsense forum we are trying to shed some more light on this. In the process, we have discovered:

  1. Other users are also affected.
  2. The problem does not seem to occur when the powerd of FreeBSD is either completely switched off or running on the profile "Maximum".

miczyg1 commented 3 years ago

@krystian-hebel Ad. 2: doesn't that mean the CPU driver/frequency scaler in FreeBSD is faulty? The profile Maximum should be equivalent to always running in the highest P-state, so there is no switching to lower P-states (and thus lower frequencies).
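To check that expectation on a running box, FreeBSD exposes the available steps and the current setting through the `dev.cpu.0.freq_levels` and `dev.cpu.0.freq` sysctls. A rough Python sketch, under stated assumptions: the helper names are made up for this example, and the sample string is hypothetical (the 1000/800/600 MHz steps are the ones mentioned in this thread, the `-1` power values are just a placeholder for "unknown"). `read_sysctl` is only defined here, not called, so the sketch also runs on a non-FreeBSD machine.

```python
import subprocess
from typing import List


def parse_freq_levels(value: str) -> List[int]:
    """Turn a 'freq/power freq/power ...' string into MHz values, highest first."""
    return sorted((int(item.split("/")[0]) for item in value.split()), reverse=True)


def at_highest_level(current_mhz: int, levels: List[int]) -> bool:
    """True if the current frequency is the top entry of the level list."""
    return bool(levels) and current_mhz >= levels[0]


def read_sysctl(name: str) -> str:
    """Read one sysctl value via `sysctl -n` (FreeBSD only; unused below)."""
    result = subprocess.run(["sysctl", "-n", name],
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Hypothetical freq_levels value, not captured from a real APU.
EXAMPLE_LEVELS = "1000/-1 800/-1 600/-1"

levels = parse_freq_levels(EXAMPLE_LEVELS)
print(levels)                          # [1000, 800, 600]
print(at_highest_level(600, levels))   # False
print(at_highest_level(1000, levels))  # True
```

On the affected system one would pass `read_sysctl("dev.cpu.0.freq_levels")` and `int(read_sysctl("dev.cpu.0.freq"))` instead of the sample values. With the Maximum profile, `at_highest_level` should then stay true all the time.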

thowe-switzerland commented 3 years ago

From my point of view, that cannot be ruled out.

What puzzles me a bit, however, is that it sometimes runs flawlessly for a few hours. And in the lab, without a NIC connected, the problem does not occur at all, even though I generate a strongly fluctuating load with stress-ng.

krystian-hebel commented 3 years ago

@miczyg1 depends how the driver defines "maximum". If it were aware of CPB then throttling down other cores may result in higher performance when there is only one CPU-heavy process. I think we may assume that it behaves as you described, with no software P-state switching whatsoever.

Now we have to think what may be the common denominator of powerd and NIC. Perhaps an interrupt in a bad place in powerd code messes something up, but this is only a guess at this point.

miczyg1 commented 3 years ago

@thowe-switzerland with the recent v4.14.0.1 we have fixed some issues related to CPU boost and C-states which may help with the problem of idling CPUs and stuck frequencies. It should also improve the stability of the BSD systems. Please try to install the latest (v4.14.0.1) firmware and let me know if the problem persists.

thowe-switzerland commented 2 years ago

Thanks for the tip. I apologize that I haven't gotten around to putting this BIOS into production yet. And I can only reproduce the problem in production. In the test environment I never succeeded, probably because the traffic pattern is just completely different...

I hope to have an opportunity to update the BIOS soon. I will report then.

For the time being, powerd on "maximum" works fine.

thomaswenger commented 2 years ago

First: I apologize that I could only now test the new BIOS.

I updated the BIOS to the most current version yesterday. Since then, the firewall in question has been running with powerd on highadaptive - no problems. I.e. the CPU frequency oscillates nicely up to 1GHz without problems.

I currently assume that the problem is solved. Can you tell what exactly caused the problem?

If, contrary to expectations, I see the problem again, I would post it here. Currently it looks good.

Many thanks & best regards Thomas

miczyg1 commented 2 years ago

> I currently assume that the problem is solved. Can you tell what exactly caused the problem?

coreboot didn't include the core C6 (CC6) save state memory in the memory map. OS could accidentally access this memory and overwrite core states. CC6 is required for CPU boost to work and is a lower power state for a core.
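To make the failure mode a bit more concrete, here is a small Python sketch of the kind of check involved: does any part of the save-state area fall inside a range that the memory map reports as usable RAM? All addresses and sizes below are hypothetical and chosen only for illustration. They are not the real CC6 location on the APU2, and the `Region` type and function name are invented for this example, not taken from coreboot.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int
    size: int
    kind: str  # "ram" (usable by the OS) or "reserved"

    @property
    def end(self) -> int:
        return self.start + self.size


def usable_overlap(target: Region, memory_map: List[Region]) -> int:
    """Bytes of `target` that lie inside ranges reported as usable RAM.

    Assumes the RAM ranges in the map do not overlap each other.
    """
    total = 0
    for region in memory_map:
        if region.kind != "ram":
            continue
        lo = max(region.start, target.start)
        hi = min(region.end, target.end)
        if hi > lo:
            total += hi - lo
    return total


# Hypothetical save-state area: 16 MiB ending at 0xE0000000.
SAVE_AREA = Region(0xDF000000, 0x01000000, "reserved")

# Map that forgets the area: RAM is reported right up to 0xE0000000.
broken_map = [Region(0x00100000, 0xDFF00000, "ram")]

# Map that carves the area out and marks it reserved.
fixed_map = [Region(0x00100000, 0xDEF00000, "ram"), SAVE_AREA]

print(hex(usable_overlap(SAVE_AREA, broken_map)))  # 0x1000000
print(usable_overlap(SAVE_AREA, fixed_map))        # 0
```

With the first map the OS is free to allocate pages on top of the saved core state; with the second it is told to keep out.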

> If, contrary to expectations, I see the problem again, I would post it here. Currently it looks good.
>
> Many thanks & best regards Thomas

I hope the issue is really resolved. Let's wait a few more days and then close the issue if nobody reports further problems. It can be reopened at any time.

thomaswenger commented 2 years ago

Perfect, thanks a lot!

The APU has now been running for 2 days and adapts perfectly to the load. And the CPU can clock up to the maximum frequency again.

You have done well!

mkopec commented 2 years ago

Judging from the lack of new messages in this thread, I think it's safe to assume that the problem has been solved. Closing this issue.

thowe-switzerland commented 2 years ago

Thank you for asking.

Initially, it really did look as if the problem was solved, and it stayed that way for quite some time.

To be sure, I just logged into the firewall again and looked.

Unfortunately, I find that despite new firmware, the problem is still there.

root@router:~ # turbostat --interval 3
turbostat version 17.06.23 - Len Brown lenb@kernel.org
CPUID(0): AuthenticAMD 13 CPUID levels; family:model:stepping 0xf:30:1 (15:48:1)
CPUID(1): SSE3 MONITOR - - - TSC MSR - -
CPUID(6): APERF, No-TURBO, No-DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
CPUID(7): No-SGX
NSFOD /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ
-     -    243      40.65  599      998      649
0     0    185      30.91  599      998      195
1     1    184      30.68  599      998      119
2     2    348      58.10  599      998      204
3     3    257      42.92  599      998      131
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ
-     -    243      40.53  599      998      880
0     0    179      29.97  599      998      190
1     1    187      31.21  599      998      258
2     2    301      50.21  599      998      282
3     3    304      50.74  599      998      150
^C

thowe-switzerland commented 2 years ago

Addition. This happens with:

BIOS
Vendor | coreboot
Version | v4.14.0.6
Release Date | 11/04/2021

bsdice commented 2 years ago

I have the following APU4D4 for tests:

DMI: PC Engines apu4/apu4, BIOS v4.15.0.2 12/27/2021
Linux version 5.4.177-1-lts54-custom (linux-lts54-custom@archlinux) (gcc version 11.1.0 (GCC)) #1 SMP Tue, 08 Feb 2022 13:32:57 +0000
Command line: BOOT_IMAGE=/boot/vmlinuz-linux-lts54-custom root=UUID=fd486b41-3f47-49de-a1ea-b308fd4e14ec rw console=ttyS0,115200n8 audit=0 iomem=relaxed amd_iommu=off spec_store_bypass_disable=prctl spectre_v2user=prctl

and in the following configuration:

[Photo: PXL_20211104_092430499]

stress-ng: info:  [27664] setting to a 30 second run per stressor
stress-ng: info:  [27664] dispatching hogs: 4 cpu
stress-ng: info:  [27664] successful run completed in 30.06s
stress-ng: info:  [27664] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per
stress-ng: info:  [27664]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)
stress-ng: info:  [27664] cpu                1888     30.03    118.84      0.23        62.87          15.86        99.13

After ~7 days of uptime with light load, Bzy_MHz goes to ~945 MHz and a stress-ng run for 120 seconds delivers 7620 bogo ops, which is around 4x the 30-second result above (1888 bogo ops). Here is the output of lm-sensors:

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +60.0°C  (high = +70.0°C)
                       (crit = +105.0°C, hyst = +104.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +60.0°C  (crit = +115.0°C)

fam15h_power-pci-00c4
Adapter: PCI adapter
power1:       12.62 W  (interval =   0.01 s, crit =   6.00 W)

I don't seem to lose the 600/800/1000 MHz steps.
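As a quick sanity check of the "around 4x" comparison, the two runs can be normalised to bogo ops per second. A tiny Python sketch using only the numbers quoted above; the 120 s figure is the nominal timeout, since the measured real time of the long run is not shown, so treat the second rate as approximate.

```python
def bogo_ops_per_second(bogo_ops: int, real_time_s: float) -> float:
    """Throughput of a stress-ng run in bogo ops per second of wall time."""
    return bogo_ops / real_time_s


short_run = bogo_ops_per_second(1888, 30.03)  # 30 s run, measured real time
long_run = bogo_ops_per_second(7620, 120.0)   # 120 s run, nominal duration (assumed)

print(round(short_run, 2))    # 62.87
print(round(long_run, 2))     # 63.5
print(round(7620 / 1888, 2))  # 4.04
```

Both runs land at roughly 63 bogo ops/s, so throughput after a week of uptime is the same as on the fresh run, which fits the board not having dropped to the 600 MHz step.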

thowe-switzerland commented 2 years ago

I also can't say exactly when the problem will occur yet. I have not been able to create a synthetic setup where the problem also occurs.

But sooner or later the problem will occur on a production OPNsense (based on FreeBSD) with Powerd enabled on Highadaptive. The problem does not occur when Powerd is disabled or Maximum is selected as mode.

miczyg1 commented 2 years ago

I assume Maximum mode means: always run at 1000 MHz (highest CPU P-state), right?

thowe-switzerland commented 2 years ago

Yes. I would expect that. And it obviously does.