bcm2835-power: Timeout waiting for grafx power OK

lategoodbye commented 5 years ago

Describe the bug Starting with Linux 5.1 there is a new power driver for BCM2835. The idea behind it is to have better control over the V3D power domain. After the rollout I was informed that some RPI boards (currently a handful) have issues enabling the V3D power domain. The ramp-up runs into a timeout (20 us) because we never get a PM_POWOK. I don't have a clue what causes this issue (timing, hardware tolerance, ...). Currently I don't have an affected board.
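
As a rough illustration of the failure mode described here, the sketch below models a poll-until-flag loop with a deadline. It is not the driver's code: `POWOK_MASK`, `wait_for_power_ok`, and `FakeClock` are hypothetical names for this example, the mask value is a placeholder, and the 20 us figure is simply the one quoted in this report.

```python
from typing import Callable

POWOK_MASK = 0x2      # placeholder mask for this sketch, not taken from the driver
TIMEOUT_NS = 20_000   # 20 us, the figure quoted in the report


def wait_for_power_ok(read_reg: Callable[[], int],
                      now_ns: Callable[[], int],
                      mask: int = POWOK_MASK,
                      timeout_ns: int = TIMEOUT_NS) -> bool:
    """Poll read_reg() until (value & mask) is set or timeout_ns has elapsed."""
    start = now_ns()
    while True:
        if read_reg() & mask:
            return True           # flag came up in time
        if now_ns() - start >= timeout_ns:
            return False          # corresponds to the "Timeout waiting for ..." case


class FakeClock:
    """Deterministic clock that advances by step_ns on every call."""

    def __init__(self, step_ns: int = 1_000) -> None:
        self.t = 0
        self.step_ns = step_ns

    def __call__(self) -> int:
        self.t += self.step_ns
        return self.t
```

With a register that never sets the flag, `wait_for_power_ok(lambda: 0, FakeClock())` gives up once the simulated 20 us have passed, which is the situation the affected boards appear to be in.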

To reproduce start the RPI with Mainline Kernel 5.1

Expected behaviour bcm2835-power succeeds in enabling the V3D power domain

Actual behaviour bcm2835-power fails to enable the V3D power domain because PM_POWOK stays off

System

Which models of Raspberry Pi? RPI 2, RPI 3B and RPI 3B+
Which firmware version (vcgencmd version)? 2019-02-12, 2019-03-27
Which kernel version (uname -a)? Mainline Kernel / DTB 5.1

Logs

[   13.913771] bcm2835-power bcm2835-power: Timeout waiting for grafx power OK
[   13.918555] bcm2835-power bcm2835-power: Timeout waiting for grafx power OK
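
For anyone checking whether their board is affected, here is a minimal sketch that scans `dmesg` text for the message above. The function name `find_power_timeouts` and the regular expression are assumptions written for this example, based only on the two log lines shown here.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# [   13.913771] bcm2835-power bcm2835-power: Timeout waiting for grafx power OK
_TIMEOUT_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+bcm2835-power\s+\S+:\s+"
    r"Timeout waiting for (\S+) power OK\s*$"
)


def find_power_timeouts(dmesg_text: str) -> List[Tuple[float, str]]:
    """Return (timestamp_seconds, domain_name) for each timeout line found."""
    hits = []
    for line in dmesg_text.splitlines():
        match = _TIMEOUT_RE.match(line)
        if match:
            hits.append((float(match.group(1)), match.group(2)))
    return hits
```

The text could come from something like `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; an empty result only means the message was not found in the buffer that was read.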

More info: https://github.com/anholt/linux/issues/153

warpme commented 5 years ago

Just another data-point: I built https://github.com/raspberrypi/linux/tree/rpi-5.2.y and I'm getting bcm2835-power: Timeout waiting for grafx power OK on my rpi2-b.

redchenjs commented 5 years ago

model:

RPI 3B

firmware version:

2019-07-15 17:34

kernel version:

5.2.2-1-ARCH #1 SMP Sun Jul 21 19:53:44 UTC 2019 aarch64 GNU/Linux

kernel logs:

[    6.514813] bcm2835-power bcm2835-power: Timeout waiting for grafx power OK
[    6.524622] bcm2835-power bcm2835-power: Timeout waiting for grafx power OK

The VC4 driver was loaded but no GPU hardware was detected.

xnorbt commented 5 years ago

I'm getting the same issue with RPi 3B+, Arch Linux aarch64, Kernel 5.2.10-1-ARCH. No GPU hardware detected and dmesg shows bcm2835-power: Timeout waiting for grafx power OK.

However, I have several Pi 3B+ and it is NOT happening on all of them (using the same SD card with the same image). Some of them detect the VC4 GPU during boot just fine.

And with the other boards, it appears to be temperature related. When the board is at room temperature (having been unpowered for some time) the GPU is detected normally, even over a couple of reboots. But after some minutes, when the temperature rises above about 50 °C, the GPU is no longer detected on reboot and the bcm2835-power log message appears.

Maybe that additional piece of information helps tracking down the issue.
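
To correlate failures with temperature, a small sketch like the following can record the SoC temperature just before a reboot. It assumes the temperature is exposed in millidegrees Celsius at `/sys/class/thermal/thermal_zone0/temp` (the zone index may differ per system); the helper names and the 50 °C default are illustrative only, since the threshold seems to vary from board to board.

```python
from pathlib import Path

# Assumed sysfs location; the zone index is a guess and may differ.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millicelsius(raw: str) -> float:
    """Convert a sysfs reading such as '54768\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_soc_temp_c(path: Path = THERMAL_PATH) -> float:
    """Read the current temperature in degrees Celsius from sysfs."""
    return parse_millicelsius(path.read_text())


def is_warm(temp_c: float, threshold_c: float = 50.0) -> bool:
    """True when the reading is above the (board-specific) threshold."""
    return temp_c > threshold_c
```

Logging `read_soc_temp_c()` before each `reboot` and noting whether the timeout appears afterwards would give a per-board picture of where the cut between good and bad lies.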

lategoodbye commented 5 years ago

Thanks for your report. I built the Mainline kernel 5.3-rc6 with multi_v7_defconfig (Raspbian rootfs) for my RPI 3B+. Then I generated enough load to reach ~ 54 °C (no cpufreq enabled) and triggered a reboot. "Unfortunately" I wasn't able to reproduce the timeout.

xnorbt commented 5 years ago

Thanks for looking into it. I have seven Pi3B+ boards and I am currently testing them all under the same conditions to see how many of them are affected (so far 2 out of 4 fail when warm, fully reproducible; the others never fail). Maybe some chips are more 'sensitive' to the power-up ramp than others. Could changing the current ramp (lower initial, lower step size, more time between steps) help? I'd try playing around with bcm2835-power.c but I have no experience integrating a custom kernel for the RPi and don't know if it is as simple as 'replace the ARCH kernel with the selfmade one'. Let me know if I can do any useful tests with the affected boards.

xnorbt commented 5 years ago

One update: I started building the (mainline) kernel using your defconfig (arm64/configs/defconfig). I interrupted it when I realized that it is going to take some time... I'll do it at home overnight ;-). But: I started the compilation on one of the boards which were not affected. Then, during the compile, the temperature rose to 65°C, and I rebooted -> no VC4 and the bcm2835-power timeout occurred. After cooling down back to 50°C the GPU was again recognized normally during several reboots.

So it is definitely a matter of temperature, but the cut between good and bad varies from device to device. Maybe you can stress your board to higher temperatures and see if the timeout appears as well.

vianpl commented 5 years ago

FYI I'm seeing the timeouts on my RPI3b+ with 5.3.0-rc4. Can't really say whether it's temperature related as it always fails. I can run some debugging if needed.

lategoodbye commented 5 years ago

After enabling the Mainline cpufreq driver I'm seeing the timeouts, too.

vianpl commented 5 years ago

IIRC The main functional difference between the downstream cpufreq driver and upstream is that we're disabling turbo mode when changing the clocks.

What about no cpufreq and setting arm's clock @ 1.2GHz in config.txt?

lategoodbye commented 5 years ago

I don't think there is an issue with the cpufreq driver. Since my default governor is ondemand, this causes much more CPU stress during boot.

I will try to test your suggestion.

lategoodbye commented 5 years ago

My test results:
arm_freq=1200, no cpufreq => no timeout
force_turbo=1, no cpufreq => timeout

lategoodbye commented 5 years ago

@popcornmix Any idea how to analyze this further? Without documentation I don't have a clue what's going on in the new bcm2835 pm driver.

lategoodbye commented 5 years ago

I made a register dump of the PM addresses for the following cases:
1) Linux 5.3 without e1dc2b2e1bef7237fd8fc055fe1ec2a6ff001f91 (this should be similar to pre Linux 5.1)
2) Linux 5.3 with e1dc2b2e1bef7237fd8fc055fe1ec2a6ff001f91 (this should be similar to Linux 5.1 or newer), captured without the timeout occurring

Comparing both dumps showed only 1 difference:
1) PM_RSTS (Addr 0x3F100020) = 0x00001000
2) PM_RSTS (Addr 0x3F100020) = 0x00000000

Note: without e1dc2b2e1bef7237fd8fc055fe1ec2a6ff001f91 and with force_turbo enabled I'm not able to reproduce the timeout

@anholt Is this expected?

popcornmix commented 5 years ago

@lategoodbye the difference in PM_RSTS registers is just:

Bit 12 | HADPOR | Had a power-on reset

so I guess first was captured after a power cycle, and second after a sudo reboot
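
A quick way to check this against the two dumped values is sketched below. The bit position comes from the table row above; the helper names `had_power_on_reset` and `differing_bits` are made up for this example.

```python
from typing import List

HADPOR_BIT = 12  # "Had a power-on reset", per the row quoted above


def had_power_on_reset(pm_rsts: int) -> bool:
    """True when the HADPOR bit is set in a PM_RSTS value."""
    return bool((pm_rsts >> HADPOR_BIT) & 1)


def differing_bits(a: int, b: int) -> List[int]:
    """Bit positions (0..31) at which two 32-bit register values differ."""
    diff = a ^ b
    return [bit for bit in range(32) if diff & (1 << bit)]
```

Applied to the dump, `differing_bits(0x00001000, 0x00000000)` yields only bit 12, which matches the explanation that one capture followed a power cycle and the other a reboot.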

lategoodbye commented 5 years ago

Okay, thanks. So the difference is unrelated.

I will wait for suggestions to narrow down this issue until the release of Linux 5.4-rc1; after that I will revert e1dc2b2e1bef7237fd8fc055fe1ec2a6ff001f91 in line with the no-regression policy.

satmandu commented 4 years ago

For what it is worth, I am seeing this error pop up multiple times with 5.3.0 on a 3b+ running arm64/ubuntu using a mainline kernel from here: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.3/
with config.txt using this dtb: device_tree=bcm2837-rpi-3-b-plus.dtb
dmesg: https://paste.ubuntu.com/p/sKT7KyJdSc/

I'm noticing that a warm reboot using sudo reboot fails (or is very very very delayed), but power cycling allows the device to come up just fine.

(My setup is currently headless, so I'm not seeing what comes up on the screen when this situation arises.)

It seems this might be connected? (Or I can open another issue if it seems unconnected.)

pelwell commented 4 years ago

The error message is the same, and the fact that upstream code shows the same issue is a useful data point.

By the way, you should be able to replace device_tree=bcm2837-rpi-3-b-plus.dtb with the more general upstream_kernel=1.

lategoodbye commented 4 years ago

Yesterday I tested the revert against current Mainline Linux 5.4 + Raspbian Buster with a Raspberry Pi 3 B+. Unfortunately X hangs completely during boot, so I asked Florian to drop this patch :-(

redchenjs commented 4 years ago

Add these lines to the dts file, compile it, replace the dtb with the newly compiled one, then the gpu will start working.

&v3d {
    power-domains = <&power RPI_POWER_DOMAIN_V3D>;
};

bcm2837-rpi-3-b.dtb.zip

lategoodbye commented 4 years ago

Add these lines to the dts file, compile it, replace the dtb with the newly compiled one, then the gpu will start working.
&v3d {
  power-domains = <&power RPI_POWER_DOMAIN_V3D>;
};
bcm2837-rpi-3-b.dtb.zip

This was the reason behind the revert. But the revert causes a hang during boot of Raspbian, so I decided to drop the revert.

https://patchwork.kernel.org/patch/11136979/

redchenjs commented 4 years ago

It seems that without these reverts, the GPU will also work, so maybe these reverts cause the X hang?

diff --git a/arch/arm/boot/dts/bcm283x.dtsi b/arch/arm/boot/dts/bcm283x.dtsi
index 2d191fc..b238567 100644
--- a/arch/arm/boot/dts/bcm283x.dtsi
+++ b/arch/arm/boot/dts/bcm283x.dtsi
@@ -3,7 +3,6 @@ 
 #include <dt-bindings/clock/bcm2835-aux.h>
 #include <dt-bindings/gpio/gpio.h>
 #include <dt-bindings/interrupt-controller/irq.h>
-#include <dt-bindings/soc/bcm2835-pm.h>

 /* firmware-provided startup stubs live here, where the secondary CPUs are
  * spinning.
@@ -121,7 +120,7 @@ 
            #interrupt-cells = <2>;
        };

-       pm: watchdog@7e100000 {
+       watchdog@7e100000 {
            compatible = "brcm,bcm2835-pm", "brcm,bcm2835-pm-wdt";
            #power-domain-cells = <1>;
            #reset-cells = <1>;
@@ -641,7 +640,6 @@ 
            compatible = "brcm,bcm2835-v3d";
            reg = <0x7ec00000 0x1000>;
            interrupts = <1 10>;
-           power-domains = <&pm BCM2835_POWER_DOMAIN_GRAFX_V3D>;
        };

        vc4: gpu {

lategoodbye commented 4 years ago

It seems that without these reverts, the GPU will also work, so maybe these reverts cause the X hang?

Devicetree changes usually don't cause hangs; it's more likely a driver issue. With your change you combine the "best" of both power drivers. Unfortunately it's unsafe to handle the same register ranges with two Linux drivers. Currently I only see two options:

Revert most of the BCM2835 power series
Port parts of firmware power driver into the BCM2835 power driver

sankayop commented 4 years ago

Add these lines to the dts file, compile it, replace the dtb with the newly compiled one, then the gpu will start working.
&v3d {
  power-domains = <&power RPI_POWER_DOMAIN_V3D>;
};
bcm2837-rpi-3-b.dtb.zip

This solves it for me. I just replaced the old /boot/dtbs/broadcom/bcm2837-rpi-b.dtb with yours and it worked. Thanks @redchenjs (cfg: raspberry pi 3b + manjaro)

maggu2810 commented 4 years ago

A RPi3B+ of mine has not been used for a while. Yesterday I started with a new project and I setup the RPi.

I used a new SD card and prepared it with Arch Linux ARM AArch64. I ran into the problem reported here.

I turned the RPi off yesterday evening and turned it on this morning. Same problem. As the RPi has been turned off for hours I don't think mine has been too hot on its first power up this morning.

So, if you need another board to get some diagnostic information, I can try to provide.

pelwell commented 4 years ago

The consensus above is that this is caused by an incompatibility in the upstream/mainline 3B+ DTB. Edit the source file as described by @sankayop above and rebuild it (or download the prebuilt version they link to) and try with that.

maggu2810 commented 4 years ago

Thank you for your reply, will give it a try later...

Will the specific change that seems to be applied to all the fedora kernel versions make it upstream? I did not find it here https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/bcm2835-rpi.dtsi

lategoodbye commented 4 years ago

Currently for upstream I only see two "options":
1) revert Eric's complete bcm2835-power series
2) merge the working parts from raspberrypi-power into bcm2835-power

I'm not happy with either of them. @sankayop's patch will enable both power drivers for the same power domain. I consider this a path to hell ...

pelwell commented 4 years ago

@lategoodbye Do you have a preference between 1 and 2? Is there something we can do to help?

lategoodbye commented 4 years ago

Number 1 isn't a real option, because we need this driver for Raspberry Pi 4. Number 2 should be doable for downstream, but for upstream it would more likely result in a merge of both drivers.

The best option would be to ask someone with deeper understanding of BCM2835 why the rampup causes these random timeouts (timing issue, missing requirements, wrong order of power domain handling) and fix the bcm2835 power driver.

menteb commented 4 years ago

Any updates on this?

lategoodbye commented 4 years ago

In the upstream kernel the suggested patch to revert has been applied. The hanging X issue was unrelated. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200402&id=e7b7daeb48e0bf5d8412d77f11069750ee7032bb

CodingKoopa commented 4 years ago

When booting up my Raspberry Pi 2 Model B with Arch Linux ARM, it seems one of two things happens:

1) The system boots up with the proper resolution I have specified in my config.txt. When I run startx to start LXDE, the DE launches, then almost immediately hangs.
- The timeout log message is not present in my system journal.
2) The system boots up with a lower resolution. When I run startx, LXDE functions as expected.
- The timeout log message is present multiple times.
- There seems to be no GPU acceleration present.

In my testing, it does seem that the first occurrence is more likely when the Pi is cooled down, rather than right after rebooting. Most of these are things that have already been pointed out, but I wanted to provide a test case for anyone else having the issue.

Is the issue with X hanging being tracked anywhere?

lategoodbye commented 4 years ago

Is the issue with X hanging being tracked anywhere?

Here is the accepted fix: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20200504&id=b1e7396a1d0e6af6806337fdaaa44098d6b3343c

sixtyfive commented 3 years ago

Seems the Pi 3 A+ has the same problem :-/

raspberrypi / linux

bcm2835-power: Timeout waiting for grafx power OK #3046