raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.
5.18k stars 1.68k forks source link

RPi3 start.elf regression: garbled serial, broken USB in UEFI #1376

Open andreiw opened 4 years ago

andreiw commented 4 years ago

This was seen in UEFI (Tiano Core) build for RPi3 with the latest VPU firmware. Symptoms: garbled UART output and DWC2 USB2 stopped working. Now, for USB the UEFI fw was querying mbox interface for the reported clock, so this could mean the reported clock was wrong.

Tracked https://github.com/pftf/RPi3/issues/15

To reproduce: Grab latest (v1.22) release from https://github.com/pftf/RPi3/releases. It should boot with working USB and non-garbled UART. Now replace the VPU bits (start + fixup) with your latest. See USB not working, see garbled UART.

Could someone explain why we might be seeing this behavior?

timg236 commented 4 years ago

Please can you specify the firmware release where you think the behaviour changed and the last known good release?

This forum thread might be relevant https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=267576&start=50

pbatard commented 4 years ago

Please can you specify the firmware release where you think the behaviour changed and the last known good release?

I've just confirmed that the last known good start.elf was with 9e3c23ce779e8cf44c33d6a25bba249319207f68 and breakage started with 11a76e07ef1b6304a378c4ee3da200fe6facea46 (i.e. Switch to building from common firmware branch).

For reference, here's the output from "good" start.elf:

Raspberry Pi Bootcode
Read File: config.txt, 93
Read File: start.elf, 2884708 (bytes)
Read File: fixup.dat, 6746 (bytes)
MESS:00:00:01.205420:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:01.209630:0: brfs: File read: 93 bytes
MESS:00:00:01.264396:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:01.491377:0: gpioman: gpioman_get_pin_num: pin DISPLAY_DSI_PORT not defined
MESS:00:00:01.499626:0: *** Restart logging
MESS:00:00:01.502114:0: brfs: File read: 93 bytes
MESS:00:00:01.522913:0: hdmi: HDMI:hdmi_get_state is deprecated, use hdmi_get_display_state instead
MESS:00:00:01.540564:0: brfs: File read: /mfs/sd/bcm2710-rpi-3-b-plus.dtb
MESS:00:00:01.545668:0: Loading 'bcm2710-rpi-3-b-plus.dtb' to 0x100 size 0x69ef
MESS:00:00:01.775325:0: brfs: File read: 27119 bytes
MESS:00:00:01.780823:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:01.786133:0: brfs: File read: 93 bytes
MESS:00:00:01.789303:0: Failed to open command line file 'cmdline.txt'
MESS:00:00:03.777754:0: gpioman: gpioman_get_pin_num: pin EMMC_ENABLE not defined
MESS:00:00:04.082551:0: brfs: File read: /mfs/sd/RPI_EFI.fd
MESS:00:00:04.086429:0: Loading 'RPI_EFI.fd' to 0x0 size 0x200000
MESS:00:00:04.092259:0: No compatible kernel found
MESS:00:00:04.096755:0: Device tree loaded to 0x2eff9200 (size 0x6df2)
MESS:00:00:04.104680:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:04.110968:0: uart: Baud rate change done...
MESS:00:00:04.114401:0: uart: Baud rateNOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v2.3():v2.3
NOTICE:  BL1: Built : 10:40:46, Apr 21 2020
NOTICE:  rpi3: Detected: Raspberry Pi 3 Model B+ (1GB, Sony, UK) [0x00a020d3]
NOTICE:  BL1: Booting BL2
NOTICE:  BL2: v2.3():v2.3
NOTICE:  BL2: Built : 10:40:42, Apr 21 2020
NOTICE:  BL1: Booting BL31
NOTICE:  BL31: v2.3():v2.3
NOTICE:  BL31: Built : 10:40:44, Apr 21 2020
UEFI firmware (version UEFI Firmware v1.22 built at 13:57:51 on Apr 21 2020)

And here's the one from "bad" start.elf:

Raspberry Pi Bootcode
Read File: config.txt, 93
Read File: start.elf, 2999712 (bytes)
Read File: fixup.dat, 6746 (bytes)
NOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v2.3():v2.3
NOTICE:  BL1: Built : 10:40:46, Apr 21 2020
NOTICE:  rpi3: Detected: Raspberry Pi 3 Model B+ (1GB, Sony, UK) [0x00a020d3]
NOTICE:  BL1: Booting BL2
NOTICE:  BL2: v2.3():v2.3
NOTICE:  BL2: Built : 10:40:42, Apr 21 2020
NOTICE:  BL1: Booting BL31
NOTICE:  BL31: v2.3():v2.3
NOTICE:  BL31: Built : 10:40:44, Apr 21 2020
k▒▒Ar▒[[▒I▒IxVu]P+J▒ew+es▒X▒H▒Sv
▒MRbQ▒Y▒m▒mf▒M▒MUVVQ▒X▒BVAT▒i{V▒X`yUYv%▒G+%%]w▒▒▒֖▒▒▒M▒MU^vQ▒▒Aj |▒Af▒w▒A2UAwiuew▒3▒Es5D+gP▒i6▒5D▒gbK]▒8▒gP▒▒u-A:▒v+eP▒KP▒uiPi]q*eP)▒iV▒

One thing that's immediately noticeable is that the newer start.elf seems to be a lot less verbose when using uart_2ndstage=1 in config.txt.

And the other thing I will point out is that whereas TF-A (the NOTICE messages are all from TF-A) does not perform serial baud init and therefore expects the baudrate to have been set to 115200 beforehand, we do use the 12.12 fixed point VPU clock divisor to set up our baudrate, so that we are able to use something else than 115200 is neded, as per https://github.com/tianocore/edk2-platforms/blob/master/Platform/RaspberryPi/Library/DualSerialPortLib/DualSerialPortLib.c#L143-L176.

However, it's not the baudrate change that's our main concern here (if the formula to be used to set up the baudrate has changed, we we can always update our code), but rather the complete loss of USB keyboard input, which is a lot more problematic as a regression...

andreiw commented 4 years ago

@pbatard in both of these, were we using the miniUART or PL011?

pbatard commented 4 years ago

miniUART.

andreiw commented 4 years ago

As far as USB:

A theory I have, which perhaps may help with triaging, is that Raspbian/Linux doesn't really use the mailbox interface, and relies on DT. Perhaps what happened here is that mbox interface code didn't get updated to return the correct clock?

E.g. (from a decompiled pi3 blob):

` usb@7e980000 {
compatible = "brcm,bcm2708-usb";
...
clocks = <0x16>;
clock-names = "otg";
... }

clocks { ... clock@4 {
compatible = "fixed-clock";
...
clock-output-names = "otg";
clock-frequency = <0x1c9c3800>;
...
};
... } `

pbatard commented 4 years ago

And now that we have found the "updated" baudrate (135000) to read our serial output, we can confirm that mbox's Get clocks from recent start.elf appears to return 0 for the core clock frequency (clock id 0x04), which of course creates all kind of issues...

timg236 commented 4 years ago

Do you get different behaviour if you use the GET_CLOCK_MEASURED (0x30047) or GET_CLOCK_RATE (0x30002) mailbox APIs

The Linux clk-raspberrypi.c and bcm2835-cpufreq.c both use GET_CLOCK_RATE

N.B. "Get clocks" 0x00010007 just enumerates the list of clocks but does not return the frequency.

pbatard commented 4 years ago

We are using 0x00030002/GET_CLOCK_RATE. Sorry if I was unclear about that.

I'll see if I can try GET_CLOCK_MEASURED but it does look to us like, for some reason, GET_CLOCK_RATE returns 0 for the core clock when we query it, whereas previous versions of start.elf returned the expected value.

timg236 commented 4 years ago

GET_CLOCK_RATE returns cached values, it may be that in the common-firmware these values haven't been populated and a Linux MBOX call is causing that to happen. force_turbo=1 in config.txt might change things

GET_CLOCK_MEASURED should returns something because it's using the test counters to measure the clock. It's not necessarily the one to use because it varies slightly but it will at least test the mailbox interface.

pbatard commented 4 years ago

Tested force_turbo=1 but doesn't change anything. I'll test GET_CLOCK_MEASURED, but it may be a little while before I do so.

pbatard commented 4 years ago

Same issue when trying to use GET_CLOCK_MEASURED I'm afraid.

This only seems to occur for clock id 0x04, as the same call returns the expected values for other clocks, as per the following added debug output:

RpiFirmwareGetClockRate: Clock Rate for 0x30004:0x3 = 1400000000
RpiFirmwareGetClockRate: Clock Rate for 0x30047:0x3 = 1400148000
RpiFirmwareGetClockRate: Clock Rate for 0x30047:0x4 = 0

For the record, this is the code we are using to query, with the mbox constants used by that file defined here.

timg236 commented 4 years ago

@popcornmix Does this ring a bell? I'll try to spend some time debugging this next week if it's not an obvious fix.

pbatard commented 4 years ago

Actually, one thing that I am noticing as I am running more tests is that we actually got a mailbox timeout before we poked the clock (that SetPowerState error also explains why we would lose USB):

MailboxTransaction: timeout waiting for inbox to become full
RpiFirmwareSetPowerState: mailbox transaction error: Status == Time out, Response == 0x0
Couldn't power on USB: Device Error

And if I play with the level of debug output (which I guess would have an impact on mailbox timeouts, though in this case I was reducing the amount of messages), I can see Clock Rate for 0x30047:0x4 getting populated, though I am now getting timeouts elsewhere. One explanation I see is that we we may be reading mailbox replies from previous transactions, even though we always make sure to drain the mailbox and wait for status clear before we initiate a new transaction.

I'm going to see if increasing the timeout delay in our code helps. Is there any chance that mailbox transactions have gotten slower after the switch to the common branch?

pbatard commented 4 years ago

Okay, it seems that increasing our mbox timeout does help, and makes both the MailboxTransaction: timeout errors go away as well as ensures that the clock values we read are valid.

And with the SetPowerState transaction no longer failing, our USB works again. But I am still getting a ~135000 baudrate for serial output with the newer start.elf rather than 115200. I guess I'm going to try to instrument our baudrate to see how the values have changed...

Obviously, we're going to have a closer look at what Linux or other implementations do in term of mbox timeouts, but if you have some data on what a "good" value would be for waiting on the mailbox 'input register empty' bit to clear, that could help.

popcornmix commented 4 years ago

I think linux kernel has quite a slow timeout - perhaps a second.

It won't ever be that long, but there real time guarantees for mailbox to respond. I'd say typically it will be below a millisecond, but may be longer if you catch it when we are switching frequency/voltage or hdmi mode, for example.

dp111 commented 4 years ago

My understanding is that there are two mailbox status registers one for each direction. There is some code on the internet that only polls one of the status registers.

This header file for instance only defines one status register :

https://github.com/ms-iot/RPi-UEFI/blob/ms-iot/Pi2BoardPkg/Include/Bcm2836Mailbox.h

On Fri, 24 Apr 2020 at 16:47, popcornmix notifications@github.com wrote:

I think linux kernel has quite a slow timeout - perhaps a second.

It won't ever be that long, but there real time guarantees for mailbox to respond. I'd say typically it will be below a millisecond, but may be longer if you catch it when we are switching frequency/voltage or hdmi mode, for example.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/raspberrypi/firmware/issues/1376#issuecomment-619091674, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEVVFIRR3HTZMLR7FASLWXTROGYCFANCNFSM4MNW35CQ .

pbatard commented 4 years ago

Yes, I think we were probably playing it too close with our mbox timeouts, though what I'm seeing from the returned values is that the latest start.elf (when using enable_uart and without force_turbo) appears to have increased the default CORE clock from 250 MHz to 400 MHz, so I would expect mbox replies to come faster not slower.

For reference, here are my mbox clock frequency results for a Pi 3 B+ with enable_uart=1 and without any core_freq:

Old start.elf:

RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x1 = 200000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x2 = 48000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x3 = 600000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x4 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x5 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x6 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x7 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x8 = 400000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x9 = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xA = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xB = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xC = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xD = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xE = 0

New start.elf:

RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x1 = 200000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x2 = 48000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x3 = 600000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x4 = 400000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x5 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x6 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x7 = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x8 = 400000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0x9 = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xA = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xB = 250000000
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xC = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xD = 0
RpiFirmwareGetClockRate: Clock Rate for 0x30002:0xE = 0

So the values of clocks 0x4 (CORE) and 0xB (HEVC) have changed, which I guess is a good thing for HEVC that was missing, but, as far as CORE is concerned, would seem to make https://www.raspberrypi.org/documentation/configuration/uart.md incorrect when it states that:

To enable it, add enable_uart=1 to config.txt. This will also fix the core frequency to 250MHz

I can also confirm that adding core_freq=250 to config.txt "fixes" the baudrate issue, which makes sense since, even as we are using VPU_CLOCK_DIVISOR, we apply that on a hardcoded value based on a 250MHz base clock.

Is 400 MHz supposed to be the new CORE value with enable_uart=1 then? Or is there something else that we might be missing?

timg236 commented 4 years ago

On 2711 + common firmware enable_uart now fixes frequency at the higher default frequency but this can be overriden with core_freq=250 core_freq_min=250

pbatard commented 4 years ago

Okay. We can definitely apply core_freq as a workaround, since 250 MHz is what we were relying on before.

But you may want to update the UART documentation to clarify that enable_uart=1 alone does not always result in a 250 MHz core freq, as I suspect other folks might get bitten by this new behaviour.

stintel commented 4 years ago

After upgrading my Raspberry Pi Zero W with an RS485 HAT to the latest OpenWrt snapshot from the master branch, I experienced difficulties communicating with the SDM630 modbus power meter connected to that RS485 HAT. When debugging the problem on another device, the serial console output is garbled as well, and input is almost completely broken.

It seems that having bluetooth enabled also influences the core clock, as the problem goes away with either of the following 2 lines in config.txt:

core_freq=250
dtoverlay=disable-bt

I verified that this problem does not happen on https://github.com/raspberrypi/firmware/commit/9e3c23ce779e8cf44c33d6a25bba249319207f68 and started with https://github.com/raspberrypi/firmware/commit/11a76e07ef1b6304a378c4ee3da200fe6facea46.