raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.14k stars 4.99k forks source link

Aux SPI (SPI1) + Console on Aux UART (ttyS0) = Lockup #1484

Closed frasersdev closed 7 years ago

frasersdev commented 8 years ago

I've come across an issue that causes raspbian to boot up very slowly then lockup at the login prompt. The issue is occurring with the latest kernel (4.4.9-v7+ #884)

How to reproduce:

The lockup occurs when the following 3 items are configured

As you can see the PI3 is particularly vulnerable as 2 of the 3 conditions are already set so all it takes is activating SPI1 and you get a lockup.

Workarounds:

On a PI3, stopping the console being redirected onto the Aux UART will solve it. There are a few ways to do this depending on what other hardware you need.

On a PI2 or earlier - don't direct the console to the Aux UART if you are using SPI1!

Symptoms and Errors:

When a PI is configured as per above and restarted It boots very slowly. This seems to start when the console is enabled on ttyS0, e.g. kernel: [ 1.083247] console [ttyS0] enabled

Throughout the boot process you might see an error like kernel: [ 49.972018] serial8250: too much work for irq59

The boot process is now a lot longer. e.g. 120secs vs 15 secs for a PI2, 90sec vs 12secs for a PI3.

The console speeds up briefly just before the login prompt is displayed but then locks up hard. Only recovery option is to mount the SD in another machine and reconfigure cmdline.txt / config.txt as per the workarounds.

Root Cause:

Unsure I this stage. The Aux peripherals share a clock so it seems probable that that is the root cause: The slow boot could be caused by inadvertently configuring the aux clock incorrectly. The lockup could be caused by the clock being reconfigured whilst enabled.

The uart baud rate seems to be set at least twice during the boot up (once by the kernel for the console and once by getty?) does this cause a clock reconfiguration? somehow having SPI1 active buggers this up (is it setting up the clock independently?)

Unsure of where to look - I'm not even sure which register is the aux clock even... https://github.com/raspberrypi/linux/blob/rpi-4.4.y/drivers/clk/bcm/clk-bcm2835.c

pelwell commented 8 years ago

Thanks for the thorough report. You are correct that the Aux devices share a clock, but that clock is the core clock and is only controlled by the VPU. I think it's more likely that there is a problem with the interrupt handling, since they also share an interrupt line and require an extra level of interrupt decoding.

stevenj commented 8 years ago

If i am reading the kernel source correctly, then the AUX Uart must be being processed by: linux/drivers/tty/serial/8250/8250_core.c

Which seems to be a generic 8250 driver, and not specific to the quirks of the bcm aux serial port.

The error "kernel: [ 49.972018] serial8250: too much work for irq59" happens when the Uart IRQ spins processing UART IRQ's , waiting for the IRQ to go away after servicing the uart. It doesnt have any knowledge in there that this is in fact an AUX serial port and so has no knowledge that the SPI can also cause interrupts. SO you could presumably get into a situation that its stuck in there waiting for Interrupts to go away, but they never do, because the SPI cant process them, until it times out with this error.

Assuming this is the driver being used for the aux uart, then It would seem naive to use this driver for the aux uart, because its not an 8250, just smells a bit like one, and clearly doesnt behave like an 8250 with regard to its interrupts. I think pelwell is spot on with his belief that its interrupt handling causing the problem.

Looks like the aux uart might need its own specific driver that know these things. Benefit, it would be simpler than the 8250 generic driver simply because of its reduced functionality.

A second question this raises of course is why the SPI causes ANY interrupts to be generated at all, because it shouldn’t generate any in the absence of SPI transfers..... Just enabling the SPI port shouldn't do that.

rewolff commented 8 years ago

Agreed, my reading too suggests that the SPI fires off an interrupt. One option is that the SPI driver enables the interrupt, but before claiming the IRQ vector. IIRC, Linux will handle multiple devices on one interrupt vector correctly. Something that was left over from when X86 systems only had sixteen of them.

The only way that sharing of IRQs can work is that the IRQ routine returns as fast as possible when it's device was not responsible for the interrupt. This is something that the 8250 driver MUST be capable of, because it dates from the time when COM1, COM4 were both IRQ3 and COM2, COM3 where both IRQ4. Or something like that.

This would mean that the bug lies in the SPI driver that has not activated its interrupt routine yet, when the module fires off an interrupt.

Checking out the SPI driver... This would happen when that first interrupt does not satisfy: if (bcm2835_rd(bs, BCM2835_SPI_CS) & BCM2835_SPI_CS_DONE) { Then the hardware is not reset. On the other hand, during initialization, the CS register is intiialized to "no interrupts". So why clearing the interrupt enable flags would help... beats me.

stevenj commented 8 years ago

If you look at the code for the standard UART driver, you will see this is precisely what it DOES NOT DO. It spin loops, while ever there is an IRQ present, eventually times out, and generates "kernel: [ 49.972018] serial8250: too much work for irq59" which is the UART driver saying, hey, I spun for a long time and the IRQ never went away. The reason why its spinning is conjecture at this point, and it seems to be related to the SPI, BUT that is the reported ERROR from the kernel side.

I agree that the driver should be checking its IRQ status and returning as FAST as possible, for any device in the AUX device block, that is as easy as checking a single bit in AUXIRQ Register (0x7E21 5000), but i cant find anywhere that does that.

I think the problem is that the UART drivers inside the Linux Kernel seem to believe that they are NOT SHARED with anything but a UART. And in this case thats not true.

The UART IRQ handler spins, looking for all possible uarts on the same IRQ and process them until the IRQ goes away. This is because it assumes the UART is on a ISA bus which only has a edge triggered interrupt, and if you return without restoring the state of the IRQ to not asserted, you will not get any more IRQ's and your driver will stall. Again, this is a quirk of ISA bus and has nothing to do with the BCM8253 Uart implementation, and this is precisely where we seem to be getting hung up, based on the messages being reported by the Kernel.

The belief the SPI might be causing an IRQ and not handling it properly is beside the point, the UART driver should not get hung up like this, regardless of what is going on in the SPI driver side. But i also read the SPI code, which specifically clears out all pending SPI interrupts before enabling the IRQ, so im not 100% convinced its that which is triggering this fault in the UART driver. But even if it is, the UART driver should not hang up like this, because even if we clear this potentially spurious IRQ at initialisation, any use of SPI1 in conjunction with UART1 will possibly trigger this fault in the UART driver.

From the Broadcom BCM2835 Peripherals Manual:

The implemented UART is not a 16650 compatible UART However as far as possible the first 8 control and status registers are laid out like a 16550 UART.

And so, we see this in the code, which enables the module clocks, because the UART driver doesn’t know anything about that: https://github.com/raspberrypi/linux/commit/a72021ae99021409319d552dad744572b20febfb

/* Use this hack until a proper solution is agreed upon */ static void __init bcm2835_init_uart1(void)

We also have this comment from the Broadcom BCM2835 Peripherals Manual:

GPIO pins should be set up first the before enabling the UART. The UART core is build to emulate 16550 behaviour. So when it is enabled any data at the inputs will immediately be received . If the UART1_RX line is low (because the GPIO pins have not been set-up yet) that will be seen as a start bit and the UART will start receiving 0x00-characters.

Valid stops bits are not required for the UART. (See also Implementation details). Hence any bit status is acceptable as stop bit and is only used so there is clean timing start for the next bit.

Looking after a reset: the baudrate will be zero and the system clock will be 250 MHz. So only 2.5 μseconds suffice to fill the receive FIFO. The result will be that the FIFO is full and overflowing in no time flat.

Which kind of looks more like the problem being experienced. A UART generating millions of IRQ's which the system can not process. Would also explain the really big slow down also reported. Because, even if the SPI caused the interrupt, the UART would eventually time out, the SPI should then get a chance to service, it should clear the source of the interrupt and then we should be back to normal, but that does not happen.

What we need is a dump of the full AUX device state (IRQ, CLOCK, UART1, SPI1 & 2) at the time the kernel prints "kernel: [ 49.972018] serial8250: too much work for irq59" and then we could see whats going on exactly.

stevenj commented 8 years ago

Can anyone confirm this on 4.4.13-v7+ ??? I started looking at it and was going to throw some debug code into the kernel to test, but on stock rasbian lite, upgraded to kernel 4.4.13-v7+ I am unable to reproduce the error reported above.

frasersdev commented 8 years ago

Can anyone confirm this on 4.4.13-v7+ ???

A recent update disabled the mini-uart by default on the PI3b so you need to explicitly enable it now. See Phil's post for details: https://www.raspberrypi.org/forums/viewtopic.php?f=28&t=141195

prior to enabling it you'll probably see

        uart@7e215040 {
            reg = <0x7e215040 0x40>;
            interrupts = <0x1 0x1d>;
            pinctrl-0 = <0x13>;
            reg-shift = <0x2>;
            no-loopback-test;
            compatible = "brcm,bcm2835-aux-uart", "ns16550";
            clocks = <0x12>;
            status = "disabled";
            phandle = <0x23>;
            pinctrl-names = "default";
        };

to enable just add 'enable_uart=1' to the config.txt and you'll have slow boot/lockup on a PI3b

meerd commented 7 years ago

Any update on this?

I have a custom designed board based on compute module v1. UART1 is being used on pins 32 and 33 . (as the console) When SPI1 is not enabled, everything is fine. However, when sc16is752-spi1 overlay is enabled, UART1 hangs up immediately.

Just as an information, with kernel 4.4.34 the system locks up. with 4.4.16 the device slows down but keeps working.

pelwell commented 7 years ago

I'm afraid this issue dropped off the radar. We're going to be short-handed until January, but if you ping us in the new year, and if it is as easy to reproduce as it sounds, then we ought to be able to make some progress.

mypiandrew commented 7 years ago

With reference to issue 1635 (effective duplicate of this) has there been any update?

pelwell commented 7 years ago

I've looked for an SC16IS752 module, but not found one yet - acquiring suitable test hardware is a hurdle to understanding and fixing this issue. Can anyone recommend something?

mypiandrew commented 7 years ago

Hi

Forgive me if I'm barking up the wrong tree here, but I was able to lock the system by just enabling two of the aux peripherals (SPI and com port) via overlays at the same time and then trying to use the com port

Your comment here :

https://github.com/raspberrypi/linux/issues/1635#issuecomment-246464205

Suggested this is was shared interrupt problem (?) rather than an using a specific piece of hardware

FWIW eBay (god bless them) have a SC16IS752 module here :

http://www.ebay.co.uk/itm/121292250289

Hope this helps?

pelwell commented 7 years ago

Yes, it probably is a shared interrupt problem, but we still need some (ideally simple) SPI peripheral to be able to tickle it.

That module is a 750 (not a 752), and the simplest test case I've come across involved two 752s wired back to back, but if that is sufficient then I'm sure our budget can stretch to buying one.

mypiandrew commented 7 years ago

What about a CAN-BUS module - that's pretty cheap and easy to obtain

Again, I was able to trigger the problem without having an SPI device plugged in, just the CAN overlay enabled

pelwell commented 7 years ago

Again, I was able to trigger the problem without having an SPI device plugged in, just the CAN overlay enabled

That's useful information, but can you show me where you previously mentioned not needing any hardware?

mypiandrew commented 7 years ago

OK, my bad (too many brain cells killed over christmas!)

https://github.com/raspberrypi/linux/issues/1635

So assuming that's good enough for testing this is a cheap module from ebay UK : http://www.ebay.co.uk/itm/142161216681

meerd commented 7 years ago

Not in the short term, but I can provide you a custom SC16IS752 board if needed. @pelwell

pelwell commented 7 years ago

@mypiandrew Does that need a CAN device to exhibit the problem? I'd prefer a UART because at least I have suitable equipment to hook it to.

@meerd Thank you for the offer. I'll buy some 750 modules and see how I get on.

mypiandrew commented 7 years ago

No didn't need to be hooked up to any CAN equipment, just had the HAT board fitted.

mypiandrew commented 7 years ago

This might be of interest

Running the new CM3 I've had to update from 4.1.21 to a later version, however I forgot to make a change to the config.txt file when updating and got a surprise..

Running this : Linux raspberrypi 4.4.43-v7+ #948 SMP Sun Jan 15 22:20:07 GMT 2017 armv7l GNU/Linux

this cmdline.txt: dwc_otg.lpm_enable=0 console=serial1,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait

and this config.txt :

dtoverlay=mmc dtdebug=on dtoverlay=uart0,txd0_pin=32,rxd0_pin=33 dtoverlay=uart1,txd1_pin=14,rxd1_pin=15 dtparam=i2c1=on dtoverlay=i2c-rtc,ds1307 dtparam=spi=on dtoverlay=mcp2515-can0,oscillator=16000000,interrupt=25 dtoverlay=spi-bcm2835-overlay

The board starts up fine and detects the CAN controller correctly resulting in a CAN0 interface (haven't done a comms test, but this is further than I've seen before)

vcdbg log msg shows :

002584.883: dtparam: spi=on 002681.517: Loaded overlay 'mcp2515-can0' 002681.537: dtparam: oscillator=16000000 002682.313: dtparam: interrupt=25 002759.139: Loaded overlay 'spi-bcm2835-overlay'

I can run both the below without problem

stty -F /dev/ttyS0 -a stty -F /dev/ttyAMA0 -a

But if I use the below overlay (which I assumed was the correct one for this kernel) :

dtoverlay=spi1-1cs

Then the board boots s-l-o-w-l-y , ethernet gets an IP but then you cannot SSH into it

This effect happens with and without a CAN board plugged in

mypiandrew commented 7 years ago

related forum post ? https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=146291

mypiandrew commented 7 years ago

Changing

dtoverlay=spi1-1cs

for

dtoverlay=spi0-hw-cs

Board boots OK and now GPIO7/8 show SPI names rather than just "Output"

root@raspberrypi:~# raspi-gpio get BANK0 (GPIO 0 to 27): GPIO 00: level=1 fsel=0 alt= func=INPUT GPIO 01: level=1 fsel=0 alt= func=INPUT GPIO 02: level=1 fsel=4 alt=0 func=SDA1 GPIO 03: level=1 fsel=4 alt=0 func=SCL1 GPIO 04: level=0 fsel=1 alt= func=OUTPUT GPIO 05: level=0 fsel=1 alt= func=OUTPUT GPIO 06: level=1 fsel=0 alt= func=INPUT GPIO 07: level=1 fsel=4 alt=0 func=SPI0_CE1_N GPIO 08: level=1 fsel=4 alt=0 func=SPI0_CE0_N GPIO 09: level=0 fsel=4 alt=0 func=SPI0_MISO GPIO 10: level=0 fsel=4 alt=0 func=SPI0_MOSI GPIO 11: level=0 fsel=4 alt=0 func=SPI0_SCLK GPIO 12: level=0 fsel=0 alt= func=INPUT GPIO 13: level=0 fsel=0 alt= func=INPUT GPIO 14: level=1 fsel=2 alt=5 func=TXD1 GPIO 15: level=1 fsel=2 alt=5 func=RXD1 GPIO 16: level=0 fsel=0 alt= func=INPUT GPIO 17: level=0 fsel=0 alt= func=INPUT GPIO 18: level=0 fsel=0 alt= func=INPUT GPIO 19: level=0 fsel=0 alt= func=INPUT GPIO 20: level=0 fsel=0 alt= func=INPUT GPIO 21: level=0 fsel=1 alt= func=OUTPUT GPIO 22: level=0 fsel=1 alt= func=OUTPUT GPIO 23: level=0 fsel=0 alt= func=INPUT GPIO 24: level=0 fsel=0 alt= func=INPUT GPIO 25: level=1 fsel=0 alt= func=INPUT GPIO 26: level=0 fsel=0 alt= func=INPUT GPIO 27: level=0 fsel=0 alt= func=INPUT

I note that I am using SPI0 and not SPI1 or SPI2 here, if my issue has been unrelated my apologies please feel free to remove my posts for clarity.

pelwell commented 7 years ago

OK - I can reproduce this with an SC16IS750 module on SPI1. Watch this space.

pelwell commented 7 years ago

The 4.9 kernel has a dedicated driver for UART1 (bcm2835-aux-uart), and I've found that booting into 4.9 works fine where 4.4 would stall and generate kernel warnings about overloaded IRQs when SPI1 is enabled.

We are due to switch to 4.9 very soon as our mainline kernel, but you can try it now by running:

BRANCH=next sudo rpi-update

Normally this isn't something I would advise - you get bleeding edge features and the possibility of bugs - but there are currently no known problems with the 4.9 kernel.

If you have a spare card, I would recommend installing a fresh Raspbian, updating to the next branch as above, then running your test again. Or you could make a backup of your card using one of the disk imaging tools before updating.

@mypiandrew The sdhost driver is now recommended for all Pis, including CMs, so there is no need for "dtoverlay=mmc" any more. Also, spi-bcm2835 is the default driver so "dtoverlay=spi-bcm2835-overlay" is pointless (and never include "-overlay" when you are referring to overlays).

mypiandrew commented 7 years ago

Nice work Phil!

pelwell commented 7 years ago

Note that there even in 4.9 there is a potential problem with the Aux peripherals under load - see #1573. I think the correct solution is to create a new Aux interrupt controller, which should be a case of copying some code from pinctrl-bcm2835.c and either adding it to the aux clock driver (@anholt?) or creating a dedicated driver, having first prised ownership of the IRQ register from the clock driver. The larger battle is likely to be getting it upstreamed.

frasersdev commented 7 years ago

thanks Phil, I'll try out 4.9

anholt commented 7 years ago

We may be able to just put an irq controller driver in the existing aux driver. We'd have to try it to see how the subsystem maintainers reacted to it, though (different subsystems feel differently about whether you can use their subsystem from outside of their directory).

pelwell commented 7 years ago

The current trees now include a modified bcm2835-aux driver that also acts as an interrupt controller, and DT modifications to activate it. I've tested it with the console over ttyS0 running an MFRC reader on SPI1.

JamesH65 commented 7 years ago

ping @pelwell Any thoughts on what to do with this one?

pelwell commented 7 years ago

The current trees now include a modified bcm2835-aux driver that also acts as an interrupt controller, and DT modifications to activate it. I've tested it with the console over ttyS0 running an MFRC reader on SPI1.

This is resolved.

sudheepss commented 7 years ago

Thanks frasersdev. I found the error: bcm2835-aux-uart 3f215040.serial: could not get clk: -517 when I used uart pins 8, 10 for serial debug. Removing console=serial0,115200 from the /boot/cmdline.txt will disallow serial debug with that port. So I just added dtoverlay=pi3-disable-bt to the /boot/config.txt (Disables the bluetooth module and reinstates the PL011 UART as the console) and then the serial debug is active and the lockup error is not there.