4.14: bad mode in data abort handler detected

maxnet commented 6 years ago

On Pi 3+, 4.14 branch, I occasionally get a kernel panic on boot, with a "bad mode in data abort handler detected" message and PC at dwc_otg_fiq_fsm.

Probably some rare timing specific problem, as:

it only happens occasionally.
can no longer reproduce it when I attach a serial console cable and set enable_uart=1
nor when having "quiet" in cmdline.txt

[Screenshots: vlcsnap-2018-03-18-23h37m03s891, vlcsnap-2018-03-18-23h34m43s057]

(Images are a bit blurry as it's moving text, and cannot scroll back after kernel panic.)

P33M commented 6 years ago

Are you using the latest 4.14?

maxnet commented 6 years ago

Are you using the latest 4.14?

I am not using the very latest, but it was the latest when I built the kernel last Wednesday, so recent enough to have your patch. (And yes, it's a custom kernel. But it's just bcm2709_defconfig + aufs patch, not anything that could remotely affect your dwc stuff.)

paul-1 commented 6 years ago

I've seen this in kernels as late as 4.14.26 (as reported at https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=197689&start=25#p1265214). But as soon as I try to reproduce with pause_on_oops=20 in the command line, it becomes elusive.

It is definitely improved since your patch in #2382, but not eliminated.

maxnet commented 6 years ago

But as soon as I try to reproduce with pause_on_oops=20 in the command line, it becomes elusive.

Not so elusive to me. Here's with pause_on_oops with today's latest code (4.14.31 / b36f4e9e198477803d29861e02d3ea00fe5e09ab )

[Photos: 20180330_011018, 20180330_011030]

maxnet commented 6 years ago

In the forum thread jdb mentions:

I note that the USB register offset (xx98xxxx) appears nowhere in the stack or in the register dump.

However -at least in my case- r3 does actually point to the virtual address of the USB register offset.

It is 0xb88b0000, and that does happen to be the address ioremap returns here: https://github.com/raspberrypi/linux/blob/rpi-4.14.y/drivers/usb/host/dwc_otg/dwc_otg_driver.c#L735

On a kernel that has dynamic debugging enabled, I verified that by putting dyndbg="file dwc_otg_driver.c +p" in cmdline.txt. It prints out:

[    4.555697] dwc_otg 3f980000.usb: dwc_otg_driver_probe(b4a74000)
[    4.555708] dwc_otg 3f980000.usb: start=0x3f980000 (len 0xffff)
[    4.555748] dwc_otg 3f980000.usb: base=0xb88b0000
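The comparison between a register value from the crash dump and the mapped USB range can be done mechanically. The following is only a sketch: the function names are made up for illustration, and it assumes that "len 0xffff" in the debug output is the last valid offset of the mapping, which the output itself does not state.

```python
import re

# Sample lines copied from the dyndbg output quoted above.
PROBE_LOG = """\
[    4.555697] dwc_otg 3f980000.usb: dwc_otg_driver_probe(b4a74000)
[    4.555708] dwc_otg 3f980000.usb: start=0x3f980000 (len 0xffff)
[    4.555748] dwc_otg 3f980000.usb: base=0xb88b0000
"""


def parse_probe_log(text):
    """Extract physical start, length and virtual base from the probe lines.

    Returns None when either the start= or the base= line is missing.
    """
    start = re.search(r"start=(0x[0-9a-fA-F]+) \(len (0x[0-9a-fA-F]+)\)", text)
    base = re.search(r"base=(0x[0-9a-fA-F]+)", text)
    if not (start and base):
        return None
    return {
        "phys_start": int(start.group(1), 16),
        "length": int(start.group(2), 16),
        "virt_base": int(base.group(1), 16),
    }


def in_mapping(addr, mapping):
    """True if addr lies inside the virtual range of the mapping.

    Assumption: 'length' is treated as the last valid offset (inclusive).
    """
    return mapping["virt_base"] <= addr <= mapping["virt_base"] + mapping["length"]


if __name__ == "__main__":
    usb = parse_probe_log(PROBE_LOG)
    r3 = 0xB88B0000  # value of r3 reported in this comment
    print("r3 points into the USB register mapping:", in_mapping(r3, usb))
```

With the values from this comment, r3 is exactly the start of the mapped range.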

I think the problem is that you are not supposed to access memory in your FIQ that has been remapped after other kernel threads have started, due to the issue described here: https://www.spinics.net/lists/arm-kernel/msg325250.html

Nilpo commented 6 years ago

I'm using berryboot-20180405-pi2-pi3 on a new Raspberry Pi 3 B+ and it will not boot at all. Tried several known-good cards. It won't even initialize video, which means that a bootable OS isn't found (just a black screen). An older version of BerryBoot booted to the rainbow screen but no further. I tried the same working card with the current version and got nowhere.

maxnet commented 6 years ago

It won't even initialize video

If you don't even get the rainbow screen, that would be a firmware issue, or something with power, partitioning/filesystem or the SD card. In any case, it is not related to the Linux kernel issue described here.

Nilpo commented 6 years ago

The same cards boot fine with Raspbian image. It's specific to the latest BerryBoot release. I'm using the same format utility for both.

maxnet commented 6 years ago

The same cards boot fine with Raspbian image.

Does Raspbian still work if you run rpi-update to update the firmware files to a newer version?

Note that the rainbow screen is drawn by the Raspberry Pi firmware files, not by the Linux kernel, nor by the Berryboot program.

Nilpo commented 6 years ago

I may have spoken too quickly. I'm unable to get the same Raspbian image to boot again. The same problem. The first time it booted must have been a fluke.

Note that the rainbow screen is drawn by the Raspberry Pi firmware files, not by the Linux kernel, nor by the Berryboot program.

Correct, but unless I'm misunderstanding, the firmware will not draw the rainbow screen unless it first recognizes a bootable SD card. (That's why it's one of the official troubleshooting steps.)

Specifically, it indicates "The Raspberry Pi cannot find a valid image on the SD card." This means a bad card, bad image, bad formatting, or in the worst case a bad Pi/SD card reader.

I'm strongly leaning toward a hardware/firmware issue with the Pi itself at this point.

Nilpo commented 6 years ago

Walking away and coming back again allowed the Pi to boot. Heat? There are so many variables with inconsistent boots, but it does appear to be a problem in the Pi itself now.

maxnet commented 6 years ago

Correct, but unless I'm misunderstanding, the firmware will not draw the rainbow screen unless it first recognizes a bootable SD card.

Only to the extent that it must be able to find and boot the bootcode.bin and start.elf firmware files located on the SD card. Regardless of what shape the kernel and berryboot.img files are in, it should display the rainbow screen.

You can upgrade/downgrade to different firmware versions with rpi-update. If you find a specific version which causes problems, report it to https://github.com/raspberrypi/firmware. In any case, this thread is about a Linux kernel problem, and that does not apply if you do not get to the rainbow screen.

derykmarl commented 6 years ago

I'm getting this too, but only if running an initramfs (which I was doing for overlay support, so that the Pi can be switched off at the wall without risking a corrupt filesystem). When that's disabled, it boots consistently.

I just re-imaged and set it up again with the same result... also the same after manually running rpi-update.

derykmarl commented 6 years ago

So far (after about 16 boots, and it was a lot less stable before) adding the kernel parameters dwc_otg.fiq_enable=0 and dwc_otg.fiq_fsm_enable=0 has worked around the bug, as per this post https://www.raspberrypi.org/forums/viewtopic.php?p=1273852#p1273852

macmpi commented 6 years ago

I'm getting this too, but only if running an initramfs

Interesting, as @maxnet's berryboot also runs out of an initramfs.

beren12 commented 6 years ago

Are you guys connecting to a TV or monitor when it's crashing? I might have a similar problem but there's lots of things that let it boot, just not with my dell monitor non-rotated.

paul-1 commented 6 years ago

In my experience it is more stable headless, but HDMI or the rpi touchscreen increases the occurrences. Stability seems to change with each kernel, as small changes in the kernel affect the boot ever so slightly.

beren12 commented 6 years ago

Oh see my bug is all or nothing. It doesn't work, but there's lots of things I can do to make it work, like tell the rpi to rotate the video, and it works. Or plug the monitor in after it boots and run tvservice -p, or use a tv and not a monitor, or the experimental vid overlay… https://github.com/raspberrypi/firmware/issues/980

macmpi commented 6 years ago

@pelwell this likely FIQ issue seems to be a 4.14 regression impacting several devices (Pi3b+, P2B,...) at boot with initramfs. Earlier Fix mentioned by jdb in forum does not seem to fix it for good: any suggestion? Thanks.

symbios24 commented 6 years ago

The kernel panic bug is also active on Raspberry Pi 2; only Raspberry Pi 3 Model B (not Plus) does not seem to have it.

isgallagher commented 6 years ago

We are getting this same issue on Raspberry Pi 3 Model B (not plus) V1.2 with every kernel we have tried so far (4.14.15, 4.14.17, 4.14.57). We are using initramfs, dtoverlay, as well as cryptdevice option in cmdline.txt.

Setting dwc_otg.fig_enable=0 and dwc_otg.fiq_fsm_enable=0 does not help in our case.

maxnet commented 6 years ago

Setting dwc_otg.fig_enable=0 and dwc_otg.fiq_fsm_enable=0 does not help in our case.

It is fiq, not fig

And make sure you added dwc_otg.fiq_enable=0 dwc_otg.fiq_fsm_enable=0 on the same line as the existing options in cmdline.txt
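As an illustration of the "same line" requirement, here is a small sketch that folds everything in a cmdline.txt onto one line and appends the two workaround options. The function name and the sample contents in the usage note are hypothetical. Note that it collapses runs of whitespace, so it is not suitable for files with quoted values that contain multiple consecutive spaces.

```python
FIQ_WORKAROUND = ("dwc_otg.fiq_enable=0", "dwc_otg.fiq_fsm_enable=0")


def add_kernel_params(cmdline_text, params=FIQ_WORKAROUND):
    """Return cmdline.txt content as a single line with params appended.

    Any existing setting with the same key is replaced, and options that
    were mistakenly placed on a second line are folded onto the first.
    """
    tokens = cmdline_text.split()
    for param in params:
        key = param.split("=", 1)[0]
        # Drop an older value of the same option before appending the new one.
        tokens = [t for t in tokens if t.split("=", 1)[0] != key]
        tokens.append(param)
    return " ".join(tokens) + "\n"
```

For example, add_kernel_params("console=tty1 rootwait\n") gives "console=tty1 rootwait dwc_otg.fiq_enable=0 dwc_otg.fiq_fsm_enable=0" followed by a newline. Keep a backup of the original file before writing the result back.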

isgallagher commented 6 years ago

@maxnet hi maxnet, yea that was a typo on my previous comment, I used the correct spelling on the cmdline.txt, and yes I added both switches to the same line.

With dtoverlay enabled, initramfs, (heat), and a monitor plugged in, this is very reproducible. The larger the resolution for the monitor, the more likely it will happen. E.g. 1920x1200 triggers it every time with or without heat, and 1366x768 happens sporadically when the Pi is hot (no heatsink and in a hot room), but it stops happening on 1366x768 when the Pi is cool enough (with a heatsink and in a cool room).

maxnet commented 6 years ago

@maxnet hi maxnet, yea that was a typo on my previous comment, I used the correct spelling on the cmdline.txt, and yes I added both switches to the same line.

If you do cat /proc/cmdline on one of the occasions it does boot, it confirms the switches are there? And dmesg |grep FIQ also shows it is disabled?

Reason I am a bit skeptical is that it should never reach the functions mentioned in your crash dump if those parameters are there. Should not be able to error out on functions it does not run.
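The /proc/cmdline part of that check could be scripted roughly as follows. This is a sketch with a made-up function name; it only looks at the command line, not at dmesg, since the exact FIQ message format is not shown in this thread.

```python
EXPECTED = ("dwc_otg.fiq_enable", "dwc_otg.fiq_fsm_enable")


def check_fiq_params(cmdline):
    """Map each expected option name to True if it is present and set to 0."""
    params = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
    return {name: params.get(name) == "0" for name in EXPECTED}


if __name__ == "__main__":
    # Run this on the Pi itself, on one of the occasions it does boot.
    with open("/proc/cmdline") as f:
        for name, ok in check_fiq_params(f.read()).items():
            print(name, "disabled" if ok else "NOT disabled")
```

A misspelled option such as dwc_otg.fig_enable=0 shows up as "NOT disabled" for dwc_otg.fiq_enable, because only exact names are matched.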

mutability commented 6 years ago

Another datapoint: I see this panic regularly with:

Pi 3B v1.2, no case, no heatsink
kernel 4.14.52-v7+, raspberrypi-kernel/raspberrypi-bootloader 1.20180703-1
with an initramfs configured
with a HDMI display connected
originally with a 3.5" LCD touchscreen & corresponding overlay, but still seen after removing those (perhaps less frequently)

Setting dwc_otg.fiq_enable=0 dwc_otg.fiq_fsm_enable=0 appears to work around it (up to ~30 reboots without a panic so far, without those options I'd see about 1-in-5 failures)
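To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes every boot fails independently with the same probability, which is a simplification (other comments in this thread suggest the rate depends on temperature and display setup).

```python
def prob_no_failures(failure_rate, boots):
    """Chance of seeing zero panics in `boots` boots, assuming each boot
    panics independently with probability `failure_rate`."""
    return (1.0 - failure_rate) ** boots


if __name__ == "__main__":
    # 1-in-5 failure rate without the options, 30 clean reboots with them.
    p = prob_no_failures(0.2, 30)
    print(f"chance of 30 clean boots by luck alone: {p:.4%}")
```

Under that assumption, 30 clean boots in a row would happen by luck only about 0.12% of the time if the options had no effect.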

It's fairly fatal for remote/unattended installs as even if panic=60 is set, the Pi will fail to reboot and wedges entirely:

SMP: failed to stop secondary CPUs
Rebooting in 60 seconds..
SMP: failed to stop secondary CPUs
Reboot failed -- System halted

maxnet commented 6 years ago

Could someone who has a high-res display + custom kernel + initramfs perhaps give the attached experimental kernel patch a try? I only have a 1080p screen myself, and the issue only occurred occasionally with that before, so it is hard to tell whether no longer seeing it is thanks to my patch or not.

The patch creates a semi-static mapping for the USB registers early in the boot process, before additional kernel threads are started, so all threads have the mapping from the start and no longer need a data abort to lazily update their page tables before they can access the registers. (When ioremap() is called later in dwc_otg_driver.c, the ioremap function searches existing static mappings first and automatically returns the one present, so no changes are needed in that file.)
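The "search static mappings first" behaviour can be pictured with a toy model. This is plain Python, not kernel code and not the contents of the patch: the names are invented, the physical address is taken from the log output earlier in the thread, and the static virtual address used in the usage note is a purely hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticMapping:
    phys: int   # physical start address
    virt: int   # virtual address chosen at early boot
    size: int   # size of the region in bytes


def toy_ioremap(phys, size, static_mappings, alloc_dynamic):
    """Return a virtual address for the physical range [phys, phys + size).

    If an existing static mapping fully covers the range, reuse it.
    Otherwise fall back to alloc_dynamic, which stands in for creating a
    brand-new mapping at whatever address happens to be free.
    """
    for m in static_mappings:
        if m.phys <= phys and phys + size <= m.phys + m.size:
            return m.virt + (phys - m.phys)
    return alloc_dynamic(phys, size)
```

For example, with static = [StaticMapping(phys=0x3F980000, virt=0xF0980000, size=0x10000)], a call to toy_ioremap(0x3F980000, 0x10000, static, alloc) returns the static address 0xF0980000 and never calls alloc. With an empty list it returns whatever alloc provides, which corresponds to the late, dynamically created mapping seen earlier in this thread.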

fiq-fix-patch.txt

pelwell commented 6 years ago

Reading the patch, I don't like the hard-coded virtual addresses, but I'll take your word for it that there isn't a better way.

Some independent confirmation that this patch is effective before I apply it would be great.

P33M commented 6 years ago

Skimming through the mailing list thread, wouldn't it be sufficient to force a page table walk by reading the USB address space on the nominated FIQ core (and have that function synchronously complete) prior to enabling the FIQ? hcd_init_fiq() is called in a non-preemptible but threaded context so taking a data abort shouldn't be fatal.

maxnet commented 6 years ago

Reading the patch, I don't like the hard-coded virtual addresses, but I'll take your word for it that there isn't a better way.

In theory there may be better ways, but this was the least invasive way I could think of.

E.g. https://www.spinics.net/lists/arm-kernel/msg325250.html suggests that one option may be to add a function to the Linux memory management code that forces an update of the L1 page tables for all threads before continuing:

What might be possible is to have a function which can be called in
these circumstances which ensures that a kernel address is accessible
to all threads in the system, though while it's doing that, it would
have to stop any fork() or exit() activity to be sure that it updated
every thread.

However modifying core kernel functions to achieve something like that is a bit too ambitious for me.

maxnet commented 6 years ago

Skimming through the mailing list thread, wouldn't it be sufficient to force a page table walk by reading the USB address space on the nominated FIQ core

I think each thread has an L1 page table of its own, rather than each core. There is a mailing list post that suggests that walking the threads and prefaulting in each one didn't work in practice.

https://www.spinics.net/lists/arm-kernel/msg330296.html

I did an prefaulting for each available processes:
    for_each_process(process)
    {
        printk("process: %s [%d]\n",process->comm,process->pid);
        if(process->mm) {
            switch_mm(old_process->mm,process->mm,process);
            ioread32(priv->my_hardware);   // access the memory, prefault mmu
            old_process = process;
        }
    }
but I still get the "Bad mode in data abort":

Not sure why that does not work, though. Perhaps something needs to be done to prevent new threads from being created while the walk runs.

P33M commented 6 years ago

Hmm so which set of page tables does the FIQ end up using? Whichever set was in use by the thread immediately before the processor took the FIQ exception?

Edit: If this is the case, then we need to guarantee no other threads exist in the system when we do the ioremap() - since threads are spawned with copies of the caller's mm, which could be stale. Is there a way we can do this after the DT information is available but before the kernel spawns another thread?

maxnet commented 6 years ago

Whichever set was in use by the thread immediately before the processor took the FIQ exception?

Yes, I think so.

And if whatever it interrupted has the newest page tables (because the thread had a page fault a little bit earlier in normal kernel code), everything is fine. If the thread was created later in time, after the ioremap() of the USB registers, that is also fine. The corner case where you get the panic is an old thread with an old page table.
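The three cases can be played through in a toy model. This is a deliberately simplified Python illustration of the behaviour as described in this thread, not kernel code: real page tables are not sets of names, and the class and method names are invented.

```python
class BadModeInDataAbort(Exception):
    """Stands in for the kernel panic discussed in this thread."""


class ToyKernel:
    def __init__(self):
        self.master = set()   # kernel mappings in the reference table
        self.threads = {}     # thread name -> that thread's own copy

    def spawn(self, name):
        # A new thread starts with a copy of the mappings that exist now.
        self.threads[name] = set(self.master)

    def ioremap(self, region):
        # Only the reference table is updated; existing threads are stale.
        self.master.add(region)

    def access(self, thread, region, in_fiq=False):
        table = self.threads[thread]
        if region in table:
            return "ok"
        if in_fiq:
            # The lazy fix-up needs a data abort, which cannot be handled
            # while running in FIQ mode.
            raise BadModeInDataAbort(f"{thread}: {region} not mapped")
        if region in self.master:
            table.add(region)  # lazy fix-up from the reference table
            return "ok after fault"
        raise KeyError(region)


if __name__ == "__main__":
    k = ToyKernel()
    k.spawn("old")        # created before the USB registers are mapped
    k.ioremap("usb")
    k.spawn("new")        # created afterwards, so it has the mapping
    print(k.access("new", "usb", in_fiq=True))
    try:
        k.access("old", "usb", in_fiq=True)
    except BadModeInDataAbort as exc:
        print("panic:", exc)
```

In this model, an old thread that touches the region once in normal mode gets its table fixed up, and a later FIQ on top of that thread succeeds, which may explain why the panic is so timing dependent.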

==

Edit: If this is the case, then we need to guarantee no other threads exist in the system when we do the ioremap() - since threads are spawned with copies of the caller's mm, which could be stale. Is there a way we can do this after the DT information is available but before the kernel spawns another thread?

I think the DT information is not EXPANDED until after paging has been set up, and by then threads are also starting to run. You can, however, use the FLAT device tree functions earlier, as done in my patch.

P33M commented 6 years ago

Hardcoding the base addresses in that patch won't work for bcm2835 products - the peripherals are at 0x20000000 not 0xf0000000. Is there any way to access the ranges property in the flat DT?

maxnet commented 6 years ago

Hardcoding the base addresses in that patch won't work for bcm2835 products - the peripherals are at 0x20000000 not 0xf0000000.

I am only hard coding the VIRTUAL address in the kernel's virtual memory layout, not the physical address.

Is there any way to access the ranges property in the flat DT?

Suggest you read the patch one more time. :-)

P33M commented 6 years ago

You're right - disregard that. Patch looks OK to me @pelwell ?

pelwell commented 6 years ago

Some independent confirmation would be nice, but the kernel still boots for me with it compiled in, so I'm prepared to take a chance on it until we find something better.

@maxnet Do you have a "Signed-off-by:" line you would like me to include in the commit?

paul-1 commented 6 years ago

Recently this has been rather sporadic, but we still see this problem. I can build new kernels today with this patch and do some testing.

maxnet commented 6 years ago

I can build new kernels today with this patch and do some testing.

If building kernel takes long for you, you can also grab binaries from: https://sourceforge.net/projects/berryboot/files/

Yesterday's 20180715 has the patch. 20180415 does not (so you should be able to reproduce the issue with those)

@maxnet Do you have a "Signed-off-by:" line you would like me to include in the commit?

You can add Signed-off-by: Floris Bos <bos@je-eigen-domein.nl> But you may want to wait until someone confirms it actually solves the problem. I had several users (mostly with high-resolution displays: QHD, 4K, etc.) claiming the problem was 100% reproducible for them before, so it would be interesting if any of those are still around and can test it (however, chances are they have given up on it).

pelwell commented 6 years ago

You can add Signed-off-by: Floris Bos bos@je-eigen-domein.nl But may want to wait until someone confirms it actually solves the problem.

Thanks - I was just thinking ahead.

mutability commented 6 years ago

I have a release to get out the door first, but I reliably see the panic and will try to fit in some testing of the updated kernel

mutability commented 6 years ago

I can't reproduce the panic using the kernel from berryboot 20180715 (it survived an overnight run of rebooting about once a minute) however I also can't reproduce it using the kernel from berryboot 20180415 yet, so it's not conclusive.

I can reproduce it almost immediately using the kernel from raspberrypi-kernel 1.20180703-1

I'll build some kernels locally to see if I can test only the patch in isolation.

mutability commented 6 years ago

Building from rpi-4.14.y (at d407fc2 - I see HEAD has moved since I cloned), using bcm2709_defconfig and slotting the resulting kernel into my existing image with no other changes (it's kinda half broken in userspace because of missing modules etc, but it works enough to be able to ssh in and reboot):

Unpatched kernel panicked twice out of 5 boots.
Patched kernel with the patch in this comment is up to 13 reboots so far with no panics.

So the patch looks good here. I'll leave it running for a while longer to be sure. (edit: it was fine for a couple of hours worth of reboots)

paul-1 commented 6 years ago

Looking promising. Tested on custom 4.14.52 and 4.14.52-rt34 kernels. 3 different video setups..... No kernel panics.

Perhaps if I get time, I'll roll back and test kernels around 4.14.20, as they were much worse in terms of panics.

pelwell commented 6 years ago

That's good enough for me - the patch is now in rpi-4.14.y.

maxnet commented 6 years ago

Yes, seems to work. Thanks for the merge. Closing.

raspberrypi / linux

4.14: bad mode in data abort handler detected #2450