platinasystems / go

Tracking for kernel panics on warm-boots #58

Open donnlee opened 7 years ago

donnlee commented 7 years ago

This is the GitHub issue for tracking kernel panics during invader warm-boots. To clear the problem, we have been power-cycling the units (cold-boot).

After upgrading goes to http://downloads.platinasystems.com/LATEST/goes-platina-mk1 today, I warm-booted after seeing strange gobgp behavior:

$ gobgp nei
grpc: timed out when dialing

Warm-boot (console) said:

...
[  OK  ] Started Wait for Plymouth Boot Screen to Quit.
[  OK  ] Started Login Service.
         Starting Getty on tty1...
[  OK  ] Started Getty on tty1.
         Starting Serial Getty on ttyS0...
[  OK  ] Started Serial Getty on ttyS0.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started LSB: manage the statistics collection daemon.
[  OK  ] Started LSB: exim Mail Transport Agent.
[  OK  ] Started Network fabric for containers.
         Starting Docker Application Container Engine...

Debian GNU/Linux 8 invader8 ttyS0

invader8 login: Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: disabled
Rebooting in 30 seconds..
<wedged here for much longer than 30s>

I power-cycled to recover.

donn@invader8:~$ goes hget platina packages | grep version
        Free Software Foundation; either version 2 of the License, or (at your
        option) any later version.
          9. The Free Software Foundation may publish revised and/or new versions
        of the General Public License from time to time.  Such new versions will
        be similar in spirit to the present version, but may differ in detail to
        Each version is given a distinguishing version number.  If the Program
        specifies a version number of this License which applies to it and "any
        later version", you have the option of following the terms and conditions
        either of that version or of any later version published by the Free
        Software Foundation.  If the Program does not specify a version number of
        this License, you may choose any version ever published by the Free Software
    version: dcb42afc1f93cfd8a6d1d4cdb8c6549f37d3761f
    version: cdb4a934821e777dfb59d89cfc09151dcfd3792c
    version: 60f39141fbbf78ddb2260dba74c68f2789374f18

Commit: https://github.com/platinasystems/go/commit/dcb42afc1f93cfd8a6d1d4cdb8c6549f37d3761f
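The `grep version` output above also matches the embedded GPL text. A tighter filter is sketched below. It assumes that every package hash is printed on its own line as `version: ` followed by a 40-character hex string, as in the three hash lines above. The function name `filter_versions` is hypothetical.

```shell
# Keep only lines of the form "version: <40 hex chars>" and drop license text.
# Assumption: each package hash sits on its own "version:" line.
filter_versions() {
    grep -E '^[[:space:]]*version: [0-9a-f]{40}[[:space:]]*$'
}

# Usage on a unit (not run here):
#   goes hget platina packages | filter_versions
```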

jasonlpang commented 7 years ago

Hi Donn, this could be a duplicate of https://github.com/platinasystems/go/issues/4. Since you are seeing this on invader8, refer to the text below regarding Alpha units.

This issue is caused by the x86 performing a cold reboot, which power-cycles the x86. After that power cycle, the TH requires a hard reset to come back up on PCIe.

With AMI BIOS, the Debian "reboot" command performs a cold reboot, while "reboot -f" performs a warm reboot. With coreboot, the Intel FSP performs a cold reboot regardless of which command is issued on the Linux side. We are working with Intel to figure out how to support warm reboot with the FSP.

In the meantime, on Alpha units (invader1-15), avoid this issue either by using AMI BIOS with "reboot -f", or by issuing the following commands to hard reset the TH before launching goes:

    sudo i2cset -y 0x0 0x74 0x6 0xf8
    sudo i2cset -y 0x0 0x74 0x2 0xfc
    sudo i2cset -y 0x0 0x74 0x2 0xff
    sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
    sudo rmmod uio_pci_dma
    sudo insmod /uio_pci_dma.ko
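The reset sequence above could be wrapped in a small script so that it stops at the first failing step. This is only a sketch: the `run` helper, the `DRY_RUN` variable, and the function name `th_hard_reset` are hypothetical, and the commands and the `/uio_pci_dma.ko` path are taken unchanged from the comment. One detail differs: in `sudo echo 1 > file`, the redirect is performed by the calling shell rather than by sudo, so the write is wrapped in `sh -c` here.

```shell
#!/bin/sh
# Sketch of the TH hard-reset sequence quoted above.
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

th_hard_reset() {
    # I2C writes exactly as given in the comment (bus 0x0, device 0x74).
    run sudo i2cset -y 0x0 0x74 0x6 0xf8 || return 1
    run sudo i2cset -y 0x0 0x74 0x2 0xfc || return 1
    run sudo i2cset -y 0x0 0x74 0x2 0xff || return 1
    # Rescan the PCI bus; sh -c makes the redirect itself run as root.
    run sudo sh -c 'echo 1 > /sys/bus/pci/rescan' || return 1
    # Reload the driver; module path as given in the comment.
    run sudo rmmod uio_pci_dma || return 1
    run sudo insmod /uio_pci_dma.ko || return 1
}
```

Running `DRY_RUN=1; th_hard_reset` first lets you review the six commands before touching the hardware.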

With Beta units and beyond (invader16-24), this issue does not occur with CPLD v8 or higher. With lower CPLD versions, it is avoided by using the "reboot -f" command. Check the CPLD version with the ioget tool (sudo ./ioget 0x600).
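The per-unit guidance in this comment can be summarized as a small lookup. The sketch below encodes only what is stated here. The function name `reboot_advice` and its arguments are hypothetical, and the CPLD version is passed in as a plain integer because the output format of ioget is not shown in this thread.

```shell
# reboot_advice UNIT_NUMBER CPLD_VERSION
# UNIT_NUMBER is the N in "invaderN"; CPLD_VERSION is an integer such as 8.
# Assumption: units outside invader1-24 are not covered by the guidance.
reboot_advice() {
    unit=$1
    cpld=$2
    if [ "$unit" -ge 1 ] && [ "$unit" -le 15 ]; then
        echo "alpha: use AMI BIOS with 'reboot -f', or hard reset the TH before launching goes"
    elif [ "$unit" -ge 16 ] && [ "$unit" -le 24 ]; then
        if [ "$cpld" -ge 8 ]; then
            echo "beta: CPLD v8 or higher, issue not expected"
        else
            echo "beta: CPLD below v8, use 'reboot -f'"
        fi
    else
        echo "unknown: unit number outside invader1-24"
    fi
}
```

For example, `reboot_advice 8 0` reports the Alpha guidance, which is the case that applies to invader8.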

Thanks, Jason

stigt commented 7 years ago

This bug report mentions that an Intel firmware update fixes it: https://bugzilla.redhat.com/show_bug.cgi?id=1293901

jasonlpang commented 7 years ago

I did try the latest Intel microcode, and unfortunately this particular problem still occurs. I think the "Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler" message can be generated by various underlying causes. In our case, the panic is caused by a PCIe timeout under certain conditions.

-jason