sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Intermittent 202012 -> master warm upgrade error on 7060 and 7260 #15296

Open saiarcot895 opened 1 year ago

saiarcot895 commented 1 year ago

Description

A warmboot upgrade from the 202012 branch to the master branch sometimes fails. Specifically, it appears that the master branch kernel fails to initialize and switch to userspace.

Some debugging showed that, in at least some of the failures, the kernel panics:

[    6.214792] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[    6.214792] Shutting down cpus with NMI

The above message was seen either immediately after kexec or when the login prompt appears on the console (showing that, in those cases, it did switch to userspace, but something still caused it to panic).
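When triaging captured console output from failed attempts, it may help to pull out the panic line and its timestamp automatically. The following is only a sketch: it assumes the console output has been saved as plain text, and the names `PANIC_RE` and `find_kernel_panics` are hypothetical helpers, not part of any SONiC tooling.

```python
import re

# Pattern built from the panic line quoted in this issue:
# "[<seconds since boot>] Kernel panic - not syncing: <reason>"
PANIC_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+Kernel panic - not syncing: (?P<reason>.*)$"
)


def find_kernel_panics(console_text):
    """Return a (seconds_since_boot, reason) tuple for each panic line found."""
    hits = []
    for line in console_text.splitlines():
        match = PANIC_RE.match(line.strip())
        if match:
            hits.append((float(match.group("ts")), match.group("reason").strip()))
    return hits


# Sample input copied from the log excerpt in this issue.
sample = """\
[    6.214792] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[    6.214792] Shutting down cpus with NMI
"""
print(find_kernel_panics(sample))
```

The timestamp makes it easier to tell a panic right after kexec from one that happens once the login prompt is already up.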

What's odder is that a lab device that used to hit this issue consistently no longer appears to reproduce it. There was no configuration change, and no traffic was going through the device at any time.

Steps to reproduce the issue:

  1. Load a 202012 image on an Arista 7060 or 7260 box
  2. Do a warm-reboot to a master branch image
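The steps above could be written out as a command sequence roughly like the one below. This is a sketch under assumptions, not the exact procedure used in the lab: it assumes the device is already running the 202012 image, that the master image has been copied to the device, and that `sonic-installer install` makes the new image the next-boot target. The image path and the helper name `warm_upgrade_commands` are hypothetical. The code only builds and prints the commands; it does not run them.

```python
import shlex


def warm_upgrade_commands(image_path):
    """Build the commands for one 202012 -> master warm upgrade attempt.

    image_path is a hypothetical location of the master branch image
    on the device.
    """
    return [
        # Record the running 202012 build before the upgrade.
        ["show", "version"],
        # Install the master image (assumed to become the next-boot image).
        ["sudo", "sonic-installer", "install", "-y", image_path],
        # Warm-reboot into the newly installed image.
        ["sudo", "warm-reboot"],
    ]


for argv in warm_upgrade_commands("/tmp/sonic-master.bin"):
    print(shlex.join(argv))
```

Because the failure is intermittent, several attempts with the console captured each time may be needed before the panic shows up.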

Describe the results you received:

Warmboot fails, and the watchdog (or a kernel panic) causes a reboot

Describe the results you expected:

Warmboot should be successful

Output of show version:

SONiC Software Version: SONiC.20201231.100
Distribution: Debian 10.13
Kernel: 4.19.0-12-2-amd64
Build commit: 66d18310c7
Build date: Tue May 23 01:46:12 UTC 2023
Built by: cloudtest@9474fdabc000000

Platform: x86_64-arista_7260cx3_64
HwSKU: Arista-7260CX3-D108C8
ASIC: broadcom
ASIC Count: 1
Serial Number: SSJ18423396
Uptime: 17:25:44 up  6:57,  2 users,  load average: 0.75, 0.77, 0.97

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-mux                 20201231.100        305ac44d68d9        383MB
docker-mux                 latest              305ac44d68d9        383MB
docker-sonic-restapi       20201231.100        e55eb78283fc        350MB
docker-sonic-restapi       latest              e55eb78283fc        350MB
docker-sonic-telemetry     20201231.100        3692963e2f25        420MB
docker-sonic-telemetry     latest              3692963e2f25        420MB
docker-orchagent           20201231.100        ed45c6c1e19e        360MB
docker-orchagent           latest              ed45c6c1e19e        360MB
docker-teamd               20201231.100        202b1d7f5ff7        342MB
docker-teamd               latest              202b1d7f5ff7        342MB
docker-fpm-frr             20201231.100        38aa23cc5e8f        361MB
docker-fpm-frr             latest              38aa23cc5e8f        361MB
docker-platform-monitor    20201231.100        3ded45f89414        516MB
docker-platform-monitor    latest              3ded45f89414        516MB
docker-acms                20201231.100        4ec95c3f5a6a        435MB
docker-acms                latest              4ec95c3f5a6a        435MB
docker-snmp                20201231.100        239561269a3c        374MB
docker-snmp                latest              239561269a3c        374MB
docker-lldp                20201231.100        77d1c540f920        371MB
docker-lldp                latest              77d1c540f920        371MB
docker-database            20201231.100        2d703e57c546        331MB
docker-database            latest              2d703e57c546        331MB
docker-dhcp-relay          20201231.100        c8aba90e0a45        347MB
docker-dhcp-relay          latest              c8aba90e0a45        347MB
docker-syncd-brcm          20201231.100        68d2d719c7db        623MB
docker-syncd-brcm          latest              68d2d719c7db        623MB
docker-router-advertiser   20201231.100        a16b7b4e7866        331MB
docker-router-advertiser   latest              a16b7b4e7866        331MB
k8s.gcr.io/pause           3.5                 ed210e3e4a5b        683kB

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

judyjoseph commented 1 year ago

Sai, working with Arista