sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
723 stars 1.38k forks source link

Failed to install SONiC on physical switch #5596

Closed Blueve closed 3 years ago

Blueve commented 3 years ago

Description

I am trying to install a SONiC image (from recent master branch: https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all/436/) to my switch, but it hanging forever and not able to login.

I met this issue before, I power cycled the switch and rolled back to previous image to get my switch back. But I want to test my code with current master branch code, so I have to use new image now.

How to diagnostic this issue?

Steps to reproduce the issue:

  1. Download the image
  2. Run sudo sonic-installer install http://10.1.100.70/images/jika/sonic-broadcom-pyw.bin on switch

Describe the results you received:

The installation hanging forever and the last command line output is shown below:

admin@sonic:/etc/sonic$ sudo sonic-installer install http://10.1.100.70/images/jika/sonic-broadcom-pyw.bin
New image will be installed, continue? [y/N]: y
Downloading image...
...98%, 845 MB, 14923 KB/s, 0 seconds left...    
Installing image SONiC-OS-master.0-d03de95e and setting it as default...
Command: bash /tmp/sonic_image
tar: installer/install.sh: time stamp 2020-10-10 07:35:23 is 50612951.78864607 s in the future
tar: installer/onie-image-arm64.conf: time stamp 2020-10-10 07:35:23 is 50612951.781260198 s in the future
tar: installer/machine.conf: time stamp 2020-10-10 07:35:24 is 50612952.781117326 s in the future
tar: installer/platform.conf: time stamp 2020-10-10 07:35:24 is 50612952.781016766 s in the future

Trying to connect to the switch via console, blank console in most time. Sometimes I got

sonic login:
sonic login:
sonic login:

But not interactable.

10/14/2020 update

I am trying to install a newer image: https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all/443/artifact/target/ It stuck again, but I caught few logs, it might helpful for trouble shooting?

[ 1105.436434] perf: interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[ 1386.398189] perf: interrupt took too long (3182 > 3150), lowering kernel.perf_event_max_sample_rate to 62750
[ 1801.423365] perf: interrupt took too long (3978 > 3977), lowering kernel.perf_event_max_sample_rate to 50250

Since it stuck too much time, I power-cycled the switch and boot in ONIE to install the same image. The log is shown below:

[    8.630723] rc.local[396]: + cat /etc/sonic/sonic_version.yml
[    8.706593] rc.local[396]: + grep build_version
[    8.768138] rc.local[396]: + sed -e s/build_version: //g;s/'//g
[    8.842740] rc.local[396]: + SONIC_VERSION=master.443-bba5df05
[    8.922101] rc.local[396]: + FIRST_BOOT_FILE=/host/image-master.443-bba5df05/platform/firsttime
[    9.027126] rc.local[396]: + SONIC_CONFIG_DIR=/host/image-master.443-bba5df05/sonic-config
[    9.135884] rc.local[396]: + SONIC_ENV_FILE=/host/image-master.443-bba5df05/sonic-config/sonic-environment
[    9.264329] rc.local[396]: + [ -d /host/image-master.443-bba5df05/sonic-config -a -f /host/image-master.443-bba5df05/sonic-config/sonic-environment ]
[    9.433779] rc.local[396]: + logger SONiC version master.443-bba5df05 starting up...
[    9.529919] rc.local[396]: + grub_installation_needed=
[    9.604984] rc.local[396]: + [ ! -e /host/machine.conf ]
[    9.674723] rc.local[396]: + migrate_nos_configuration
[    9.745793] rc.local[396]: + rm -rf /host/migration
[    9.805791] rc.local[396]: + mkdir -p /host/migration
[    9.894118] rc.local[396]: + cat /proc/cmdline
[    9.971944] rc.local[396]: + set -- BOOT_IMAGE=/image-master.443-bba5df05/boot/vmlinuz-4.19.0-9-2-amd64 root=UUID=7124e1b2-b403-4c75-8ff7-4e269197bfdd rw console=tty0 console=ttyS1,9600n8 quiet net.ifnames=0 biosdevname=0 loop=image-master.443-bba5df05/fs.squashfs loopfstype=squashfs apparmor=1 security=apparmor varlog_size=4096 usbcore.autosuspend=-1 module_blacklist=gpio_ich
[   10.377791] rc.local[396]: + [ -n  ]
[   10.429791] rc.local[396]: + . /host/machine.conf
[   10.489849] rc.local[396]: + onie_version=2015.05.0.0.3
[   10.562027] rc.local[396]: + onie_vendor_id=12244
[   10.621799] rc.local[396]: + onie_platform=x86_64-cel_e1031-r0
[   10.698736] rc.local[396]: + onie_machine=cel_e1031
[   10.757842] rc.local[396]: + onie_machine_rev=0
[   10.818917] rc.local[396]: + onie_arch=x86_64
[   10.881299] rc.local[396]: + onie_config_version=1
[   10.941861] rc.local[396]: + onie_build_date=2018-05-14T19:01-0400
[   11.018152] rc.local[396]: + onie_partition_type=gpt
[   11.078396] rc.local[396]: + onie_kernel_version=3.2.35
[   11.150096] rc.local[396]: + program_console_speed
[   11.218778] rc.local[396]: + grep -Eo console=ttyS[0-9]+,[0-9]+
[   11.297414] rc.local[396]: + cat /proc/cmdline
[   11.358307] kdump-tools[393]: Starting kdump-tools: no crashkernel= parameter in the kernel cmdline ... failed!
[   11.483789] rc.local[396]: + cut -d , -f2
[   11.541325] rc.local[396]: + speed=9600
[   11.593830] rc.local[396]: + [ -z 9600 ]
[   11.650927] rc.local[396]: + CONSOLE_SPEED=9600
[   11.710765] rc.local[396]: + sed -i s|\-\-keep\-baud .* %I| 9600 %I|g /lib/systemd/system/serial-getty@.service
[   11.837802] rc.local[396]: + systemctl daemon-reload
[   11.899651] rc.local[396]: + [ -f /host/image-master.443-bba5df05/platform/firsttime ]
[   12.001814] rc.local[396]: + echo First boot detected. Performing first boot tasks...
[   12.109867] rc.local[396]: First boot detected. Performing first boot tasks...
[   12.202611] rc.local[396]: + [ -n  ]
[   12.250351] rc.local[396]: + [ -n x86_64-cel_e1031-r0 ]
[   12.322319] rc.local[396]: + platform=x86_64-cel_e1031-r0
[   12.394769] rc.local[396]: + [ -d /host/old_config ]
[   12.465802] rc.local[396]: + [ -f /host/minigraph.xml ]
[   12.537844] rc.local[396]: + [ -n  ]
[   12.581856] rc.local[396]: + touch /tmp/pending_config_initialization
[   12.671127] rc.local[396]: + touch /tmp/notify_firstboot_to_platform
[   12.752760] rc.local[396]: + [ ! -d /host/reboot-cause/platform ]
[   12.829851] rc.local[396]: + mkdir -p /host/reboot-cause/platform
[   12.905824] rc.local[396]: + [ -d /host/image-master.443-bba5df05/platform/x86_64-cel_e1031-r0 ]
[   13.015177] rc.local[396]: + dpkg -i /host/image-master.443-bba5df05/platform/x86_64-cel_e1031-r0/platform-modules-haliburton_0.9_amd64.deb
[   13.172432] rc.local[396]: Selecting previously unselected package platform-modules-haliburton.
[   13.281850] rc.local[396]: (Reading database ... 30037 files and directories currently installed.)
[   13.394249] rc.local[396]: Preparing to unpack .../platform-modules-haliburton_0.9_amd64.deb ...
[   13.501817] rc.local[396]: Unpacking platform-modules-haliburton (0.9) ...
[   13.589894] rc.local[396]: Setting up platform-modules-haliburton (0.9) ...
[   19.109655] rc.local[396]: Synchronizing state of platform-modules-haliburton.service with SysV service script with /lib/systemd/systemd-sysv-install.
[   19.281856] rc.local[396]: Executing: /lib/systemd/systemd-sysv-install enable platform-modules-haliburton
[   25.291791] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   25.381489] BUG: unable to handle kernel paging request at ffffffffc073c098
[   25.464931] PGD 6a60e067 P4D 6a60e067 PUD 6a610067 PMD 78b13067 PTE 8000000079dc0063
[   25.557750] Oops: 0011 [#1] SMP PTI
[   25.599529] CPU: 0 PID: 641 Comm: platform-module Tainted: G           OE     4.19.0-9-2-amd64 #1 Debian 4.19.118-2+deb10u1
[   25.732972] Hardware name: Celestica E1031/E1031, BIOS E1031010 06/25/2018
[   25.815379] RIP: 0010:__this_module+0x58/0xffffffffffffcfc0 [pmbus_core]
[   25.895692] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30 0c 4e 39 55 8d ff ff <98> b2 71 c0 ff ff ff ff d8 35 73 c0 ff ff ff ff b8 32 52 3b 55 8d
[   26.120806] RSP: 0018:ffffb5ca00627b38 EFLAGS: 00010282
[   26.183412] RAX: ffffffffc073c098 RBX: ffff8d5538af3400 RCX: 0000000000000000
[   26.268934] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8d5538af3400
[   26.298474] ismt_smbus 0000:00:13.0: completion wait timed out
[   26.354454] RBP: 0000000000000000 R08: ffff8d553b4c0000 R09: 0000000000000000
[   26.354456] R10: 0000000000100000 R11: 00000000000028ce R12: ffff8d5538af3818
[   26.354458] R13: ffff8d5538af3400 R14: ffff8d5538af3818 R15: ffff8d5538af3420
[   26.354460] FS:  00007fc2f7294740(0000) GS:ffff8d553c000000(0000) knlGS:0000000000000000
[   26.354462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.354464] CR2: ffffffffc073c098 CR3: 000000007933c000 CR4: 00000000001006f0
[   26.354466] Call Trace:
[   26.354474]  ? _pmbus_write_byte.constprop.18+0x30/0x50 [pmbus_core]
[   26.354481]  ? pmbus_clear_faults+0x30/0x50 [pmbus_core]
[   26.354488]  ? pmbus_do_probe+0x21e/0xda0 [pmbus_core]
[   26.424525] max6697: probe of 11-001a failed with error -110
[   26.509915]  ? _cond_resched+0x15/0x30
[   26.509920]  ? kmem_cache_alloc_trace+0x15e/0x1e0
[   26.509923]  ? _cond_resched+0x15/0x30
[   26.509927]  ? kernfs_activate+0x63/0x80
[   26.509930]  ? kernfs_add_one+0xe7/0x130
[   26.509934]  ? 0xffffffffc0719000
[   26.509938]  ? i2c_device_probe+0x183/0x270
[   26.509943]  ? really_probe+0x24b/0x3b0
[   26.509949]  ? __driver_attach+0x110/0x110
[   27.655681]  ? driver_probe_device+0xb3/0xf0
[   27.706832]  ? __driver_attach+0x110/0x110
[   27.755902]  ? bus_for_each_drv+0x76/0xc0
[   27.803929]  ? __device_attach+0xd9/0x150
[   27.851956]  ? bus_probe_device+0x8a/0xa0
[   27.899983]  ? device_add+0x399/0x690
[   27.943845]  ? i2c_new_device+0x15c/0x360
[   27.991871]  ? i2c_sysfs_new_device+0x109/0x2b0
[   28.046155]  ? kernfs_fop_write+0x116/0x190
[   28.096268]  ? vfs_write+0xa5/0x1a0
[   28.138043]  ? ksys_write+0x57/0xd0
[   28.179823]  ? do_syscall_64+0x53/0x110
[   28.225765]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   28.288376] Modules linked in: mc24lc64t(OE) max6697(E) i2c_mux_pca954x(E) i2c_mux(E) dps200(OE) pmbus_core(E) hlx_gpio_ich(OE) smc(OE) nft_counter(E) xt_conntrack(E) nf_nat(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xfrm_user(E) xfrm_algo(E) linux_knet_cb(OE) linux_bcm_knet(OE) psample(OE) linux_user_bde(OE) linux_kernel_bde(OE) evdev(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) intel_cstate(E) wdat_wdt(E) pcspkr(E) bonding(E) sg(E) pcc_cpufreq(E) button(E) acpi_cpufreq(E) cp210x(E) usbserial(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ebtable_nat(E) ebtable_filter(E) ebtables(E) nf_tables(E) nfnetlink(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E)
[   29.133346]  mbcache(E) jbd2(E) crc32c_generic(E) fscrypto(E) ecb(E) nls_utf8(E) nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) zstd_decompress(E) xxhash(E) sd_mod(E) crc32c_intel(E) aesni_intel(E) ahci(E) libahci(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) ehci_pci(E) ehci_hcd(E) i2c_i801(E) lpc_ich(E) mfd_core(E) libata(E) igb(E) i2c_algo_bit(E) dca(E) usbcore(E) scsi_mod(E) usb_common(E) i2c_ismt(E)
[   29.591832] CR2: ffffffffc073c098
[   29.631527] ---[ end trace 2dbc4edc589a9155 ]---

And then it stuck and no any response on console. @yxieca met same issue today. I power-cycled the switch again, and I can enter the OS now without issue.

Describe the results you expected:

SONiC OS installed with proper command line output feedback. Able to login to the SONiC OS after installed.

Additional information you deem important (e.g. issue happens only occasionally):

Blueve commented 3 years ago

Update with more information.

admin@sonic:~$ show version

SONiC Software Version: SONiC.master.443-bba5df05
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: bba5df05
Build date: Tue Oct 13 19:50:56 UTC 2020
Built by: johnar@jenkins-worker-11

Platform: x86_64-cel_e1031-r0
HwSKU: Celestica-E1031-T48S4
ASIC: broadcom
/usr/local/bin/decode-syseeprom : ERROR : Failed to read eeprom : [Errno 2] No such file or directory: '/sys/class/i2c-adapter/i2c-2/2-0050/eeprom'
Serial Number: 
Uptime: 10:54:17 up 42 min,  1 user,  load average: 1.43, 1.48, 1.46

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-teamd                  latest                bd1df6fe3cbc        394MB
docker-teamd                  master.443-bba5df05   bd1df6fe3cbc        394MB
docker-nat                    latest                2db062aaf7f7        396MB
docker-nat                    master.443-bba5df05   2db062aaf7f7        396MB
docker-router-advertiser      latest                44f82f06d9e3        359MB
docker-router-advertiser      master.443-bba5df05   44f82f06d9e3        359MB
docker-platform-monitor       latest                4c51bd437497        441MB
docker-platform-monitor       master.443-bba5df05   4c51bd437497        441MB
docker-lldp                   latest                ea22f55f7cff        388MB
docker-lldp                   master.443-bba5df05   ea22f55f7cff        388MB
docker-database               latest                a87c7ce0bda1        359MB
docker-database               master.443-bba5df05   a87c7ce0bda1        359MB
docker-orchagent              latest                9ed19e1a8b00        407MB
docker-orchagent              master.443-bba5df05   9ed19e1a8b00        407MB
docker-dhcp-relay             latest                7d5aaaeacf8c        366MB
docker-dhcp-relay             master.443-bba5df05   7d5aaaeacf8c        366MB
docker-sonic-telemetry        latest                a8c51245a513        429MB
docker-sonic-telemetry        master.443-bba5df05   a8c51245a513        429MB
docker-sonic-mgmt-framework   latest                cab64d10b3ac        486MB
docker-sonic-mgmt-framework   master.443-bba5df05   cab64d10b3ac        486MB
docker-sflow                  latest                ee01daa4595c        397MB
docker-sflow                  master.443-bba5df05   ee01daa4595c        397MB
docker-fpm-frr                latest                a9df2f6c4db5        410MB
docker-fpm-frr                master.443-bba5df05   a9df2f6c4db5        410MB
docker-snmp                   latest                2ffc7ed302ef        399MB
docker-snmp                   master.443-bba5df05   2ffc7ed302ef        399MB
docker-syncd-brcm             latest                b826fbfdc0c0        542MB
docker-syncd-brcm             master.443-bba5df05   b826fbfdc0c0        542M
anshuv-mfst commented 3 years ago

Hi @Blueve - please connect with Celestica team offline. cc: @yxieca

lguohan commented 3 years ago

check out this message. it looks like a platform driver issue on celestica platform.

[   25.291791] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   25.381489] BUG: unable to handle kernel paging request at ffffffffc073c098
[   25.464931] PGD 6a60e067 P4D 6a60e067 PUD 6a610067 PMD 78b13067 PTE 8000000079dc0063
[   25.557750] Oops: 0011 [#1] SMP PTI
[   25.599529] CPU: 0 PID: 641 Comm: platform-module Tainted: G           OE     4.19.0-9-2-amd64 #1 Debian 4.19.118-2+deb10u1
[   25.732972] Hardware name: Celestica E1031/E1031, BIOS E1031010 06/25/2018
[   25.815379] RIP: 0010:__this_module+0x58/0xffffffffffffcfc0 [pmbus_core]
[   25.895692] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30 0c 4e 39 55 8d ff ff <98> b2 71 c0 ff ff ff ff d8 35 73 c0 ff ff ff ff b8 32 52 3b 55 8d
[   26.120806] RSP: 0018:ffffb5ca00627b38 EFLAGS: 00010282
[   26.183412] RAX: ffffffffc073c098 RBX: ffff8d5538af3400 RCX: 0000000000000000
[   26.268934] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8d5538af3400
[   26.298474] ismt_smbus 0000:00:13.0: completion wait timed out
[   26.354454] RBP: 0000000000000000 R08: ffff8d553b4c0000 R09: 0000000000000000
[   26.354456] R10: 0000000000100000 R11: 00000000000028ce R12: ffff8d5538af3818
[   26.354458] R13: ffff8d5538af3400 R14: ffff8d5538af3818 R15: ffff8d5538af3420
[   26.354460] FS:  00007fc2f7294740(0000) GS:ffff8d553c000000(0000) knlGS:0000000000000000
[   26.354462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.354464] CR2: ffffffffc073c098 CR3: 000000007933c000 CR4: 00000000001006f0
[   26.354466] Call Trace:
[   26.354474]  ? _pmbus_write_byte.constprop.18+0x30/0x50 [pmbus_core]
[   26.354481]  ? pmbus_clear_faults+0x30/0x50 [pmbus_core]
[   26.354488]  ? pmbus_do_probe+0x21e/0xda0 [pmbus_core]
[   26.424525] max6697: probe of 11-001a failed with error -110
Blueve commented 3 years ago

Update From: Pradchaya Phucharoen (Celestica)

the issue occurs cause by out-of-tree DPS200 PSU driver. Because the driver uses an old copy of drivers/hwmon/pmbus/pmbus.h. When the kernel patched by this patch it causes the file to be different.

The best way to fix this is to merge the DPS200 into sonic-Linux-kernel instead of keeping it as a platform driver. The temporary fix is to update pmbus.h in the platform source codes with the one patched in sonic-linux-kernel.

Blueve commented 3 years ago

Update I tested install the image after build 492 and I found that it can reboot in SONiC without hanging now. But still seen some issues might relating to this topic: