sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[DELL Z9432] TD4[BCM56881_B0] NPU bring-up issue #11939

Open srideepDell opened 1 year ago

srideepDell commented 1 year ago

Description

The current version of SAI/SDK supports BCM56881_B0. During bring-up we hit an issue where syncd exits while configuring the NPU. Logs and other details are attached below.

BRCM SAI ver: [7.1.0.0], OCP SAI ver: [1.10.2], SDK ver: [sdk-6.5.24]

Steps to reproduce the issue:

  1. Boot the latest SONiC image on a TD4 platform (e.g. Z9432).
  2. A syncd crash is seen in the logs.

Describe the results you received:

syncd crashes. Logs from syncd:

Aug 31 23:24:05.868960 sonic NOTICE syncd#syncd: :- processOidCreate: creating switch number 1
Aug 31 23:24:05.899421 sonic INFO syncd#supervisord: syncd Found 1 device.#015
Aug 31 23:24:05.899522 sonic INFO syncd#supervisord: syncd Unit 0: BCM56881#015
Aug 31 23:24:05.899556 sonic INFO syncd#supervisord: syncd NGBDE unit 0 (PCI), Dev 0xb881, Rev 0x11, Chip BCM56881_B0#015
Aug 31 23:24:05.915407 sonic INFO syncd#supervisord: syncd cp: cannot stat '/var/warmboot/brcm_bcm_scache': No such file or directory#015
Aug 31 23:24:05.950422 sonic INFO syncd#supervisord: syncd rc: unit 0 device BCM56881_B0#015
Aug 31 23:24:08.376118 sonic INFO bgp#supervisord: fpmsyncd Connected!
Aug 31 23:24:08.376235 sonic NOTICE bgp#fpmsyncd: :- setWarmStartState: bgp warm start state changed to disabled
Aug 31 23:24:09.004409 sonic NOTICE syncd#dsserve: child /usr/bin/syncd exited status: 135
Aug 31 23:24:09.004858 sonic INFO syncd#supervisord: syncd [5] child /usr/bin/syncd exited status: 135
Aug 31 23:24:09.005205 sonic INFO syncd#supervisord 2022-08-31 23:24:09,004 INFO exited: syncd (exit status 3; not expected)

Core file backtrace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/syncd --diag -u -s -p /etc/sai.d/sai.profile -b /tmp/break_before_make'.
Program terminated with signal SIGBUS, Bus error.
#0  0x00007fca71f74ad3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fca70327700 (LWP 44))]
(gdb) bt
#0  0x00007fca71f74ad3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fca7ab68747 in bcmptm_do_ha_alloc () from /usr/lib/libsai.so.1
#2  0x00007fca7abb18c1 in bcmptm_ptcache_init () from /usr/lib/libsai.so.1
#3  0x00007fca7a923a73 in ?? () from /usr/lib/libsai.so.1
#4  0x00007fca7a924aa0 in ?? () from /usr/lib/libsai.so.1
#5  0x00007fca7a7f5669 in ?? () from /usr/lib/libsai.so.1
#6  0x00007fca7a7f5ed9 in shr_sysm_instance_thread () from /usr/lib/libsai.so.1
#7  0x00007fca79823ce5 in ?? () from /usr/lib/libsai.so.1
#8  0x00007fca7a78c54f in ?? () from /usr/lib/libsai.so.1
#9  0x00007fca72340ea7 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007fca71fcadef in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb)
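
The exit status 135 in the syncd log and the SIGBUS reported by gdb appear to describe the same event. A minimal sketch of the decoding, assuming dsserve reports the status in the usual shell convention (128 + signal number) and using a hardcoded table of Linux x86-64 signal numbers; the function name `decode_exit_status` is only illustrative:

```python
# Signal numbers for Linux on x86-64, hardcoded so the result does not
# depend on the machine running this snippet.
LINUX_X86_64_SIGNALS = {
    4: "SIGILL",
    6: "SIGABRT",
    7: "SIGBUS",
    8: "SIGFPE",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def decode_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if status > 128:
        signum = status - 128
        name = LINUX_X86_64_SIGNALS.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited normally with code {status}"


print(decode_exit_status(135))  # killed by SIGBUS
```

Under that assumption, 135 - 128 = 7, which is SIGBUS on this platform and matches "Program terminated with signal SIGBUS" in the core file.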

Describe the results you expected:

The NPU is expected to come up and launch the brcm SDK shell.

Output of show version:

root@sonic:~# show version

SONiC Software Version: SONiC.master.138555-fb774dd46
Distribution: Debian 11.4
Kernel: 5.10.0-12-2-amd64
Build commit: fb774dd46
Build date: Tue Aug 23 16:46:29 UTC 2022
Built by: AzDevOps@sonic-build-workers-001ZGP

Platform: x86_64-dellemc_z9432f_c3758-r0
HwSKU: DellEMC-Z9432f-O32
ASIC: broadcom
ASIC Count: 1

srideepDell commented 1 year ago

9432_TD4_community_sonic_logs.txt

prgeor commented 1 year ago

A SIGBUS error indicates a problem accessing the mmapped region. Are you able to do PCI reads/writes in the configuration space and in the memory-mapped region?
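
One quick way to check the configuration-space half of that question from the host is to read the device's `config` file in sysfs and compare the IDs with what the SDK printed (Dev 0xb881). This is only a sketch: the PCI address below is a hypothetical placeholder, and it does not exercise the memory-mapped BAR region at all.

```python
import struct
from pathlib import Path
from typing import Tuple


def parse_pci_ids(config: bytes) -> Tuple[int, int]:
    """Return (vendor_id, device_id) from the first 4 bytes of PCI config space."""
    if len(config) < 4:
        raise ValueError("need at least 4 bytes of config space")
    # Both fields are 16-bit little-endian values at offsets 0 and 2.
    vendor_id, device_id = struct.unpack_from("<HH", config, 0)
    return vendor_id, device_id


def read_pci_ids(bdf: str) -> Tuple[int, int]:
    """Read the IDs of one device through sysfs (Linux only)."""
    path = Path("/sys/bus/pci/devices") / bdf / "config"
    return parse_pci_ids(path.read_bytes()[:64])


if __name__ == "__main__":
    # Hypothetical address; the real one can be found with `lspci -d 14e4:`.
    bdf = "0000:01:00.0"
    try:
        vendor, device = read_pci_ids(bdf)
        print(f"{bdf}: vendor=0x{vendor:04x} device=0x{device:04x}")
        if vendor == 0xFFFF:
            print("all-ones read: the device is not responding to config reads")
    except OSError as exc:
        print(f"could not read config space for {bdf}: {exc}")
```

For this switch one would expect vendor 0x14e4 (Broadcom) and device 0xb881; a vendor ID of 0xffff usually means config reads are not reaching the device.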

srideepDell commented 1 year ago

@prgeor Yes, we are able to access the PCI configuration and memory-mapped regions. The gdb traces/logs are only meant to show that the issue is in the libsai threads, which we will not be able to debug since platform vendors don't have access to it.

The logs indicate that the chip is detected and initialized, but some callbacks from the SDK are not returning and are timing out. Similar issues have been reported by other HW vendors too: https://github.com/sonic-net/sonic-buildimage/issues/8216


Aug 31 23:24:05.899421 sonic INFO syncd#supervisord: syncd Found 1 device.#015
Aug 31 23:24:05.899522 sonic INFO syncd#supervisord: syncd Unit 0: BCM56881#015
Aug 31 23:24:05.899556 sonic INFO syncd#supervisord: syncd NGBDE unit 0 (PCI), Dev 0xb881, Rev 0x11, Chip BCM56881_B0#015
Aug 31 23:24:05.915407 sonic INFO syncd#supervisord: syncd cp: cannot stat '/var/warmboot/brcm_bcm_scache': No such file or directory#015
Aug 31 23:24:05.950422 sonic INFO syncd#supervisord: syncd rc: unit 0 device BCM56881_B0#015

arunlk-dell commented 2 months ago

@prgeor With the latest master image we are seeing different failures in TD4 bring-up. Logs are attached: TD4_Community_Sonic_logs.txt

Logs to be noted:

Apr 25 05:59:10.616460 sonic INFO syncd#supervisord: syncd Found 1 device.#015
Apr 25 05:59:10.616460 sonic INFO syncd#supervisord: syncd Unit 0: BCM56881#015
Apr 25 05:59:10.617467 sonic INFO syncd#supervisord: syncd NGBDE unit 0 (PCI), Dev 0xb881, Rev 0x11, Chip BCM56881_B0#015
Apr 25 05:59:10.649398 sonic INFO syncd#supervisord: syncd rc: unit 0 device BCM56881_B0#015
Apr 25 05:59:10.662513 sonic INFO syncd#supervisord: syncd :bcmcfg_parse_loop: :27:22: syntax error.#015
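
The `:27:22:` in the bcmcfg_parse_loop message looks like a line:column position in the configuration file the SDK is parsing. That reading is an assumption, and the log does not say which file is involved. If it holds, a small helper like the following (the name `show_position` is made up for illustration) can print the surrounding lines of a candidate file with a caret under the reported column:

```python
def show_position(text: str, line: int, column: int, context: int = 1) -> str:
    """Return the lines around (line, column), both 1-based, with a caret marker."""
    lines = text.splitlines()
    if not 1 <= line <= len(lines):
        raise ValueError(f"line {line} is outside 1..{len(lines)}")
    first = max(1, line - context)
    last = min(len(lines), line + context)
    out = []
    for number in range(first, last + 1):
        # The prefix "NNNN | " is 7 characters wide.
        out.append(f"{number:4d} | {lines[number - 1]}")
        if number == line:
            out.append(" " * 7 + " " * (column - 1) + "^")
    return "\n".join(out)


# Tiny made-up input; for the real case, pass the contents of the config
# file and the position from the log, e.g. show_position(text, 27, 22).
sample = "a: 1\nb: [2\nc: 3"
print(show_position(sample, 2, 4))
```

Looking at the reported position in the config file that syncd actually loads may show whether the syntax error comes from the file itself or from a parser/SDK version mismatch.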

Apr 25 05:59:12.935999 sonic WARNING kernel: [ 87.357839] Unmatched linux_ngknet.ko, please use the latest.

Apr 25 05:59:13.629954 sonic CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:3204 setting inter-frame gap failed with error Feature unavailable (0xfffffff0).
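
The value `0xfffffff0` in the last line is most likely a negative status code printed as an unsigned 32-bit number (an assumption based on the hex pattern; the log itself labels it "Feature unavailable"). A short sketch of the conversion, with `to_signed32` as an illustrative helper name:

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


print(to_signed32(0xFFFFFFF0))  # -16
```

So the call that sets the inter-frame gap returned -16, which the SAI log renders as "Feature unavailable".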