sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[202405][DNX] orchagent exited because of failing to set SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL #19592

Closed ysmanman closed 1 week ago

ysmanman commented 1 month ago

Description

We noticed the following orchagent failure in T2 testing with the 202405 image.

2024 Jul 13 13:59:29.646591 xxx405-3 INFO swss#supervisord: orchagent
2024 Jul 13 13:59:40.815226 xxx405-3 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_set_neighbor_entry_attribute:597 Error processing nbr entry attribute failed with error Unknown error (0xfffd0000).
2024 Jul 13 13:59:40.815226 xxx405-3 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_FAILURE
2024 Jul 13 13:59:40.815383 xxx405-3 ERR syncd#syncd: :- processQuadEvent: attr: SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL: false
2024 Jul 13 13:59:40.815875 xxx405-3 INFO swss#supervisord: orchagent
2024 Jul 13 13:59:40.815875 xxx405-3 ERR swss#orchagent: :- set: set status: SAI_STATUS_FAILURE
2024 Jul 13 13:59:40.815875 xxx405-3 ERR swss#orchagent: :- addNeighbor: Failed to update neighbor 00:11:22:33:44:55 on nfc405-7|Asic0|Ethernet180, attr.id=0x8, rv:-1
2024 Jul 13 13:59:40.815875 xxx405-3 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_NEIGHBOR, status: SAI_STATUS_FAILURE

The failure was observed with arp/test_neighbor_mac_noptf.py and arp/test_arpall.py.
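To triage logs like the excerpt above, the sketch below pulls out the attribute that syncd reported just before orchagent exited. It is only an illustration: the regular expressions are written against the two message formats visible in this excerpt (`processQuadEvent: attr: ...` and `handleSaiSetStatus: ... exiting orchagent`), and the helper name `summarize_failure` is hypothetical, not part of any SONiC tool.

```python
import re

# Patterns assumed from the log lines quoted in this issue; other releases may differ.
ATTR_RE = re.compile(r"processQuadEvent: attr: (?P<name>SAI_\w+): (?P<value>\S+)")
EXIT_RE = re.compile(
    r"handleSaiSetStatus: .*exiting orchagent, "
    r"SAI API: (?P<api>SAI_API_\w+), status: (?P<status>SAI_STATUS_\w+)"
)


def summarize_failure(lines):
    """Return the last attribute syncd logged before orchagent exited, or None."""
    last_attr = None
    for line in lines:
        attr = ATTR_RE.search(line)
        if attr:
            # Remember the most recent attribute syncd complained about.
            last_attr = (attr.group("name"), attr.group("value"))
            continue
        exit_msg = EXIT_RE.search(line)
        if exit_msg and last_attr is not None:
            return {
                "api": exit_msg.group("api"),
                "status": exit_msg.group("status"),
                "attr": last_attr[0],
                "value": last_attr[1],
            }
    return None


# Two lines copied from the excerpt above.
SAMPLE = [
    "2024 Jul 13 13:59:40.815383 xxx405-3 ERR syncd#syncd: :- processQuadEvent: "
    "attr: SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL: false",
    "2024 Jul 13 13:59:40.815875 xxx405-3 ERR swss#orchagent: :- handleSaiSetStatus: "
    "Encountered failure in set operation, exiting orchagent, "
    "SAI API: SAI_API_NEIGHBOR, status: SAI_STATUS_FAILURE",
]

if __name__ == "__main__":
    print(summarize_failure(SAMPLE))
```

Feeding it a full syslog (for example, the lines of `/var/log/syslog` read into a list) would return the first such failure it encounters.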

ysmanman commented 1 month ago

@arlakshm @kenneth-arista

ysmanman commented 1 month ago

I looked at the BRCM SAI code, and it seems that neither the 202205 nor the 202405 SAI supports setting SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL on a neighbor entry. However, we didn't see the failure in 202205 testing, so SONiC may behave differently in 202205 and 202405.

ysmanman commented 1 month ago

FYI, CSP CS00012298563 confirmed that BRCM SAI does not support setting SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL in SAI 9.x (and possibly in earlier versions too).

arlakshm commented 1 month ago

Thanks @ysmanman for reporting this issue. @saksarav-nokia, @mlok-nokia for visibility.

arlakshm commented 1 month ago

@ysmanman, I did a quick check on the SAI definition. This attribute supports create and set. Any reason why the SAI behavior was changed?

https://github.com/opencomputeproject/SAI/blob/dff0e34511e9a9018ee81743c95015f41a3f8c47/inc/saineighbor.h#L144-L156

ysmanman commented 1 month ago

Hi @arlakshm, I don't have much context on why BRCM stopped supporting setting SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL, at least starting from SAI 9.2. But based on the conversation in CSP CS00012298563, there was some discussion between MSFT and BRCM as well. Quoting the reply from BRCM:


update:

at the meeting with MSFT, they are asking if SAI9.x supports SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL on brcm_sai_set_neighbor_entry_attribute().

Answer: it is not supported on SAI 9.x

ysmanman commented 1 month ago

Opened CSP CS00012360402 to track the issue.

kenneth-arista commented 1 month ago

The SONiC behavior changed between 202205 and 202405. Specifically, https://github.com/sonic-net/sonic-swss/pull/2577 fixed applying all neighbor attributes, which exposed this problem in the DNX SAI.
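The effect described here can be illustrated with a toy model. This is not the orchagent code and does not reproduce the linked PR. It assumes, purely for illustration, that the older update path pushed only the first attribute to the SAI while the newer one pushes every attribute, and it uses a hypothetical `FakeSai` class that rejects SET on one attribute, as the DNX SAI is reported to do in this thread.

```python
SUCCESS = "SAI_STATUS_SUCCESS"
FAILURE = "SAI_STATUS_FAILURE"


class FakeSai:
    """Hypothetical stand-in for a SAI that rejects SET on some neighbor attributes."""

    def __init__(self, unsupported_on_set):
        self.unsupported_on_set = set(unsupported_on_set)
        self.applied = []

    def set_neighbor_attr(self, name, value):
        if name in self.unsupported_on_set:
            return FAILURE
        self.applied.append((name, value))
        return SUCCESS


def update_first_only(sai, attrs):
    """Assumed older behavior: only the first attribute reaches the SAI."""
    name, value = attrs[0]
    return [sai.set_neighbor_attr(name, value)]


def update_all(sai, attrs):
    """Assumed newer behavior: every attribute is set, stopping at the first failure."""
    statuses = []
    for name, value in attrs:
        status = sai.set_neighbor_attr(name, value)
        statuses.append(status)
        if status != SUCCESS:
            break  # in the logs above, orchagent exits at this point
    return statuses


ATTRS = [
    ("SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS", "00:11:22:33:44:55"),
    ("SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL", False),
]

if __name__ == "__main__":
    unsupported = ["SAI_NEIGHBOR_ENTRY_ATTR_IS_LOCAL"]
    print(update_first_only(FakeSai(unsupported), ATTRS))
    print(update_all(FakeSai(unsupported), ATTRS))
```

Under these assumptions the unsupported attribute never reaches the SAI in the first path, so the gap stays hidden. The second path surfaces it as a set failure.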

robertlperry commented 1 month ago

Hi @vmittal-msft, we have noticed this same failure in release 202305. Once fixed, do you know if it will be backported to the affected releases? Thanks.

Jul 17 15:32:56.950091 xx119 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_set_neighbor_entry_attribute:597 Error processing nbr entry attribute failed with error Unknown error (0xfffd0000).
Jul 17 15:32:56.950091 xx119 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_FAILURE
Jul 17 15:32:56.950091 xx119 ERR syncd#syncd: :- processQuadEvent: attr: SAI_NEIGHBOR_ENTRY_ATTR_NO_HOST_ROUTE: true
Jul 17 15:32:56.950262 xx119 ERR swss#orchagent: :- set: set status: SAI_STATUS_FAILURE
Jul 17 15:32:56.950286 xx119 ERR swss#orchagent: :- addNeighbor: Failed to update neighbor xx:xx:xx:xx:xx:xx on Ethernet49, attr.id=0x3, rv:-1
Jul 17 15:32:56.950286 xx119 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_NEIGHBOR, status: SAI_STATUS_FAILURE
Jul 17 15:32:56.950295 xx119 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
Jul 17 15:32:56.950498 xx119 NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump
Jul 17 15:32:56.955937 xx119 NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded
Jul 17 15:32:57.717095 xx119 INFO swss#supervisord 2024-07-17 15:32:57,716 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
Jul 17 15:32:58.721233 xx119 INFO swss#supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'
Jul 17 15:32:58.721364 xx119 NOTICE swss#supervisor-proc-exit-listener: :- publish: EVENT_PUBLISHED: {"sonic-events-host:process-exited-unexpectedly":{"ctr_name":"swss","process_name":"orchagent","timestamp":"2024-07-17T15:32:58.721202Z"}}
Jul 17 15:32:58.723282 xx119 INFO swss#supervisord 2024-07-17 15:32:58,722 WARN received SIGTERM indicating exit request
Jul 17 15:32:58.723282 xx119 INFO swss#supervisord 2024-07-17 15:32:58,722 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd, arp_update, ndppd, neighsyncd, vlanmgrd, intfmgrd, portmgrd, buffermgrd, vrfmgrd, nbrmgrd, vxlanmgrd, fdbsyncd, tunnelmgrd to die
Jul 17 15:32:58.723763 xx119 INFO swss#supervisord 2024-07-17 15:32:58,723 INFO stopped: tunnelmgrd (terminated by SIGTERM)
Jul 17 15:32:58.724977 xx119 INFO swss#supervisord 2024-07-17 15:32:58,724 INFO stopped: fdbsyncd (terminated by SIGTERM)
Jul 17 15:32:58.726371 xx119 INFO swss#supervisord 2024-07-17 15:32:58,725 INFO stopped: vxlanmgrd (terminated by SIGTERM)
Jul 17 15:32:58.727726 xx119 INFO swss#supervisord 2024-07-17 15:32:58,727 INFO stopped: nbrmgrd (terminated by SIGTERM)
Jul 17 15:32:58.729011 xx119 INFO swss#supervisord 2024-07-17 15:32:58,728 INFO stopped: vrfmgrd (terminated by SIGTERM)
Jul 17 15:32:59.731800 xx119 INFO swss#supervisord 2024-07-17 15:32:59,731 INFO stopped: buffermgrd (terminated by SIGTERM)
Jul 17 15:32:59.732826 xx119 INFO swss#supervisord 2024-07-17 15:32:59,732 INFO stopped: portmgrd (terminated by SIGTERM)
Jul 17 15:32:59.734000 xx119 INFO swss#supervisord 2024-07-17 15:32:59,733 INFO stopped: intfmgrd (terminated by SIGTERM)
Jul 17 15:32:59.735050 xx119 INFO swss#supervisord 2024-07-17 15:32:59,734 INFO stopped: vlanmgrd (terminated by SIGTERM)
Jul 17 15:33:00.737843 xx119 INFO swss#supervisord 2024-07-17 15:33:00,737 INFO stopped: neighsyncd (terminated by SIGTERM)
Jul 17 15:33:00.737843 xx119 INFO swss#supervisord: message repeated 10 times: [ orchagent ]
Jul 17 15:33:00.737843 xx119 INFO swss#supervisord: ndppd (error) Shutting down...
Jul 17 15:33:00.737884 xx119 INFO swss#supervisord: ndppd (notice) Bye
Jul 17 15:33:00.738500 xx119 INFO swss#supervisord 2024-07-17 15:33:00,738 INFO stopped: ndppd (exit status 0)
Jul 17 15:33:01.740548 xx119 INFO swss#supervisord 2024-07-17 15:33:01,739 INFO stopped: arp_update (terminated by SIGTERM)
Jul 17 15:33:01.740548 xx119 INFO swss#supervisord 2024-07-17 15:33:01,740 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd to die
Jul 17 15:33:01.741848 xx119 INFO swss#supervisord 2024-07-17 15:33:01,741 INFO stopped: coppmgrd (terminated by SIGTERM)
Jul 17 15:33:03.745768 xx119 INFO swss#supervisord 2024-07-17 15:33:03,745 INFO stopped: portsyncd (terminated by SIGTERM)

$ bcmcmd "bsv"
bsv
BRCM SAI ver: [8.4.39.2], OCP SAI ver: [1.11.0], SDK ver: [sdk-6.5.27] CANCUN ver: [06.12.00]
drivshell>
$

kenneth-arista commented 2 weeks ago

The fix from Broadcom is available in DNX SAI 11.2.7.1.
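A quick way to check whether a device already runs a SAI at or above that version is to parse the `bsv` output shown earlier in the thread. The sketch below is a minimal example under two assumptions: that the `BRCM SAI ver: [...]` field keeps the format seen above, and that dotted versions can be compared numerically component by component. The thread only states that the fix is in DNX SAI 11.2.7.1. It does not say whether other platforms or older release branches receive it, so treat the result as a hint rather than a verdict. The function names are hypothetical.

```python
import re

# Version named in the comment above as containing the fix (DNX SAI).
FIXED_DNX_SAI = (11, 2, 7, 1)


def parse_brcm_sai_version(bsv_output):
    """Extract the BRCM SAI version from `bcmcmd "bsv"` output as a tuple of ints."""
    match = re.search(r"BRCM SAI ver: \[([0-9.]+)\]", bsv_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def at_or_above_fix(version, fixed=FIXED_DNX_SAI):
    """Numeric, component-wise comparison (an assumption about version ordering)."""
    return version >= fixed


# Line copied from the bsv output earlier in the thread.
SAMPLE_BSV = (
    "BRCM SAI ver: [8.4.39.2], OCP SAI ver: [1.11.0], "
    "SDK ver: [sdk-6.5.27] CANCUN ver: [06.12.00]"
)

if __name__ == "__main__":
    version = parse_brcm_sai_version(SAMPLE_BSV)
    print(version, at_or_above_fix(version))
```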