sonic-net / sonic-buildimage


[dualtor] Failed to set attribute dscp_mode with value uniform - triggers OA shutdown #9957

Closed vaibhavhd closed 2 years ago

vaibhavhd commented 2 years ago

Description

Setting the tunnel attribute dscp_mode to uniform fails with SAI_STATUS_NOT_IMPLEMENTED. This triggers an orchagent (OA) shutdown, and OA never recovers:

Feb 10 18:47:57.239945 str2-7050cx3-acs-06 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_ATTR_DECAP_DSCP_MODE: SAI_TUNNEL_DSCP_MODE_UNIFORM_MODEL
Feb 10 18:47:57.241992 str2-7050cx3-acs-06 ERR swss#orchagent: :- set: set status: SAI_STATUS_NOT_IMPLEMENTED
Feb 10 18:47:57.242084 str2-7050cx3-acs-06 ERR swss#orchagent: :- setTunnelAttribute: Failed to set attribute dscp_mode with value uniform
Feb 10 18:47:57.242123 str2-7050cx3-acs-06 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_TUNNEL, status: SAI_STATUS_NOT_IMPLEMENTED

Steps to reproduce the issue:

  1. Install 202012 image on dualtor testbed
  2. Issue a warm reboot on one of the ToRs. No test or I/O traffic is needed to reproduce.
  3. Check logs for failures.
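
For step 3, a small filter can pull the relevant errors out of a large syslog. The following is a minimal sketch, assuming the syslog line layout seen in this report (timestamp, host, level, `container#process:`, message). The function name `find_sai_failures` is hypothetical and not part of any SONiC tool.

```python
import re

# Layout assumed from the log excerpts in this report:
# "<Mon> <day> <time> <host> <LEVEL> <container>#<process>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:.]+) (?P<host>\S+) (?P<level>[A-Z]+) "
    r"(?P<source>\S+?): (?P<msg>.*)$"
)


def find_sai_failures(lines):
    """Return (timestamp, source, message) for ERR lines that carry a
    non-success SAI status code."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERR":
            continue
        msg = m.group("msg")
        if "SAI_STATUS_" in msg and "SAI_STATUS_SUCCESS" not in msg:
            hits.append((m.group("ts"), m.group("source"), msg))
    return hits
```

Feeding it the lines of a saved syslog file (for example `find_sai_failures(open(path))`, with `path` chosen by the reader) should list the syncd and orchagent errors that precede the orchagent exit.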

Describe the results you received:

Feb 10 18:47:57.076147 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Bind policer to trap group default:
Feb 10 18:47:57.082141 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- processCoppRule: Create host interface trap group queue1_group1
Feb 10 18:47:57.082291 str2-7050cx3-acs-06 WARNING swss#orchagent: :- trapGroupUpdatePolicer: Creating policer for existing Trap group: 110000000009e7 (name:queue1_group1).
Feb 10 18:47:57.091740 str2-7050cx3-acs-06 NOTICE pmon#ycable[35]: y_cable_port 2: Initialized simulated y_cable driver, port=2, index=1
Feb 10 18:47:57.093495 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Create policer for trap group queue1_group1
Feb 10 18:47:57.094592 str2-7050cx3-acs-06 WARNING mux#linkmgrd: message repeated 8 times: [ MuxManager.cpp:170 addOrUpdateMuxPortLinkState: Ethernet108: link state: down]
Feb 10 18:47:57.094801 str2-7050cx3-acs-06 WARNING mux#linkmgrd: MuxManager.cpp:170 addOrUpdateMuxPortLinkState: Ethernet96: link state: up
Feb 10 18:47:57.096932 str2-7050cx3-acs-06 INFO lldp#lldp-syncd [lldp_syncd] INFO: Failed to get system capabilities on eth0 (50:2f:a8:a5:03:f1)
Feb 10 18:47:57.098086 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Bind policer to trap group queue1_group1:
Feb 10 18:47:57.102384 str2-7050cx3-acs-06 WARNING syncd#syncd: [none] SAI_API_HOSTIF:_brcm_sai_set_cpu_queue_shaper:13127 Set CPU Queue 1 shaping: cir 6000, cbs 24576
Feb 10 18:47:57.116085 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- processCoppRule: Create host interface trap group queue1_group2
Feb 10 18:47:57.116139 str2-7050cx3-acs-06 WARNING swss#orchagent: :- trapGroupUpdatePolicer: Creating policer for existing Trap group: 110000000009ea (name:queue1_group2).
Feb 10 18:47:57.118477 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Create policer for trap group queue1_group2
Feb 10 18:47:57.122574 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Bind policer to trap group queue1_group2:
Feb 10 18:47:57.152175 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- processCoppRule: Create host interface trap group queue2_group1
Feb 10 18:47:57.152234 str2-7050cx3-acs-06 WARNING swss#orchagent: :- trapGroupUpdatePolicer: Creating policer for existing Trap group: 110000000009ee (name:queue2_group1).
Feb 10 18:47:57.154027 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Create policer for trap group queue2_group1
Feb 10 18:47:57.155355 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Bind policer to trap group queue2_group1:
Feb 10 18:47:57.161374 str2-7050cx3-acs-06 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet16 admin:1 oper:1 addr:d4:af:f7:4d:a4:44 ifindex:25 master:9
Feb 10 18:47:57.161639 str2-7050cx3-acs-06 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet24 admin:1 oper:1 addr:d4:af:f7:4d:a4:44 ifindex:26 master:9
Feb 10 18:47:57.161758 str2-7050cx3-acs-06 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet20 admin:1 oper:1 addr:d4:af:f7:4d:a4:44 ifindex:24 master:9
Feb 10 18:47:57.174348 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- processCoppRule: Create host interface trap group queue4_group1
Feb 10 18:47:57.183921 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- processCoppRule: Create host interface trap group queue4_group2
Feb 10 18:47:57.184203 str2-7050cx3-acs-06 WARNING swss#orchagent: :- trapGroupUpdatePolicer: Creating policer for existing Trap group: 110000000009f7 (name:queue4_group2).
Feb 10 18:47:57.185950 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Create policer for trap group queue4_group2
Feb 10 18:47:57.189988 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- createPolicer: Bind policer to trap group queue4_group2:
Feb 10 18:47:57.194427 str2-7050cx3-acs-06 INFO syncd#syncd: [none] SAI_API_HOSTIF:_brcm_sai_tt_ucast_arp_trap_add:14641 TT Ucast Arp Mac [d4aff74da444] already added
Feb 10 18:47:57.211386 str2-7050cx3-acs-06 NOTICE pmon#ycable[35]: y_cable_port 7: Initialized simulated y_cable driver, port=7, index=6
Feb 10 18:47:57.215664 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- processCoppRule: Create host interface trap group queue4_group3
Feb 10 18:47:57.239821 str2-7050cx3-acs-06 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_NOT_IMPLEMENTED
Feb 10 18:47:57.239898 str2-7050cx3-acs-06 ERR syncd#syncd: :- processQuadEvent: VID: oid:0x2a000000000975 RID: oid:0x2a00000003
Feb 10 18:47:57.239945 str2-7050cx3-acs-06 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_ATTR_DECAP_DSCP_MODE: SAI_TUNNEL_DSCP_MODE_UNIFORM_MODEL
Feb 10 18:47:57.241992 str2-7050cx3-acs-06 ERR swss#orchagent: :- set: set status: SAI_STATUS_NOT_IMPLEMENTED
Feb 10 18:47:57.242084 str2-7050cx3-acs-06 ERR swss#orchagent: :- setTunnelAttribute: Failed to set attribute dscp_mode with value uniform
Feb 10 18:47:57.242123 str2-7050cx3-acs-06 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_TUNNEL, status: SAI_STATUS_NOT_IMPLEMENTED
Feb 10 18:47:57.242160 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- uninitialize: begin
Feb 10 18:47:57.242198 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- uninitialize: begin
Feb 10 18:47:57.242239 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin
Feb 10 18:47:57.242291 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread end
Feb 10 18:47:57.242328 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- clear_local_state: clearing local state
Feb 10 18:47:57.242365 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- meta_init_db: begin
Feb 10 18:47:57.263924 str2-7050cx3-acs-06 NOTICE pmon#ycable[35]: y_cable_port 24: Initialized simulated y_cable driver, port=24, index=23
Feb 10 18:47:57.268666 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- meta_init_db: end
Feb 10 18:47:57.268946 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- uninitialize: end
Feb 10 18:47:57.268946 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- stopRecording: stopped recording
Feb 10 18:47:57.268946 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec
Feb 10 18:47:57.268971 str2-7050cx3-acs-06 NOTICE swss#orchagent: :- uninitialize: end

After this, OA apparently never restarted. The logs show that OA was still not running 38 minutes later:

Feb 10 18:48:58.489869 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (1.0 minutes).
Feb 10 18:49:58.552638 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (2.0 minutes).
Feb 10 18:50:58.709279 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (3.0 minutes).
Feb 10 18:51:58.770235 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (4.0 minutes).
Feb 10 18:52:58.829272 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (5.0 minutes).
Feb 10 18:53:58.890567 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (6.0 minutes).
Feb 10 18:54:58.949263 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (7.0 minutes).
Feb 10 18:55:59.007521 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (8.0 minutes).
Feb 10 18:56:59.065269 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (9.0 minutes).
Feb 10 18:57:59.124082 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (10.0 minutes).
Feb 10 18:58:59.184322 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (11.0 minutes).
Feb 10 18:59:59.242557 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (12.0 minutes).
Feb 10 19:00:59.302655 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (13.0 minutes).
Feb 10 19:01:59.363029 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (14.0 minutes).
Feb 10 19:02:59.425078 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (15.0 minutes).
Feb 10 19:03:59.485128 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (16.0 minutes).
Feb 10 19:04:59.543130 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (17.0 minutes).
Feb 10 19:05:59.603267 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (18.0 minutes).
Feb 10 19:06:59.661152 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (19.0 minutes).
Feb 10 19:07:59.720740 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (20.0 minutes).
Feb 10 19:08:59.780418 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (21.0 minutes).
Feb 10 19:09:59.836849 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (22.0 minutes).
Feb 10 19:10:59.895864 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (23.0 minutes).
Feb 10 19:11:59.956282 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (24.0 minutes).
Feb 10 19:13:00.016565 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (25.0 minutes).
Feb 10 19:14:00.074421 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (26.0 minutes).
Feb 10 19:15:00.135211 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (27.0 minutes).
Feb 10 19:16:00.195579 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (28.0 minutes).
Feb 10 19:17:00.257117 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (29.0 minutes).
Feb 10 19:18:00.318259 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (30.0 minutes).
Feb 10 19:19:00.376290 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (31.0 minutes).
Feb 10 19:20:00.437097 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (32.0 minutes).
Feb 10 19:21:00.497422 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (33.0 minutes).
Feb 10 19:22:00.555180 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (34.0 minutes).
Feb 10 19:23:00.614356 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (35.0 minutes).
Feb 10 19:24:00.671977 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (36.0 minutes).
Feb 10 19:25:00.732426 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (37.0 minutes).
Feb 10 19:26:00.791232 str2-7050cx3-acs-06 ERR swss#/supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (38.0 minutes).
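
The repeated listener messages above can be condensed into a single downtime figure per process. This is a minimal sketch, assuming the message wording shown in these logs. The helper name `longest_reported_downtime` is hypothetical.

```python
import re

# Matches the supervisor-proc-exit-listener message text seen in the logs above.
DOWN_RE = re.compile(
    r"Process '(?P<proc>[^']+)' is not running in namespace '(?P<ns>[^']+)' "
    r"\((?P<minutes>[\d.]+) minutes\)"
)


def longest_reported_downtime(lines):
    """Map (process, namespace) to the largest downtime, in minutes,
    reported by the listener."""
    worst = {}
    for line in lines:
        m = DOWN_RE.search(line)
        if m:
            key = (m.group("proc"), m.group("ns"))
            worst[key] = max(worst.get(key, 0.0), float(m.group("minutes")))
    return worst
```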

Inside the swss container, no orchagent process is running:

root@str2-7050cx3-acs-06:/# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.2  0.1  32680 24504 pts/0    Ss+  18:47   0:06 /usr/bin/python3 /usr/local/bin/supervisord
root          28  0.0  0.1  27800 19500 pts/0    S    18:47   0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name swss
root          31  0.0  0.0 225856  5688 pts/0    Sl   18:47   0:00 /usr/sbin/rsyslogd -n -iNONE
root          36  0.0  0.0  81212  5052 pts/0    Sl   18:47   0:00 /usr/bin/portsyncd
root          59  0.0  0.0  88012  6768 pts/0    Sl   18:47   0:00 /usr/bin/coppmgrd
root          66  0.0  0.0   4096  3348 pts/0    S    18:47   0:00 /bin/bash /usr/bin/arp_update
root          67  0.0  0.0  81196  4800 pts/0    Sl   18:47   0:00 /usr/bin/neighsyncd
root          69  0.0  0.0  88072  8412 pts/0    Sl   18:47   0:00 /usr/bin/vlanmgrd
root          71  0.0  0.0  88068  8200 pts/0    Sl   18:47   0:00 /usr/bin/intfmgrd
root          73  0.0  0.0  88036  8368 pts/0    Sl   18:47   0:00 /usr/bin/portmgrd
root          75  0.0  0.0  88260  8416 pts/0    Sl   18:47   0:00 /usr/bin/buffermgrd -l /usr/share/sonic/hwsku/pg_profile_lookup.ini
root          97  0.0  0.0  88056  8224 pts/0    Sl   18:47   0:00 /usr/bin/vrfmgrd
root         109  0.0  0.0  87808  6816 pts/0    Sl   18:47   0:00 /usr/bin/nbrmgrd
root         124  0.0  0.0  88096  8168 pts/0    Sl   18:47   0:00 /usr/bin/vxlanmgrd
root         153  0.0  0.0  81116  4816 pts/0    Sl   18:47   0:00 /usr/bin/fdbsyncd
root         155  0.0  0.0  88036  8308 pts/0    Sl   18:47   0:00 /usr/bin/tunnelmgrd
root         258  0.0  0.0   5668  1632 pts/0    S    18:47   0:00 /usr/sbin/ndppd
root         571  0.4  0.2 131348 43908 pts/0    Sl   18:48   0:10 python3 /usr/bin/tunnel_packet_handler.py
root        2502  0.1  0.0   3984  3324 pts/1    Ss   19:28   0:00 bash
root        2758  0.0  0.0   2524   748 pts/0    S    19:28   0:00 sleep 300
root        2759  0.0  0.0   7640  2736 pts/1    R+   19:28   0:00 ps -aux
root@str2-7050cx3-acs-06:/#

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Techsupport:

sonic_dump_str2-7050cx3-acs-06_20220210_185619.tar.gz

gechiang commented 2 years ago

Investigated this issue and found 2 problems:

  1. Tunnel Mgr currently does not handle warm reboot, so it attempts to recreate the MUX tunnel, which already exists. This exposes problem 2 below. (@prsunny)
  2. Known BRCM issue: it can handle tunnel attributes only during tunnel creation and does not support SET operations on tunnel attributes. (CSP CS00012231236 created with BRCM)
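
The interaction between the two problems can be illustrated with a toy model. This is a sketch under stated assumptions, not SONiC or SAI code and not necessarily the fix that was adopted: `CreateOnlyTunnelBackend`, `reconcile_tunnel`, and `NotImplementedStatus` are hypothetical names. The backend accepts attributes only at create time, and replaying an identical configuration for an existing tunnel is treated as a no-op so that no SET is attempted.

```python
class NotImplementedStatus(Exception):
    """Stands in for a SAI_STATUS_NOT_IMPLEMENTED result (hypothetical)."""


class CreateOnlyTunnelBackend:
    """Toy backend: attributes are accepted at create time, SET is unsupported."""

    def __init__(self):
        self.tunnels = {}

    def create(self, name, attrs):
        self.tunnels[name] = dict(attrs)

    def set_attr(self, name, key, value):
        raise NotImplementedStatus(f"set {key}={value} on {name}")


def reconcile_tunnel(backend, name, attrs):
    """Create the tunnel if it is absent, do nothing if it already exists
    with identical attributes, otherwise attempt a SET per changed attribute."""
    existing = backend.tunnels.get(name)
    if existing is None:
        backend.create(name, attrs)
        return "created"
    changed = {k: v for k, v in attrs.items() if existing.get(k) != v}
    if not changed:
        return "unchanged"  # identical replay after warm reboot: nothing to do
    for key, value in changed.items():
        backend.set_attr(name, key, value)  # raises on this toy backend
    return "updated"
```

In this model, replaying the same tunnel configuration after a warm reboot returns "unchanged" without touching the backend, while a genuine attribute change still surfaces the unsupported SET.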