Mclag - iccpd crash, orchagent crash #16075

Open · mstroecker opened this issue 11 months ago

mstroecker commented 11 months ago

Praveen Elagala already provided some analysis in Google Groups: https://groups.google.com/g/sonicproject/c/00rnM19XgDs

Description

We encountered a problem regarding iccpd and MC-LAG. We use two switches, leafa and leafb (Model), in an L2 scenario. We followed the configuration example at: https://support.edge-core.com/hc/en-us/articles/900002380706--Enterprise-SONiC-MC-LAG

Since the official build does not include iccpd, we built an image from the 202305 branch with iccpd enabled (202205 has the same problem).

We are using PortChannel01 as the peer link, with two Ethernet interfaces and one MCLAG instance on that peer link. After that we added some PortChannels on both sides and tested this configuration for some weeks without a problem. But at some point iccpd crashed and the MCLAG pair was broken. We had to reboot the switches; after a few seconds of running as expected, iccpd crashed again and left the MCLAG pair in a running but broken state.

We tried to debug this situation and saw that, if we run only one MCLAG-enabled switch (leafb, for example), the MCLAG is in error state but we can still see the known MAC addresses with mclagdctl -i 1 dump mac. We then wanted to re-add leafa. To circumvent any configuration diffs in the PortChannels, we removed all MCLAG PortChannels from leafa (only the management interface and the peer link are configured) and applied the MCLAG-related config:

config mclag add 1 192.168.10.1 192.168.10.2 PortChannel01
config mclag unique-ip add Vlan10
config interface ip add Vlan10 192.168.10.1/24

Right after the last command on leafa, iccpd crashes on leafb. After rebooting, both switches work as before.

In the logs we found the following line on both switches:

Jun 21 18:04:08.128694 leafa INFO iccpd#supervisord: iccpd *** stack smashing detected ***: terminated
Jun 22 18:03:26.217470 leafb INFO iccpd#supervisord: iccpd *** stack smashing detected ***: terminated

We also found these lines near the ones above:

Jun 21 18:04:08.128694 leafa ERR swss#orchagent: :- setMembers: Port Ethe not supported
Jun 22 18:03:26.253784 leafb ERR swss#orchagent: :- setMembers: Port Eth not supported

As you can see, the port name ("Ethe"/"Eth") seems to be cut off. By the way: we currently have only a single Ethernet uplink on leafa, which is shared across the peerlink. We also tried removing it on leafa and starting the MCLAG pair again, without any luck; iccpd crashes with the same error/behavior.
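
For context on the "stack smashing detected" message: glibc prints it when a function returns and finds its stack canary overwritten, i.e. something wrote past the end of an on-stack buffer. A minimal sketch (not iccpd code) that reproduces the same abort when compiled with gcc's stack protector:

#include <string.h>

/* Writing past the end of a fixed on-stack buffer corrupts the canary
 * the compiler places in front of the saved return address. When the
 * function returns, __stack_chk_fail() aborts the process with
 * "*** stack smashing detected ***". */
static void overflow(const char *input)
{
    char buf[16];
    strcpy(buf, input);   /* unchecked copy: overflows for long input */
}

int main(void)
{
    overflow("PortChannel01,PortChannel02,PortChannel03");
    return 0;             /* never reached; the abort fires when overflow() returns */
}

Compiled with gcc -fstack-protector-strong demo.c, running it aborts with exactly the message seen in the syslog above.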

To be clear, we first had this problem when both switches had the full MCLAG PortChannel setup. We created tech-support files on both switches right after the crash and before rebooting them.

Steps to reproduce the issue:

  1. Create the described scenario
  2. Reboot both switches
  3. Wait a couple of seconds

Describe the results you received:

It seems to be okay for a few seconds; after that the state below appears (core dumps for iccpd and orchagent are available):

root@leafb:~# mclagdctl -i 1 dump state
The MCLAG's keepalive is: ERROR
MCLAG info sync is: incomplete
Domain id: 1
Local Ip: 192.168.10.2
Peer Ip: 192.168.10.1
Peer Link Interface: PortChannel01
Keepalive time: 1
sesssion Timeout : 15
Peer Link Mac: 64:9d:99:3a:d8:cc 
Role: Standby
MCLAG Interface: PortChannel05,PortChannel03,PortChannel02,PortChannel18,PortChannel21,PortChannel17,PortChannel16,PortChannel13,PortChannel11,PortChannel19,PortChannel14,PortChannel06,PortChannel10,PortChannel20,PortChannel24,PortChannel09,PortChannel12,PortChannel15,PortChannel23,PortChannel07,PortChannel26,PortChannel08,PortChannel25,PortChannel04,PortChannel22
Loglevel: NOTICE
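
Note the size of the MCLAG interface list above: 25 PortChannels. Joined with commas, that is 25 x 13 name characters plus 24 separators plus a terminating NUL, i.e. 350 bytes, so any fixed buffer smaller than that would overflow. Whether iccpd actually builds such a joined list into a fixed buffer is our assumption (see the stack traces below); the arithmetic itself is easy to check:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 25 MCLAG PortChannels, as reported by "mclagdctl -i 1 dump state" */
    const size_t n = 25;
    const size_t name_len = strlen("PortChannelXX");   /* 13 characters each */

    /* joined length: names + (n - 1) commas + terminating NUL */
    printf("joined list needs %zu bytes\n", n * name_len + (n - 1) + 1);
    return 0;   /* prints: joined list needs 350 bytes */
}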

Describe the results you expected:

A working mclag state.

Output of show version:

admin@leafb:~$ show version

SONiC Software Version: SONiC.202305.0-dirty-20230621.085841
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: 65c15aa1f
Build date: Wed Jun 21 09:41:21 UTC 2023
Built by: builder@scw-awesome-hopper

Platform: x86_64-accton_as7326_56x-r0
HwSKU: Accton-AS7326-56X
ASIC: broadcom
ASIC Count: 1
Serial Number: HWCG2106249719N00021
Model Number: FP4EC7656200Z
Hardware Revision: N/A
Uptime: 06:58:44 up 46 days, 12:49,  1 user,  load average: 1.32, 1.03, 0.89
Date: Tue 08 Aug 2023 06:58:44

Docker images:
REPOSITORY                    TAG                              IMAGE ID       SIZE
docker-orchagent              202305.0-dirty-20230621.085841   d7d03428f0a3   328MB
docker-orchagent              latest                           d7d03428f0a3   328MB
docker-fpm-frr                202305.0-dirty-20230621.085841   1950b59302ad   346MB
docker-fpm-frr                latest                           1950b59302ad   346MB
docker-nat                    202305.0-dirty-20230621.085841   050b81a61dbe   319MB
docker-nat                    latest                           050b81a61dbe   319MB
docker-sflow                  202305.0-dirty-20230621.085841   0a180d33299a   318MB
docker-sflow                  latest                           0a180d33299a   318MB
docker-teamd                  202305.0-dirty-20230621.085841   48dcede9e7bf   316MB
docker-teamd                  latest                           48dcede9e7bf   316MB
docker-iccpd                  202305.0-dirty-20230621.085841   60d184ea8d50   316MB
docker-iccpd                  latest                           60d184ea8d50   316MB
docker-macsec                 latest                           916697d24a6a   319MB
docker-syncd-brcm             202305.0-dirty-20230621.085841   0ab52f956a84   673MB
docker-syncd-brcm             latest                           0ab52f956a84   673MB
docker-gbsyncd-broncos        202305.0-dirty-20230621.085841   97d5f414ffc6   348MB
docker-gbsyncd-broncos        latest                           97d5f414ffc6   348MB
docker-gbsyncd-credo          202305.0-dirty-20230621.085841   181ce966730b   321MB
docker-gbsyncd-credo          latest                           181ce966730b   321MB
docker-dhcp-relay             latest                           36a1cfe4ae7d   306MB
docker-eventd                 202305.0-dirty-20230621.085841   7662206b046e   299MB
docker-eventd                 latest                           7662206b046e   299MB
docker-platform-monitor       202305.0-dirty-20230621.085841   491c3c8da657   420MB
docker-platform-monitor       latest                           491c3c8da657   420MB
docker-snmp                   202305.0-dirty-20230621.085841   48b54648a0b3   338MB
docker-snmp                   latest                           48b54648a0b3   338MB
docker-sonic-telemetry        202305.0-dirty-20230621.085841   bc804f2edb8e   599MB
docker-sonic-telemetry        latest                           bc804f2edb8e   599MB
docker-sonic-p4rt             202305.0-dirty-20230621.085841   995fa1615289   870MB
docker-sonic-p4rt             latest                           995fa1615289   870MB
docker-lldp                   202305.0-dirty-20230621.085841   f0170b81c990   341MB
docker-lldp                   latest                           f0170b81c990   341MB
docker-database               202305.0-dirty-20230621.085841   c455a6a8aae8   299MB
docker-database               latest                           c455a6a8aae8   299MB
docker-mux                    202305.0-dirty-20230621.085841   fd75979c72bd   347MB
docker-mux                    latest                           fd75979c72bd   347MB
docker-router-advertiser      202305.0-dirty-20230621.085841   8a14865049c1   299MB
docker-router-advertiser      latest                           8a14865049c1   299MB
docker-sonic-mgmt-framework   202305.0-dirty-20230621.085841   907c8c41dea6   414MB
docker-sonic-mgmt-framework   latest                           907c8c41dea6   414MB
prom/node-exporter            v1.3.1                           1dbe0e931976   20.9MB

Output of show techsupport:

https://crossmediasolutions-my.sharepoint.com/:f:/g/personal/m_stroecker_4allportal_com/EtcT8kAQtZxDpGLIv1bSCJkB2_VPhsv-3yOoy2li3XOxug?e=gi8kbF

Additional information you deem important (e.g. issue happens only occasionally):

I built the images with symbols and generated the requested stack traces:

ICCPD:

docker run -it -v $PWD:/work --entrypoint bash docker-iccpd-dbg
root@ceb8db9ff0c9:/# gdb /usr/bin/iccpd /work/iccpd.1687374847.23.core
[...snip]
(gdb) bt
#0  0x00007f5ef0a09ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f5ef09f3537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f5ef0a4b3a8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f5ef0adc542 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f5ef0adc520 in __stack_chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00005566de51a845 in update_peerlink_isolate_from_all_csm_lif (csm=0x5566df6e49b0) at mlacp_link_handler.c:1209
#6  0x00005566de51a9e3 in set_peerlink_mlag_port_isolate (csm=0x5566df6e49b0, lif=0x7ffdb6b6e030, lif@entry=0x5566df6ee480, enable=1, is_unbind_pending=225) at mlacp_link_handler.c:1233
#7  0x00005566de51accc in update_peerlink_isolate_from_lif (csm=csm@entry=0x5566df6e49b0, lif=lif@entry=0x5566df6ee480, lif_po_state=lif_po_state@entry=1) at mlacp_link_handler.c:1368
#8  0x00005566de51dc67 in update_peerlink_isolate_from_lif (lif_po_state=1, lif=0x5566df6ee480, csm=0x5566df6e49b0) at mlacp_link_handler.c:1776
#9  mlacp_portchannel_state_handler (csm=0x5566df6e49b0, local_if=0x5566df6ee480, po_state=1) at mlacp_link_handler.c:2104
#10 0x00005566de521346 in mlacp_portchannel_state_handler (po_state=<optimized out>, local_if=0x5566df6ee480, csm=0x5566df6e49b0) at mlacp_link_handler.c:2094
#11 mlacp_peer_conn_handler (csm=csm@entry=0x5566df6e49b0) at mlacp_link_handler.c:2281
#12 0x00005566de5271c8 in mlacp_fsm_transit (csm=csm@entry=0x5566df6e49b0) at mlacp_fsm.c:916
#13 0x00005566de517bc8 in scheduler_transit_fsm () at scheduler.c:116
#14 scheduler_loop () at scheduler.c:479
#15 0x00005566de517c97 in scheduler_start () at scheduler.c:534
#16 0x00005566de50cc5d in main (argc=<optimized out>, argv=0x7ffdb6b6e990) at iccp_main.c:266
(gdb)
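
Frame #5 places the canary failure in update_peerlink_isolate_from_all_csm_lif() (mlacp_link_handler.c:1209), i.e. the stack is already corrupted when that function returns. Together with the truncated port names orchagent logs, this is consistent with a fixed-size buffer that collects the peer-link isolation member list overflowing once enough MCLAG PortChannels are configured. A simplified sketch of the suspected pattern, plus a bounds-checked alternative; buffer name and size are hypothetical, not the actual iccpd code:

#include <stdio.h>
#include <string.h>

#define IF_NAME_LEN 16

/* Suspected pattern: comma-joining all member names into a fixed
 * on-stack buffer with no room check. With 25 PortChannels the joined
 * list (~350 bytes) overruns the buffer and trips the stack canary. */
void build_member_list_unsafe(char names[][IF_NAME_LEN], int n)
{
    char buf[128];   /* hypothetical fixed size */
    buf[0] = '\0';
    for (int i = 0; i < n; i++) {
        if (i > 0)
            strcat(buf, ",");
        strcat(buf, names[i]);   /* unchecked: can overflow */
    }
    puts(buf);
}

/* Bounds-checked variant: stop appending once the buffer is full, so a
 * long member list truncates cleanly instead of smashing the stack. */
void build_member_list_safe(char names[][IF_NAME_LEN], int n)
{
    char buf[128];
    size_t used = 0;
    buf[0] = '\0';
    for (int i = 0; i < n; i++) {
        int w = snprintf(buf + used, sizeof(buf) - used, "%s%s",
                         i ? "," : "", names[i]);
        if (w < 0 || (size_t)w >= sizeof(buf) - used)
            break;   /* would not fit */
        used += (size_t)w;
    }
    puts(buf);
}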

Orchagent:

docker run -it -v $PWD:/work --entrypoint bash docker-orchagent-dbg
root@4f2fd291d05c:/# gdb /usr/bin/orchagent /work/orchagent.1687370636.52.core
[...snip]
(gdb) bt
#0  0x00007f6a7d8fdce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f6a7d8e7537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000055d8918c237c in handleSaiFailure (abort_on_failure=<optimized out>) at saihelper.cpp:771
#3  0x000055d891b0c6d7 in handleSaiRemoveStatus (api=api@entry=SAI_API_FDB, status=-2021033792, status@entry=-7, context=context@entry=0x0) at saihelper.cpp:700
#4  0x000055d891ab44e7 in FdbOrch::removeFdbEntry (this=0x55d8924ed2b0, entry=..., origin=<optimized out>) at fdborch.cpp:1621
#5  0x000055d891ab4dd5 in FdbOrch::doTask (this=0x55d8924ed2b0, consumer=...) at fdborch.cpp:853
#6  0x000055d8919898bd in Consumer::drain (this=0x55d8924e8000) at orch.cpp:264
#7  Consumer::drain (this=0x55d8924e8000) at orch.cpp:261
#8  Consumer::execute (this=0x55d8924e8000) at orch.cpp:258
#9  0x000055d8919795f8 in OrchDaemon::start (this=this@entry=0x55d8924a1100) at orchdaemon.cpp:769
#10 0x000055d8918f6da6 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:766
(gdb)
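
The orchagent crash, by contrast, looks like a deliberate abort rather than memory corruption: FdbOrch::removeFdbEntry() received SAI status -7, which is SAI_STATUS_ITEM_NOT_FOUND in the SAI headers, and handleSaiRemoveStatus()/handleSaiFailure() treat that as fatal, presumably so a core dump is captured. A simplified sketch of that abort-on-unexpected-status pattern (not the actual saihelper.cpp code):

#include <stdio.h>
#include <stdlib.h>

/* SAI status values involved here (values from the SAI headers) */
#define SAI_STATUS_SUCCESS        0
#define SAI_STATUS_ITEM_NOT_FOUND (-7)

/* Sketch of the pattern visible in frames #2/#3: any status the
 * handler does not expect is logged and turned into abort(), which
 * is what produces the orchagent core dump. */
static void handle_sai_remove_status(int status)
{
    if (status == SAI_STATUS_SUCCESS)
        return;
    fprintf(stderr, "unexpected SAI remove status %d\n", status);
    abort();
}

int main(void)
{
    /* The FDB entry was already gone by the time orchagent tried to
     * remove it (plausibly flushed while iccpd was crashing), so the
     * remove call reported ITEM_NOT_FOUND. */
    handle_sai_remove_status(SAI_STATUS_ITEM_NOT_FOUND);
    return 0;   /* never reached */
}

Whether the two crashes share a root cause is not certain, but the truncated port name in the setMembers error suggests that corrupted state from iccpd does reach orchagent.
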
selvatechtalk commented 1 month ago

May I know if there is any fix available for "Issue 1"? Issue #1: for ICCPd, below are the logs during the crash. It looks like some of the ebtables updates are not supported.

mstroecker commented 4 weeks ago

Hi @selvatechtalk, unfortunately, we were not able to fix it and stopped testing SONiC. It's been a while though; maybe something has changed on the ICCPd front.