sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0001 is seen during config reload #7317

Open dgsudharsan opened 3 years ago

dgsudharsan commented 3 years ago

Description

When config reload is issued with a PortChannel that is a member of a VLAN, the following log is seen in syslog:

Apr 14 01:24:14.912865 r-tigon-15 ERR swss#orchagent: :- removeLag: Failed to remove LAG PortChannel0002, it is still in VLAN

This is because teammgrd and teamsyncd perform cleanup during config reload while other modules such as VLAN do not, which trips the reference-count check in orchagent. The change below was introduced to clean up the interfaces in the kernel. However, since it also performs an APP_DB delete, orchagent processes the deletion, and because the references have not been cleared it throws the error.

https://github.com/Azure/sonic-swss/pull/1159

Steps to reproduce the issue:

  1. Create a PortChannel and add a member:
     config portchannel add PortChannel0002
     config portchannel member add PortChannel0002 Ethernet256
     (Please make sure the PortChannel is also configured on the peer end and the PortChannel status is shown as up.)
  2. Add the PortChannel to VLANs:
     config vlan add 40
     config vlan member add 40 PortChannel0002
     config vlan add 69
     config vlan member add 69 PortChannel0002
  3. Perform config save and config reload:
     config save -y
     config reload -y
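For convenience, the steps above as a single copy-paste sequence (SONiC CLI, run on the device; the port name and VLAN IDs are the ones from this report and should be adjusted per platform):

```shell
# Repro sequence for the reported error (device configuration commands;
# requires a matching PortChannel configured on the peer end).
config portchannel add PortChannel0002
config portchannel member add PortChannel0002 Ethernet256
config vlan add 40
config vlan member add 40 PortChannel0002
config vlan add 69
config vlan member add 69 PortChannel0002
config save -y
config reload -y
```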

Describe the results you received:

Got the error syslog shown above

Describe the results you expected:

No error syslog should be thrown

Output of show version:

SONiC Software Version: SONiC.SONIC.202012.62-a06e6d3_Internal
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: a06e6d3f
Build date: Sat Apr 10 17:08:08 UTC 2021
Built by: sw-r2d2-bot@r-build-sonic-ci02

Platform: x86_64-mlnx_msn4600c-r0
HwSKU: ACS-MSN4600C
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2053X21259
Uptime: 01:51:21 up  2:36,  2 users,  load average: 0.69, 1.00, 1.12

Docker images:
REPOSITORY                    TAG                                IMAGE ID            SIZE
docker-syncd-mlnx             SONIC.202012.62-a06e6d3_Internal   f4e74794df35        664MB
docker-syncd-mlnx             latest                             f4e74794df35        664MB
docker-snmp                   SONIC.202012.62-a06e6d3_Internal   2ec1a10aa494        440MB
docker-snmp                   latest                             2ec1a10aa494        440MB
docker-teamd                  SONIC.202012.62-a06e6d3_Internal   d80d3a9e3cdd        410MB
docker-teamd                  latest                             d80d3a9e3cdd        410MB
docker-nat                    SONIC.202012.62-a06e6d3_Internal   4fce03e7b64f        413MB
docker-nat                    latest                             4fce03e7b64f        413MB
docker-router-advertiser      SONIC.202012.62-a06e6d3_Internal   5277e788dabd        399MB
docker-router-advertiser      latest                             5277e788dabd        399MB
docker-platform-monitor       SONIC.202012.62-a06e6d3_Internal   3b9afe7f6dcf        690MB
docker-platform-monitor       latest                             3b9afe7f6dcf        690MB
docker-lldp                   SONIC.202012.62-a06e6d3_Internal   2a2437a01dfb        439MB
docker-lldp                   latest                             2a2437a01dfb        439MB
docker-dhcp-relay             SONIC.202012.62-a06e6d3_Internal   2b1c4bd48ddf        406MB
docker-dhcp-relay             latest                             2b1c4bd48ddf        406MB
docker-database               SONIC.202012.62-a06e6d3_Internal   612f04c44c81        399MB
docker-database               latest                             612f04c44c81        399MB
docker-orchagent              SONIC.202012.62-a06e6d3_Internal   43f5b5b19ed5        428MB
docker-orchagent              latest                             43f5b5b19ed5        428MB
docker-sonic-telemetry        SONIC.202012.62-a06e6d3_Internal   5d23ae65635f        489MB
docker-sonic-telemetry        latest                             5d23ae65635f        489MB
docker-sonic-mgmt-framework   SONIC.202012.62-a06e6d3_Internal   25a33c9c2b52        618MB
docker-sonic-mgmt-framework   latest                             25a33c9c2b52        618MB
docker-fpm-frr                SONIC.202012.62-a06e6d3_Internal   25f5ec95a813        427MB
docker-fpm-frr                latest                             25f5ec95a813        427MB
docker-sflow                  SONIC.202012.62-a06e6d3_Internal   08f215157069        410MB
docker-sflow                  latest                             08f215157069        410MB
docker-wjh                    202012.202012.0-4bb0b02            5cc446fc4500        504MB
docker-wjh                    latest                             5cc446fc4500        504MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

This issue is easily reproducible with the steps mentioned and occurs every time. sonic_dump_r-tigon-15_20210414_011349.tar.gz

anshuv-mfst commented 3 years ago

@judyjoseph - could you please take a look, thanks.

dgsudharsan commented 3 years ago

@judyjoseph @anshuv-mfst Is there any update on this?

judyjoseph commented 3 years ago

@dgsudharsan will check on this today to see if I can repro and will update.

judyjoseph commented 3 years ago

@dgsudharsan I tried this case; yes, the error message is there, but it appears while the processes are going down (right, as you mentioned, the team* daemons are cleaning up interfaces). After config reload, once the processes are back up, I see the port channel in a good state. Is that similar to the behavior you see?

dgsudharsan commented 3 years ago

@judyjoseph I don't think there is a functional impact, as I mentioned in the description. However, log analyzers in customer deployments would raise false alarms because of this error message, and I believe the syslog should be clear of errors. In that case, may I ask why the APP_DB entry is deleted when the actual issue is with clearing the netlink state? I feel that in the cleanup scenario (when the task exits) the APP_DB deletion need not be performed. Please let me know your thoughts on this.

judyjoseph commented 3 years ago

@dgsudharsan, I agree that ideally the APP_DB entry deletion would be triggered by the NETLINK DEL message. But here, when all the processes go down, teamsyncd won't be waiting on NETLINK messages to do the cleanup (we might miss events) when teammgrd removes the LAG here: https://github.com/Azure/sonic-swss/blob/e29d566efb31378fbeac61f0b1a7dbd690d7e287/cfgmgr/teammgr.cpp#L492. We need to clean up all LAG entries from APP_DB as well.