sonic-net / sonic-utilities

Command line utilities for the SONiC project

[GCU/swss] ERR#swss removeLag error in SYSLOG #2156

Open wen587 opened 2 years ago

wen587 commented 2 years ago

Description

There appears to be an execution delay in swss when a GCU JSON change is applied. The delay causes a SYSLOG ERR about removeLag. Possibly related code: (executed before portchannel removal)

See below for more details.

Steps to reproduce the issue

  1. Add a portchannel and its interface to ConfigDB through GCU.
    
    admin@vlab-01:~/po/test$ cat tc1.json
    [
        {"path": "/PORTCHANNEL/PortChannel0005", "value": {"admin_status": "up"}, "op": "add"},
        {"path": "/PORTCHANNEL_INTERFACE/PortChannel0005", "value": {}, "op": "add"},
        {"path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131", "value": {}, "op": "add"},
        {"path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126", "value": {}, "op": "add"}]

    admin@vlab-01:~/po/test$ sudo config apply-patch tc1.json
    ...
    Patch Applier: Applying 4 changes in order:
    Patch Applier: [{"op": "add", "path": "/PORTCHANNEL/PortChannel0005", "value": {"admin_status": "up"}}]
    Patch Applier: [{"op": "add", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005", "value": {}}]
    Patch Applier: [{"op": "add", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131", "value": {}}]
    Patch Applier: [{"op": "add", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126", "value": {}}]
    Patch Applier: Verifying patch updates are reflected on ConfigDB.
    Patch Applier: Patch application completed.
    Patch applied successfully.
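A note on the paths in the patch above: the `~1` sequences follow JSON Pointer escaping (RFC 6901), where `/` inside a key is written as `~1` and `~` as `~0`. So `10.0.0.64~131` stands for the key `10.0.0.64/31`. A minimal sketch of that escaping follows; `config_path` is a hypothetical helper for illustration and not part of GCU.

```python
def escape_token(token: str) -> str:
    """Escape one JSON Pointer reference token per RFC 6901."""
    # '~' must be escaped before '/', otherwise the '~' introduced
    # by '~1' would be escaped a second time.
    return token.replace("~", "~0").replace("/", "~1")


def unescape_token(token: str) -> str:
    """Reverse escape_token: decode '~1' first, then '~0'."""
    return token.replace("~1", "/").replace("~0", "~")


def config_path(table: str, key: str) -> str:
    """Build a patch path from a table name and key (hypothetical helper)."""
    return "/" + escape_token(table) + "/" + escape_token(key)


print(config_path("PORTCHANNEL_INTERFACE", "PortChannel0005|10.0.0.64/31"))
print(unescape_token("PortChannel0005|FC00::81~1126"))
```

The first call prints the third path from `tc1.json`; the second decodes the IPv6 key back to `PortChannel0005|FC00::81/126`.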

2. Remove or roll back the previous change, then check SYSLOG for ERR messages.

    admin@vlab-01:~/po/test$ cat tc1_rm.json
    [
        {
            "op": "remove",
            "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"
        },
        {
            "op": "remove",
            "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"
        },
        {
            "op": "remove",
            "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"
        },
        {
            "op": "remove",
            "path": "/PORTCHANNEL/PortChannel0005"
        }
    ]
    admin@vlab-01:~/po/test$ sudo config apply-patch tc1_rm.json
    ...
    Patch Applier: Applying 4 changes in order:
    Patch Applier: [{"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"}]
    Patch Applier: [{"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"}]
    Patch Applier: [{"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"}]
    Patch Applier: [{"op": "remove", "path": "/PORTCHANNEL/PortChannel0005"}]
    Patch Applier: Verifying patch updates are reflected on ConfigDB.
    Patch Applier: Patch application completed.
    Patch applied successfully.

_SYSLOG ERR:_

    May 10 03:17:06.199268 vlab-01 ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel0005
    May 10 03:17:06.199325 vlab-01 ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0005'. Skipping
    May 10 03:17:07.241855 vlab-01 ERR swss#intfmgrd: :- setIntfVrf: Command '/sbin/ip link set "PortChannel0005" nomaster' failed with rc 1
    May 10 03:17:07.241855 vlab-01 ERR swss#orchagent: message repeated 4 times: [ :- removeLag: Failed to remove ref count 1 LAG PortChannel0005]

From my understanding, `ERR teamd#tlm_teamd: :- get_dump` is acceptable.
It occurs even when using the config CLI `config portchannel del <>`.

3. If we remove PORTCHANNEL_INTERFACE first and then remove PORTCHANNEL, there is no error, so I suspect an execution delay in swss.
Splitting the removal into two apply-patch runs produces no error:

    admin@vlab-01:~/po/test$ cat tc1_part1.json
    [
        {
            "op": "remove",
            "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"
        },
        {
            "op": "remove",
            "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"
        },
        {
            "op": "remove",
            "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"
        }
    ]

    admin@vlab-01:~/po/test$ cat tc1_part2.json
    [
        {
            "op": "remove",
            "path": "/PORTCHANNEL/PortChannel0005"
        }
    ]
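The manual split in step 3 can be sketched in a few lines of Python. This only illustrates the workaround (PORTCHANNEL_INTERFACE removals in one patch, everything else in a second one); `split_removals` is a hypothetical helper and not part of GCU.

```python
import json


def split_removals(patch, first_table="PORTCHANNEL_INTERFACE"):
    """Split a patch into (ops under first_table, all remaining ops)."""
    # The trailing '/' keeps '/PORTCHANNEL/...' from matching
    # the '/PORTCHANNEL_INTERFACE/' prefix and vice versa.
    prefix = "/" + first_table + "/"
    first = [op for op in patch if op["path"].startswith(prefix)]
    rest = [op for op in patch if not op["path"].startswith(prefix)]
    return first, rest


full_patch = [
    {"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|FC00::81~1126"},
    {"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005|10.0.0.64~131"},
    {"op": "remove", "path": "/PORTCHANNEL_INTERFACE/PortChannel0005"},
    {"op": "remove", "path": "/PORTCHANNEL/PortChannel0005"},
]

part1, part2 = split_removals(full_patch)
print(json.dumps(part1, indent=4))
print(json.dumps(part2, indent=4))
```

Saving the two outputs as `tc1_part1.json` and `tc1_part2.json` and applying them one after the other with `sudo config apply-patch` corresponds to the two-step removal described above.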


#### Describe the results you received
SYSLOG ERR when removing a PortChannel and its PortChannel interface together through GCU.

#### Describe the results you expected
Not sure whether we can avoid that ERR or should just treat it as a valid message.

#### Additional information you deem important (e.g. issue happens only occasionally)

#### Output of `show version`

    admin@vlab-01:~/po/test$ show ver

    SONiC Software Version: SONiC.master-10763.96436-aa5cdcc51
    Distribution: Debian 11.3
    Kernel: 5.10.0-8-2-amd64
    Build commit: aa5cdcc51
    Build date: Fri May 6 06:25:04 UTC 2022
    Built by: AzDevOps@sonic-build-workers-001HFQ

    Platform: x86_64-kvm_x86_64-r0
    HwSKU: Force10-S6000
    ASIC: vs
    ASIC Count: 1
    Serial Number: N/A
    Model Number: N/A
    Hardware Revision: N/A
    Uptime: 03:22:25 up 23:34, 2 users, load average: 0.07, 0.16, 0.17
    Date: Tue 10 May 2022 03:22:25



ghooo commented 2 years ago

A few questions:

wen587 commented 2 years ago

A few questions:

  • Regarding `config portchannel del <>`, are you saying the same errors occur there, or only `ERR teamd#tlm_teamd: :- get_dump`?

Only ERR teamd#tlm_teamd: :- get_dump

  • Also, what does the orchagent error `Failed to remove ref count 1 LAG PortChannel0005` mean? Is it checking redis or something else? If it is a redis issue, maybe we need to double-check how we interact with redis; if it is async, I think we should make it sync or introduce some wait.

After reading https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L5105-L5115, I think it is not related to redis. It looks like an async issue. I am not sure why m_port_ref_count is not 0 during the PortChannel Interface removal.

  • What is LAG? Why is the error referring to it?

LAG stands for link aggregation group, which is called PortChannel in our code base. LAG removal refers to PortChannel removal.
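To make the ref count message concrete, here is a toy model of reference-counted removal. It is not orchagent code and makes no claim about how `m_port_ref_count` is actually maintained; the class and method names are hypothetical. It only shows how a removal request that arrives before a dependent reference has been released would fail once and succeed on a later attempt.

```python
class ToyLagTable:
    """Toy model of reference-counted LAG removal (not orchagent code)."""

    def __init__(self):
        self.ref_count = {}

    def add_lag(self, name):
        self.ref_count[name] = 0

    def add_reference(self, name):
        # For example, an interface configured on top of the LAG.
        self.ref_count[name] += 1

    def release_reference(self, name):
        self.ref_count[name] -= 1

    def remove_lag(self, name):
        """Return True if the LAG was removed, False if still referenced."""
        count = self.ref_count[name]
        if count > 0:
            print(f"Failed to remove ref count {count} LAG {name}")
            return False
        del self.ref_count[name]
        return True


table = ToyLagTable()
table.add_lag("PortChannel0005")
table.add_reference("PortChannel0005")

# Removal requested while the reference is still held: refused.
first_try = table.remove_lag("PortChannel0005")

# Once the dependent object is cleaned up, a retry succeeds.
table.release_reference("PortChannel0005")
second_try = table.remove_lag("PortChannel0005")
print(first_try, second_try)
```

In this model the first attempt logs the failure and the retry goes through, which would be consistent with an error that repeats a few times in SYSLOG but leaves the final state correct.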

wen587 commented 2 years ago

It does not affect the final result. The current workaround is to keep the error in the Log Analyzer ignore list.
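As a rough illustration of what an ignore list does, the sketch below filters ERR lines against a set of regular expressions. The patterns and the `unexpected_errors` helper are assumptions made for this example; the real Log Analyzer ignore list may use a different format and different expressions.

```python
import re

# Hypothetical ignore patterns, written for this example only.
IGNORE_PATTERNS = [
    re.compile(r"removeLag: Failed to remove ref count \d+ LAG PortChannel\d+"),
    re.compile(r"get_dump: Can't get dump for LAG 'PortChannel\d+'"),
]


def unexpected_errors(lines):
    """Return ERR lines that match none of the ignore patterns."""
    return [
        line
        for line in lines
        if " ERR " in line
        and not any(pattern.search(line) for pattern in IGNORE_PATTERNS)
    ]


syslog = [
    "May 10 03:17:06.199268 vlab-01 ERR swss#orchagent: :- removeLag: "
    "Failed to remove ref count 1 LAG PortChannel0005",
    "May 10 03:17:06.199325 vlab-01 ERR teamd#tlm_teamd: :- get_dump: "
    "Can't get dump for LAG 'PortChannel0005'. Skipping",
    "May 10 03:17:07.241855 vlab-01 ERR swss#intfmgrd: :- setIntfVrf: "
    "Command '/sbin/ip link set \"PortChannel0005\" nomaster' failed with rc 1",
]

remaining = unexpected_errors(syslog)
for line in remaining:
    print(line)
```

With these example patterns only the `setIntfVrf` line is still reported; whether that message should also be ignored is not settled in this thread.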