sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
730 stars 1.4k forks source link

[teamd] config portchannel del is resulting in an extra keyspace notification #9831

Open vivekrnv opened 2 years ago

vivekrnv commented 2 years ago

Description

A config portchannel delete is causing an extra/unnecessary SET notification before DEL in the state db

This behavior was traced back from the appearance of this error log in tlm_teamd, which was already reported and fixed here https://github.com/Azure/sonic-swss/pull/1629, although because of a different reason

  Jan 21 04:19:09.049965 sonic ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel1111'. Skipping
  Jan 21 04:19:09.051633 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel1111' has been removed.
  Jan 21 04:19:09.802348 sonic INFO teamd#supervisord 2022-01-21 04:19:09,801 INFO reaped unknown pid 649 (exit status 0)
  Jan 21 04:19:09.843253 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel1111

Steps to Reproduce

root@sonic:/home/admin# sudo config portchannel del PortChannel1111

admin@sonic:~$ redis-cli psubscribe "__keyspace@6__:*LAG*"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel1111"
4) "hset"

Message Contents: 
{{first = "admin_status", second = "down"}, 
{first = "oper_status", second = "down"}, 
{first = "mtu", second = "9100"}, 
{first = "state", second = "ok"}}}}

An Example kvpairs when an actual config portchannel add is called
{{first = "admin_status", second = "up"}, 
{first = "oper_status", second = "down"}, 
{first = "mtu", second = "9100"}, 
{first = "state", second = "ok"}, 
{first = "team_device.ifinfo.ifindex", second = "175"}, 
{first = "setup.kernel_team_mode_name", second = "loadbalance"}, 
{first = "runner.active", second = "true"}, 
{first = "runner.fast_rate", second = "false"}, 
{first = "runner.fallback", second = "false"}, {first = "setup.pid", second = "1335"},
{first = "team_device.ifinfo.dev_addr", second = "1c:34:da:1c:a0:00"}

1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel1111"
4) "del"

This extra notification is probably because there are two netlink msgs generated when the portchannel is deleted and teamsyncd is writing them to state_db, causing two notifications

This is causing an error log seen for tlm_teamd

judyjoseph commented 2 years ago

@vivekreddynv Will take a look, are you seeing this on master or any specific release branch

vivekrnv commented 2 years ago

@vivekreddynv Will take a look, are you seeing this on master or any specific release branch

I've seen this on master.

judyjoseph commented 2 years ago

@vivekreddynv -- I don't find this additional SET before the DEL on 2 lab devices different platform running master image - I will give it a try again in 2700. Do you see this every time ?

admin@str2--acs-03:~$ sudo config portchannel member del PortChannel105 Ethernet8 admin@str2--acs-03:~$ sudo config portchannel member del PortChannel105 Ethernet12 admin@str2--acs-03:~$ sudo config portchannel del PortChannel105 admin@str2--acs-03:~$

1) "pmessage" 2) "keyspace@6:LAG" 3) "keyspace@6:LAG_MEMBER_TABLE|PortChannel105|Ethernet8" 4) "del" 1) "pmessage" 2) "keyspace@6:LAG" 3) "keyspace@6:LAG_MEMBER_TABLE|PortChannel105|Ethernet12" 4) "hset" 1) "pmessage" 2) "keyspace@6:LAG" 3) "keyspace@6:LAG_MEMBER_TABLE|PortChannel105|Ethernet12" 4) "del" 1) "pmessage" 2) "keyspace@6:LAG" 3) "keyspace@6:LAG_TABLE|PortChannel105" 4) "del"

Jan 31 02:35:35.127615 str2--acs-03 NOTICE teamd#teammgrd: :- removeLagMember: Remove Ethernet8 from port channel PortChannel105 Jan 31 02:35:46.009940 str2--acs-03 NOTICE teamd#teammgrd: :- removeLagMember: Remove Ethernet12 from port channel PortChannel105 Jan 31 02:36:06.463556 str2--acs-03 NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel105' has been removed. Jan 31 02:36:06.794032 str2--acs-03 INFO teamd#supervisord 2022-01-31 02:36:06,793 INFO reaped unknown pid 76 (exit status 0) Jan 31 02:36:06.892282 str2--acs-03 NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel105

vivekrnv commented 2 years ago

I've rechecked this on the master image, commit hash: 6100a6c8f91d2d65ee61e9212e73b1cc67537ea1 platform: x86_64-mlnx_msn3420-r0 and i still see the extra keyspace notification

root@sonic:/home/admin# config portchannel add PortChannel0003
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "hset"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "hset"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "hset"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "hset"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "hset"

root@sonic:/home/admin# config portchannel member add PortChannel0003 Ethernet8
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_MEMBER_TABLE|PortChannel0003|Ethernet8"
4) "hset"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_MEMBER_TABLE|PortChannel0003|Ethernet8"
4) "hset"

root@sonic:/home/admin# config portchannel member del PortChannel0003 Ethernet8
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_MEMBER_TABLE|PortChannel0003|Ethernet8"
4) "del"

root@sonic:/home/admin# config portchannel del PortChannel0003
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "hset"
1) "pmessage"
2) "__keyspace@6__:*LAG*"
3) "__keyspace@6__:LAG_TABLE|PortChannel0003"
4) "del"

Syslog:

Jan 31 18:11:41.712157 sonic INFO kernel: [  162.152521] PortChannel0003: Mode changed to "loadbalance"
Jan 31 18:11:41.715750 sonic NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel0003 with teamd
Jan 31 18:11:41.748328 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:1 oper:0 addr:1c:34:da:1c:9f:00 ifindex:148 master:0 type:team
Jan 31 18:11:41.749571 sonic NOTICE teamd#teammgrd: :- setLagAdminStatus: Set port channel PortChannel0003 admin status to up
Jan 31 18:11:41.750594 sonic NOTICE teamd#tlm_teamd: :- try_add_lag: The LAG 'PortChannel0003' has been added.
Jan 31 18:11:41.752131 sonic INFO kernel: [  162.192187] 8021q: adding VLAN 0 to HW filter on device PortChannel0003
Jan 31 18:11:41.755047 sonic NOTICE teamd#teammgrd: :- setLagMtu: Set port channel PortChannel0003 MTU to 9100
Jan 31 18:11:52.874400 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:1 oper:0 addr:1c:34:da:1c:9f:00 ifindex:148 master:0 type:team
Jan 31 18:11:52.942069 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:1 oper:0 addr:1c:34:da:1c:9f:00 ifindex:148 master:0 type:team
Jan 31 18:11:52.944863 sonic INFO kernel: [  173.385662] PortChannel0003: Port device Ethernet8 added
Jan 31 18:11:52.947198 sonic NOTICE swss#orchagent: message repeated 3 times: [ :- updatePortOperStatus: Port PortChannel0003 oper state set from down to down]
Jan 31 18:11:52.947813 sonic NOTICE teamd#teammgrd: :- addLagMember: Add Ethernet8 to port channel PortChannel0003
Jan 31 18:11:52.982479 sonic NOTICE swss#orchagent: :- addLagMember: Add member Ethernet8 to LAG PortChannel0003 lid:2000000000ca1 pid:1000000000682
Jan 31 18:11:52.986068 sonic NOTICE swss#orchagent: :- updatePortOperStatus: Port PortChannel0003 oper state set from down to down
Jan 31 18:12:07.856475 sonic INFO kernel: [  188.298699] PortChannel0003: Port device Ethernet8 removed
Jan 31 18:12:07.856628 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:1 oper:0 addr:1c:34:da:1c:9f:00 ifindex:148 master:0 type:team
Jan 31 18:12:07.861736 sonic NOTICE teamd#teammgrd: :- removeLagMember: Remove Ethernet8 from port channel PortChannel0003
Jan 31 18:12:07.864490 sonic NOTICE swss#orchagent: :- updatePortOperStatus: Port PortChannel0003 oper state set from down to down
Jan 31 18:12:07.901081 sonic NOTICE swss#orchagent: :- removeLagMember: Remove member Ethernet8 from LAG PortChannel0003 lid:2000000000ca1 lmid:1b000000000ca2
Jan 31 18:12:18.251646 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:0 oper:0 addr:1c:34:da:1c:9f:00 ifindex:148 master:0 type:team
Jan 31 18:12:18.253124 sonic NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel0003 admin:0 oper:0 addr:1c:34:da:1c:9f:00 ifindex:148 master:0 type:team
Jan 31 18:12:18.259041 sonic NOTICE swss#orchagent: :- updatePortOperStatus: Port PortChannel0003 oper state set from down to down
Jan 31 18:12:18.261019 sonic ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0003'. Skipping
Jan 31 18:12:18.265432 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0003' has been removed.
Jan 31 18:12:18.268606 sonic NOTICE swss#orchagent: :- removeLag: Remove LAG PortChannel0003 lid:2000000000ca1
Jan 31 18:12:18.752869 sonic NOTICE teamd#teammgrd: :- removeLag: Stop port channel PortChannel0003
dgsudharsan commented 2 years ago

@judyjoseph Can you please provide an update?

judyjoseph commented 2 years ago

@dgsudharsan yes I am now able to repro this. The additional hset is coming is mostly due to an admin down set for Portchannel interface that is getting triggered before interface delete. The CPU usage with master image is high, not sure if that is related - but will have to triage further.

dgsudharsan commented 2 years ago

@judyjoseph Thanks for your response. We are also facing some issues in master/202111 with high CPU. Has there been any suspect? Is the high CPU with teamd or in general

dgsudharsan commented 2 years ago

@judyjoseph Is there any update on this bug?

judyjoseph commented 2 years ago

@dgsudharsan, I could not spend more time on this, but saw this happens even when CPU was not high. I find this is due to interface admin state set to down, before the Portchannel interface is removed. Will need to check if we can avoid this.

zhangyanzhao commented 2 years ago

Can we fix the error message issue? "sonic ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel1111'. Skipping" @judyjoseph

Yuval-Mellanox commented 1 year ago

@zhangyanzhao Let's refresh the discussion on this one.