sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[202205] config reload doesn't remove all the lags in a scaled configuration #19310

Open vivekrnv opened 3 weeks ago

vivekrnv commented 3 weeks ago

Description

On a system with a scaled configuration of ports and portchannels, config reload doesn't remove all the portchannel netdevs and ASIC objects.

Steps to reproduce the issue:

  1. Easiest way to repro would be to run the ecmp/inner_hashing/test_inner_hashing_lag.py test
  2. This test configures 88 LAGs + 4 (from the default t0 config) on the D112C8 SKU (so it has 120 ports) and does a config reload at the end. After the test finishes, some LAGs that are not present in CONFIG_DB are seen.
     Issue is seen in hash: https://github.com/sonic-net/sonic-buildimage/commit/442fe3e7b405725727c8f04e4e383c8c84775ea6
     Not reproduced in the following hash: https://github.com/sonic-net/sonic-buildimage/commit/58391e3d23ea328dbad97562deed50e855fef122
/usr/local/bin/py.test ecmp/inner_hashing/test_inner_hashing_lag.py --inventory="../ansible/inventory,../ansible/veos" --host-pattern r-tigris-25 --module-path ../ansible/library/ --testbed r-tigris-25-t0-120 --testbed_file ../ansible/testbed.csv --allow_recover --session_id 8704656 --mars_key_id 0.10.1.1.15.1.1.2.2.1.1 --junit-xml junit_8704656_0.10.1.1.15.1.1.2.2.1.1.xml --assert plain --log-cli-level debug --show-capture=no -ra --showlocals --clean-alluredir -k="test_inner_hashing[ipv4-ipv6]"

Describe the results you received:

After config reload, PortChannel101-104 are written from CONFIG_DB; the rest are netdevs created by teamsyncd.

Jun 10 06:50:09.959223 r-tigris-25 INFO python[117618]: ansible-command Invoked with executable=/bin/bash _uses_shell=True _raw_params=config reload -y -f
2024-06-10.03:52:19.548657|LAG_TABLE:PortChannel65|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548679|LAG_TABLE:PortChannel6|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548688|LAG_TABLE:PortChannel59|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548695|LAG_TABLE:PortChannel103|SET|mtu:9100|tpid:0x8100|admin_status:up|oper_status:down
2024-06-10.03:52:19.548702|LAG_TABLE:PortChannel66|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548710|LAG_TABLE:PortChannel82|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548717|LAG_TABLE:PortChannel79|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548724|LAG_TABLE:PortChannel76|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548731|LAG_TABLE:PortChannel104|SET|mtu:9100|tpid:0x8100|admin_status:up|oper_status:down
2024-06-10.03:52:19.548738|LAG_TABLE:PortChannel69|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548745|LAG_TABLE:PortChannel77|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548752|LAG_TABLE:PortChannel8|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548758|LAG_TABLE:PortChannel81|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548765|LAG_TABLE:PortChannel7|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548772|LAG_TABLE:PortChannel78|SET|admin_status:up|oper_status:up|mtu:9100
2024-06-10.03:52:19.548779|LAG_TABLE:PortChannel68|SET|admin_status:up|oper_status:up|mtu:9100
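To make the mismatch easier to spot in a long record file, the entries above can be filtered against the set of LAGs that are expected from CONFIG_DB. The following is only a rough sketch: the helper name `stale_lags` is hypothetical, the line format is inferred from the excerpt above, and the configured set (PortChannel101-104) is taken from this report rather than read from the device.

```python
import re

# Matches record lines such as
# 2024-06-10.03:52:19.548657|LAG_TABLE:PortChannel65|SET|admin_status:up|...
# (format inferred from the excerpt in this report)
LAG_SET = re.compile(r"\|LAG_TABLE:(PortChannel\d+)\|SET\|")


def stale_lags(record_lines, configured):
    """Return LAG names that were SET in LAG_TABLE but are not in `configured`."""
    seen = set()
    for line in record_lines:
        match = LAG_SET.search(line)
        if match:
            seen.add(match.group(1))
    leftover = seen - set(configured)
    # Sort numerically so PortChannel6 comes before PortChannel65.
    return sorted(leftover, key=lambda name: int(name[len("PortChannel"):]))


# Assumption: per this report, only PortChannel101-104 exist in CONFIG_DB.
configured = {f"PortChannel{i}" for i in range(101, 105)}

sample = [
    "2024-06-10.03:52:19.548657|LAG_TABLE:PortChannel65|SET|admin_status:up|oper_status:up|mtu:9100",
    "2024-06-10.03:52:19.548679|LAG_TABLE:PortChannel6|SET|admin_status:up|oper_status:up|mtu:9100",
    "2024-06-10.03:52:19.548695|LAG_TABLE:PortChannel103|SET|mtu:9100|tpid:0x8100|admin_status:up|oper_status:down",
]

print(stale_lags(sample, configured))
```

On a real system the configured set would come from CONFIG_DB instead of being hard-coded.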

Logs like these are seen in syslog:

Jun 10 06:50:16.884985 r-tigris-25 WARNING kernel: [ 5992.778730] PortChannel61: Failed to send options change via netlink (err -105)
Jun 10 06:50:16.884997 r-tigris-25 WARNING kernel: [ 5992.778904] PortChannel61: Failed to send port change of device Ethernet52 via netlink (err -105)
Jun 10 06:50:16.885000 r-tigris-25 INFO kernel: [ 5992.779028] PortChannel61: Port device Ethernet52 removed
Jun 10 06:50:16.885016 r-tigris-25 WARNING kernel: [ 5992.779670] PortChannel58: Failed to send options change via netlink (err -105)
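For triage, it may help to count how many of these kernel warnings each portchannel produced. On Linux, errno 105 is ENOBUFS ("No buffer space available"); whether that is actually the cause of the leftover LAGs is not established in this thread. The sketch below is an illustration only: the helper name `netlink_failures` is hypothetical and the regular expression is inferred from the four lines quoted above.

```python
import re
from collections import Counter

# Matches kernel warnings such as
# ... kernel: [ 5992.778730] PortChannel61: Failed to send options change via netlink (err -105)
# (pattern inferred from the syslog excerpt in this report)
NETLINK_FAIL = re.compile(
    r"(PortChannel\d+): Failed to send .* via netlink \(err (-?\d+)\)"
)


def netlink_failures(syslog_lines):
    """Count netlink send failures per (portchannel, errno) pair."""
    counts = Counter()
    for line in syslog_lines:
        match = NETLINK_FAIL.search(line)
        if match:
            # The kernel logs the error as a negative number; store it as positive.
            counts[(match.group(1), abs(int(match.group(2))))] += 1
    return counts


syslog_sample = [
    "Jun 10 06:50:16.884985 r-tigris-25 WARNING kernel: [ 5992.778730] PortChannel61: Failed to send options change via netlink (err -105)",
    "Jun 10 06:50:16.884997 r-tigris-25 WARNING kernel: [ 5992.778904] PortChannel61: Failed to send port change of device Ethernet52 via netlink (err -105)",
    "Jun 10 06:50:16.885000 r-tigris-25 INFO kernel: [ 5992.779028] PortChannel61: Port device Ethernet52 removed",
    "Jun 10 06:50:16.885016 r-tigris-25 WARNING kernel: [ 5992.779670] PortChannel58: Failed to send options change via netlink (err -105)",
]

for (lag, err), count in sorted(netlink_failures(syslog_sample).items()):
    print(lag, err, count)
```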

Describe the results you expected:

No LAGs without configuration should be present.

prabhataravind commented 2 weeks ago

@vivekrnv is this seen only on 202205? How about 202305 or master?

prabhataravind commented 2 weeks ago

@yxieca could you please have someone help with this issue?

yxieca commented 2 weeks ago

yinxi@yinxi-vm0:~/src/sonic-202311$ git hist 58391e3..442fe3e

yxieca commented 2 weeks ago

@saiarcot895 the change list between the named hashes doesn't have an apparent answer. Can you take a look?

AntonHryshchuk commented 1 week ago

> @vivekrnv is this seen only on 202205? How about 202305 or master?

We saw it also on 202305.

dgsudharsan commented 6 days ago

Issue is seen in 202311 as well