sonic-net / sonic-swss

SONiC Switch State Service (SwSS)
https://azure.github.io/SONiC

[warm-reboot] lag mtu was deleted from app_db after system warm-reboot #888

Open leoli-nps opened 5 years ago

leoli-nps commented 5 years ago

<1> Topology

```
SW1                  SW2
Ethernet121 -------- Ethernet121
Ethernet122 -------- Ethernet122
Ethernet123 -------- Ethernet123
```

<2> Config

```
SW1:
"PORTCHANNEL": {
    "PortChannel0001": {
        "admin_status": "up",
        "mtu": "9100"
    }
},
"PORTCHANNEL_MEMBER": {
    "PortChannel0001|Ethernet121": {},
    "PortChannel0001|Ethernet122": {},
    "PortChannel0001|Ethernet123": {}
}

SW2:
"PORTCHANNEL": {
    "PortChannel0001": {
        "admin_status": "up",
        "mtu": "9100"
    }
},
"PORTCHANNEL_MEMBER": {
    "PortChannel0001|Ethernet121": {},
    "PortChannel0001|Ethernet122": {},
    "PortChannel0001|Ethernet123": {}
}
```

<3> Get information about PortChannel0001 in APP_DB before warm-reboot

```
admin@sonic:~$ show interfaces portchannel
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available, S - selected, D - deselected
  No.  Team Dev         Protocol     Ports
-----  ---------------  -----------  --------------------------------------------
 0001  PortChannel0001  LACP(A)(Up)  Ethernet123(S) Ethernet122(S) Ethernet121(S)
admin@sonic:~$ redis-cli -n 4 hgetall "PORTCHANNEL|PortChannel0001"
1) "admin_status"
2) "up"
3) "mtu"
4) "9100"
admin@sonic:~$ redis-cli -n 0 hgetall "LAG_TABLE:PortChannel0001"
1) "admin_status"
2) "up"
3) "oper_status"
4) "up"
5) "mtu"
6) "9100"
admin@sonic:~$
```

<4> Execute command `sudo warm-reboot`

```
admin@sonic:~$ sudo warm-reboot
```

<5> Get information about PortChannel0001 in APP_DB after warm-reboot

```
admin@sonic:~$ show interfaces portchannel
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available, S - selected, D - deselected
  No.  Team Dev         Protocol     Ports
-----  ---------------  -----------  --------------------------------------------
 0001  PortChannel0001  LACP(A)(Up)  Ethernet123(S) Ethernet122(S) Ethernet121(S)
admin@sonic:~$ redis-cli -n 4 hgetall "PORTCHANNEL|PortChannel0001"
1) "admin_status"
2) "up"
3) "mtu"
4) "9100"
admin@sonic:~$ redis-cli -n 0 hgetall "LAG_TABLE:PortChannel0001"
1) "admin_status"
2) "up"
3) "oper_status"
4) "up"
admin@sonic:~$
```

<6> Debug

For an initial period after warm-reboot, the LAG mtu still exists in app_db, but after about a minute it is gone. The swss.rec file shows the following:

```
admin@sonic:~$ sudo grep LAG_TABLE:PortChannel0001 /var/log/swss/swss.rec
2019-05-13.11:26:14.466114|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:down
2019-05-13.11:26:14.483630|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:down
2019-05-13.11:26:14.503710|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:down
2019-05-13.11:26:14.694809|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up
2019-05-13.11:27:57.451577|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up
2019-05-13.11:27:57.451758|LAG_TABLE:PortChannel0001|SET|mtu:9100
2019-05-13.11:30:33.269369|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up|mtu:9100
2019-05-13.11:30:34.038090|LAG_TABLE:PortChannel0001|SET|mtu:9100
2019-05-13.11:31:34.401449|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up
admin@sonic:~$
```

**show version**

```
admin@sonic:~$ show version
SONiC Software Version: SONiC.origin_201811.0-dirty-20190418.223441
Distribution: Debian 9.8
Kernel: 4.9.0-8-amd64
Build commit: 051bb23
Build date: Fri Apr 19 06:33:08 UTC 2019
Built by: simon@nps65

Docker images:
REPOSITORY                TAG                                    IMAGE ID      SIZE
docker-syncd-nephos       latest                                 1c3500846360  326MB
docker-syncd-nephos       origin_201811.0-dirty-20190418.223441  1c3500846360  326MB
docker-orchagent-nephos   latest                                 f9c367fb5fc5  368MB
docker-orchagent-nephos   origin_201811.0-dirty-20190418.223441  f9c367fb5fc5  368MB
docker-teamd              latest                                 8a6898e1dfa7  353MB
docker-teamd              origin_201811.0-dirty-20190418.223441  8a6898e1dfa7  353MB
docker-fpm-quagga         latest                                 de4a2a321623  372MB
docker-fpm-quagga         origin_201811.0-dirty-20190418.223441  de4a2a321623  372MB
docker-lldp-sv2           latest                                 7c53844507f0  294MB
docker-lldp-sv2           origin_201811.0-dirty-20190418.223441  7c53844507f0  294MB
docker-dhcp-relay         latest                                 903f08df67cf  258MB
docker-dhcp-relay         origin_201811.0-dirty-20190418.223441  903f08df67cf  258MB
docker-database           latest                                 2b048aa0fe97  255MB
docker-database           origin_201811.0-dirty-20190418.223441  2b048aa0fe97  255MB
docker-snmp-sv2           latest                                 b42a83fc56f8  330MB
docker-snmp-sv2           origin_201811.0-dirty-20190418.223441  b42a83fc56f8  330MB
docker-router-advertiser  latest                                 b6b8150e559a  254MB
docker-router-advertiser  origin_201811.0-dirty-20190418.223441  b6b8150e559a  254MB
docker-platform-monitor   latest                                 f8442c4d55a8  297MB
docker-platform-monitor   origin_201811.0-dirty-20190418.223441  f8442c4d55a8  297MB
admin@sonic:~$
```

**Attach debug file `sudo generate_dump`:**

[sonic_dump_sonic_20190513_114008.tar.gz](https://github.com/Azure/sonic-swss/files/3172672/sonic_dump_sonic_20190513_114008.tar.gz)

Signed-off-by: leo.li leo.li@nephosinc.com

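The swss.rec excerpt in the Debug step can also be checked mechanically. The following is a minimal sketch, assuming each record has the layout `timestamp|key|op|field:value|...` as inferred from the excerpt alone; the function names are hypothetical and not part of SONiC. It lists which `SET` records for a key arrived without a given field, using the last three records from the report.

```python
from typing import Dict, List, Tuple

# (timestamp, key, operation, fields); layout inferred from the excerpt only.
Record = Tuple[str, str, str, Dict[str, str]]

# Last three records copied from the swss.rec excerpt in this report.
SWSS_REC_EXCERPT = """\
2019-05-13.11:30:33.269369|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up|mtu:9100
2019-05-13.11:30:34.038090|LAG_TABLE:PortChannel0001|SET|mtu:9100
2019-05-13.11:31:34.401449|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up
"""


def parse_record(line: str) -> Record:
    """Split one record on '|' and turn the 'field:value' tail into a dict."""
    timestamp, key, op, *pairs = line.strip().split("|")
    fields = dict(pair.split(":", 1) for pair in pairs)
    return timestamp, key, op, fields


def set_records(text: str, key: str) -> List[Record]:
    """Return all SET records for the given key, in file order."""
    records = [parse_record(line) for line in text.splitlines() if line.strip()]
    return [rec for rec in records if rec[1] == key and rec[2] == "SET"]


def records_missing_field(text: str, key: str, field: str) -> List[str]:
    """Return timestamps of SET records for key that do not carry field."""
    return [ts for ts, _, _, fields in set_records(text, key) if field not in fields]


if __name__ == "__main__":
    key = "LAG_TABLE:PortChannel0001"
    for ts, _, _, fields in set_records(SWSS_REC_EXCERPT, key):
        print(ts, sorted(fields))
    print("SETs without mtu:", records_missing_field(SWSS_REC_EXCERPT, key, "mtu"))
```

On this excerpt, the only record without `mtu` is the one at 11:31:34, about a minute after the last record that carried it, which matches the timing described in the Debug step.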
prsunny commented 5 years ago

This is a known issue that may happen during bootup. It is supposed to be fixed as part of https://github.com/Azure/sonic-buildimage/pull/2829. Can you check if you have this fix?

leoli-nps commented 5 years ago

@prsunny Thank you for your reply. I checked, and we have not merged this fix yet. However, I made the corresponding changes directly on the device, as follows:

admin@sonic:~$ cat /etc/systemd/system/teamd.service
[Unit]
Description=TEAMD container
Requires=updategraph.service
After=updategraph.service swss.service
Before=ntp-config.service

[Service]
User=admin
ExecStartPre=/usr/bin/teamd.sh start
ExecStart=/usr/bin/teamd.sh wait
ExecStop=/usr/bin/teamd.sh stop

[Install]
WantedBy=multi-user.target
admin@sonic:~$

I then executed warm-reboot, but the behavior is still the same as described above, so I think these are two different issues.

I also checked the teamsyncd code. During warm-reboot, the LAG information from the kernel is written to m_tempViewState instead of APP_DB, and applyState() is executed after 70 seconds (DEFAULT_WR_PENDING_TIMEOUT). However, the only fields currently synchronized from the kernel are admin_status and oper_status, not mtu, so I think the problem may be here. Hope this helps.
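The hypothesis in this comment can be illustrated with a toy model. This is only a sketch, not the teamsyncd code: the function names are hypothetical, and whether the real applyState() replaces an entry or merges into it is an assumption made here because a replace would match the observed symptom.

```python
from typing import Dict

# Fields that, according to the comment above, are refreshed from the kernel.
KERNEL_SYNCED_FIELDS = ("admin_status", "oper_status")


def refresh_from_kernel(kernel_state: Dict[str, str]) -> Dict[str, str]:
    """Build a temporary-view entry containing only the synced fields."""
    return {f: kernel_state[f] for f in KERNEL_SYNCED_FIELDS if f in kernel_state}


def apply_replace(app_db_entry: Dict[str, str], temp_entry: Dict[str, str]) -> Dict[str, str]:
    """ASSUMPTION: applying the temp view overwrites the whole APP_DB entry."""
    return dict(temp_entry)


def apply_merge(app_db_entry: Dict[str, str], temp_entry: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical alternative: keep fields the temp view does not track."""
    merged = dict(app_db_entry)
    merged.update(temp_entry)
    return merged


if __name__ == "__main__":
    # APP_DB entry restored after warm-reboot, as shown in step <3> of the report.
    restored = {"admin_status": "up", "oper_status": "up", "mtu": "9100"}
    temp = refresh_from_kernel({"admin_status": "up", "oper_status": "up", "mtu": "9100"})
    print("replace:", apply_replace(restored, temp))
    print("merge:  ", apply_merge(restored, temp))
```

Under the replace assumption, `mtu` is lost even though the kernel still reports it, because it never enters the temporary view. Either adding `mtu` to the synced fields or merging instead of replacing would keep it in this model; which one fits the real code is not established here.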

arvindkv-bf commented 4 years ago

@prsunny, @leoli-nps I am also observing the above issue with a SONiC image from the 201911 branch (April 2020). Can you please confirm whether this is fixed? After warm-reboot, the MTU of the LAG rif interface changes from 9100 to the default 1492.

APP_DB before and after warm-reboot

Before:

    "LAG_TABLE:PortChannel101": {
        "type": "hash",
        "value": {
            "admin_status": "up",
            "mtu": "9100",
            "oper_status": "up"
        }
    },
    "LAG_TABLE:PortChannel201": {
        "type": "hash",
        "value": {
            "admin_status": "up",
            "mtu": "9100",
            "oper_status": "up"
        }
    },

After:

    "LAG_TABLE:PortChannel101": {
        "type": "hash",
        "value": {
            "admin_status": "up",
            "oper_status": "up"
        }
    },
    "LAG_TABLE:PortChannel201": {
        "type": "hash",
        "value": {
            "admin_status": "up",
            "oper_status": "up"
        }
    },

Config_DB:

    "PORTCHANNEL|PortChannel101": {
        "type": "hash",
        "value": {
            "admin_status": "up",
            "members@": "Ethernet68",
            "min_links": "1",
            "mtu": "9100"
        }
    },
    "PORTCHANNEL|PortChannel201": {
        "type": "hash",
        "value": {
            "admin_status": "up",
            "members@": "Ethernet252",
            "min_links": "1",
            "mtu": "9100"
        }
    },
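A check like the one done by eye above can be scripted. The sketch below assumes the dump layout shown in this comment (`"type"`/`"value"` per key) and wraps the fragments in `{}` so they parse as JSON; the helper name is hypothetical. It reports port channels whose Config_DB entry has a field that the matching APP_DB `LAG_TABLE` entry lacks.

```python
import json
from typing import Dict, List

# Fragments from this comment, wrapped in braces to form valid JSON.
CONFIG_DB_DUMP = """{
  "PORTCHANNEL|PortChannel101": {"type": "hash", "value": {
    "admin_status": "up", "members@": "Ethernet68", "min_links": "1", "mtu": "9100"}},
  "PORTCHANNEL|PortChannel201": {"type": "hash", "value": {
    "admin_status": "up", "members@": "Ethernet252", "min_links": "1", "mtu": "9100"}}
}"""

APP_DB_DUMP_AFTER = """{
  "LAG_TABLE:PortChannel101": {"type": "hash", "value": {
    "admin_status": "up", "oper_status": "up"}},
  "LAG_TABLE:PortChannel201": {"type": "hash", "value": {
    "admin_status": "up", "oper_status": "up"}}
}"""


def lags_missing_field(config_db: Dict, app_db: Dict, field: str = "mtu") -> List[str]:
    """Return port channels that have `field` in Config_DB but not in APP_DB."""
    missing = []
    for key, entry in config_db.items():
        table, _, name = key.partition("|")
        if table != "PORTCHANNEL" or field not in entry.get("value", {}):
            continue
        app_entry = app_db.get(f"LAG_TABLE:{name}", {})
        if field not in app_entry.get("value", {}):
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    config_db = json.loads(CONFIG_DB_DUMP)
    app_db = json.loads(APP_DB_DUMP_AFTER)
    print(lags_missing_field(config_db, app_db))
```

Run against the "After" dump, both PortChannel101 and PortChannel201 are reported as missing `mtu` in APP_DB.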

    SONiC Software Version: SONiC.201911.470-dirty-20200413.175026
    Distribution: Debian 9.12
    Kernel: 4.9.0-11-2-amd64
    Build commit: d09fba37
    Build date: Tue Apr 14 02:45:44 UTC 2020
    Built by: nd@mavtest2-bxdsw

    Platform: x86_64-accton_wedge100bf_65x-r0
    HwSKU: mavericks
    ASIC: barefoot
    Serial Number: AH47011410
    Uptime: 19:07:37 up 21:04, 4 users, load average: 2.35, 2.30, 2.28

    Docker images:
    REPOSITORY                TAG                               IMAGE ID      SIZE
    docker-syncd-bfn          201911.470-dirty-20200413.175026  67693f0ec154  807MB
    docker-syncd-bfn          latest                            67693f0ec154  807MB
    docker-router-advertiser  201911.470-dirty-20200413.175026  aacf0c7bbe7d  283MB
    docker-router-advertiser  latest                            aacf0c7bbe7d  283MB
    docker-platform-monitor   201911.470-dirty-20200413.175026  9d05be095518  334MB
    docker-platform-monitor   latest                            9d05be095518  334MB
    docker-fpm-frr            201911.470-dirty-20200413.175026  9b20037b8a53  327MB
    docker-fpm-frr            latest                            9b20037b8a53  327MB
    docker-sflow              201911.470-dirty-20200413.175026  988e7952291f  307MB
    docker-sflow              latest                            988e7952291f  307MB
    docker-lldp-sv2           201911.470-dirty-20200413.175026  18da217cfad7  304MB
    docker-lldp-sv2           latest                            18da217cfad7  304MB
    docker-dhcp-relay         201911.470-dirty-20200413.175026  84bf3d863621  293MB
    docker-dhcp-relay         latest                            84bf3d863621  293MB
    docker-database           201911.470-dirty-20200413.175026  b05010a9876e  283MB
    docker-database           latest                            b05010a9876e  283MB
    docker-snmp-sv2           201911.470-dirty-20200413.175026  eadb4ac374ca  340MB
    docker-snmp-sv2           latest                            eadb4ac374ca  340MB
    docker-orchagent          201911.470-dirty-20200413.175026  9cacaacdf877  325MB
    docker-orchagent          latest                            9cacaacdf877  325MB
    docker-teamd              201911.470-dirty-20200413.175026  787bee61d7db  307MB
    docker-teamd              latest                            787bee61d7db  307MB
    docker-nat                201911.470-dirty-20200413.175026  bc381c4411a4  309MB
    docker-nat                latest                            bc381c4411a4  309MB