
[fast-reboot] Long fast reboot downtime in asynchronous mode #8569

Open shi-su opened 3 years ago

shi-su commented 3 years ago

Description

Fast reboot experiences a long downtime (~250s) in asynchronous mode.

Steps to reproduce the issue:

  1. Disable synchronous mode (see the sketch after this list)
  2. Run fast reboot test
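
For reference, a minimal sketch of step 1 (assuming the DEVICE_METADATA `synchronous_mode` field used by 202012-era images; where available, `sudo config synchronous-mode disable` is the CLI equivalent):

```python
#!/usr/bin/env python3
# Sketch: put orchagent/sairedis into asynchronous mode via CONFIG_DB.
# The "synchronous_mode" field name is taken from 202012-era images and may
# differ on other branches; swss typically needs a restart to pick it up.
from swsssdk import ConfigDBConnector

db = ConfigDBConnector()
db.connect()
db.mod_entry('DEVICE_METADATA', 'localhost', {'synchronous_mode': 'disable'})
```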

Describe the results you received:

Downtime of ~250s exceeds the 30-second limit.

Describe the results you expected:

Downtime within 30 seconds.

Output of show version:

202012.11 or later

Output of show techsupport:

(paste your output here or download and attach the file here)

Additional information you deem important (e.g. issue happens only occasionally):

The issue is currently seen on Broadcom platforms.

kcudnik commented 3 years ago

Can you share logs showing what happens during that 250s gap? Full sairedis.rec and syslog, please.

shi-su commented 3 years ago

> Can you share logs showing what happens during that 250s gap? Full sairedis.rec and syslog, please.

Sairedis recording and syslog are attached: sairedis.rec.txt syslog.1.txt syslog.txt

kcudnik commented 3 years ago

From your logs:

2021-07-23.22:57:27.269997|a|INIT_VIEW
2021-07-23.22:57:31.507361|A|SAI_STATUS_SUCCESS
...
2021-07-23.22:57:40.925386|a|APPLY_VIEW
2021-07-23.22:57:40.926278|A|SAI_STATUS_SUCCESS

The logs say it took about 13 seconds from INIT_VIEW to APPLY_VIEW, which is really fast. I'm not sure where the 250-second gap comes from. Can you point to the timestamps in syslog/sairedis where you observed it?

kcudnik commented 3 years ago

I also see a lot of logs like brcm_sai_create_fdb_entry and brcm_sai_nbr_mac_db_update on different MACs, a couple hundred per second. Not sure if that is by test design or a Broadcom SAI bug.
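
A minimal sketch for quantifying that rate from the syslog (assuming the `Jul 23 22:57:31` timestamp prefix seen in these logs; the file name is illustrative):

```python
#!/usr/bin/env python3
# Sketch: bucket the brcm_sai FDB log lines per second to quantify the
# "couple hundred per second" rate.
import collections
import re

pattern = re.compile(r'brcm_sai_(create_fdb_entry|nbr_mac_db_update)')
per_second = collections.Counter()

with open('syslog') as log:
    for line in log:
        if pattern.search(line):
            per_second[line[:15]] += 1  # "Jul 23 22:57:31" second bucket

for second, count in sorted(per_second.items()):
    print(second, count)
```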

kcudnik commented 3 years ago

You didn't provide the right logs:

Jul 23 22:57:31.506839 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- processNotifySyncd: very first run is TRUE, op = INIT_VIEW

This line shows that this is a first cold boot, not a warm boot.

shi-su commented 3 years ago

> I also see a lot of logs like brcm_sai_create_fdb_entry and brcm_sai_nbr_mac_db_update on different MACs, a couple hundred per second. Not sure if that is by test design or a Broadcom SAI bug.

> From your logs:
>
> 2021-07-23.22:57:27.269997|a|INIT_VIEW
> 2021-07-23.22:57:31.507361|A|SAI_STATUS_SUCCESS
> ...
> 2021-07-23.22:57:40.925386|a|APPLY_VIEW
> 2021-07-23.22:57:40.926278|A|SAI_STATUS_SUCCESS
>
> The logs say it took about 13 seconds from INIT_VIEW to APPLY_VIEW, which is really fast. I'm not sure where the 250-second gap comes from. Can you point to the timestamps in syslog/sairedis where you observed it?

This is actually a fast reboot: the dataplane goes down at create switch and comes back up around the time the default route is programmed. This round is a bit faster, around 180s.

2021-07-23.22:57:31.510955|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x5601a3a9ee50|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x5601a3a9ee60|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x5601a3a9ee70|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=94:8E:D3:05:34:E4

...

2021-07-23.23:00:37.692568|S|SAI_OBJECT_TYPE_ROUTE_ENTRY||{"dest":"0.0.0.0/0","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000024"}|SAI_ROUTE_ENTRY_ATTR_PACKET_ACTION=SAI_PACKET_ACTION_FORWARD||{"dest":"0.0.0.0/0","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000024"}|SAI_ROUTE_ENTRY_ATTR_NEXT_HOP_ID=oid:0x40000000008d9
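
For reference, a minimal sketch of that measurement (assuming the sairedis.rec timestamp format shown above; the file name and record matching are illustrative, not official tooling):

```python
#!/usr/bin/env python3
# Sketch: measure the dataplane gap described above, from the create-switch
# record to the default-route SET record in sairedis.rec.
from datetime import datetime

FMT = '%Y-%m-%d.%H:%M:%S.%f'

def stamp(line):
    return datetime.strptime(line.split('|', 1)[0], FMT)

start = end = None
with open('sairedis.rec') as rec:
    for line in rec:
        if '|c|SAI_OBJECT_TYPE_SWITCH' in line and 'SAI_SWITCH_ATTR_INIT_SWITCH=true' in line:
            start = stamp(line)
        elif '"dest":"0.0.0.0/0"' in line and 'SAI_PACKET_ACTION_FORWARD' in line:
            end = stamp(line)  # keep the last matching default-route record

if start and end:
    print('dataplane gap: %.1f s' % (end - start).total_seconds())
```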
shi-su commented 3 years ago

> You didn't provide the right logs:
>
> Jul 23 22:57:31.506839 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- processNotifySyncd: very first run is TRUE, op = INIT_VIEW
>
> This line shows that this is a first cold boot, not a warm boot.

This is a fast reboot; it basically behaves like a cold boot with a few things pre-configured.

kcudnik commented 3 years ago

There are a lot of FDB learned/aged events:

2021-07-23.22:58:51.754371|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"00:06:07:08:09:0A\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_AGED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a00000000065d"}]}]|
2021-07-23.22:58:51.754733|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"00:06:07:08:09:0A\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a00000000065b"}]}]|

Not sure if that's expected, but it should have no impact. This is the biggest gap:

2021-07-23.22:57:45.714704|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd00000000062c|SAI_HOSTIF_ATTR_OPER_STATUS=true
2021-07-23.23:00:37.605004|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063a|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false

I think this is the same issue as in the mlnx2700 boot case; look at where the LAG was enabled:

2021-07-23.23:00:37.605004|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063a|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.605104|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063a|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
2021-07-23.23:00:37.616288|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063b|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.616577|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063b|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
2021-07-23.23:00:37.616764|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063d|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.616878|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063d|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
2021-07-23.23:00:37.636698|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063c|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.636818|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063c|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false

Only after that does the default route start to be in forward state.
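
A minimal sketch of how to spot such a stall mechanically (assuming every sairedis.rec record starts with the timestamp format shown above; the file name is illustrative):

```python
#!/usr/bin/env python3
# Sketch: find the largest gap between consecutive sairedis.rec records; a
# stall like the HOSTIF -> LAG_MEMBER one above shows up as the winner.
from datetime import datetime

FMT = '%Y-%m-%d.%H:%M:%S.%f'
prev = None
worst = (0.0, '', '')

with open('sairedis.rec') as rec:
    for line in rec:
        try:
            now = datetime.strptime(line.split('|', 1)[0], FMT)
        except ValueError:
            continue  # skip lines without a leading timestamp
        if prev:
            gap = (now - prev[0]).total_seconds()
            if gap > worst[0]:
                worst = (gap, prev[1], line)
        prev = (now, line)

print('largest gap: %.1f s' % worst[0])
print('before:', worst[1].strip())
print('after: ', worst[2].strip())
```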

Take a look here at when portsyncd receives the port up notifications:

Jul 23 23:00:35.563537 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet12 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:16 master:9
...
Jul 23 23:00:37.603230 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet112 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:42 master:5
Jul 23 23:00:37.603407 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0001 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:5 master:0 type:team
Jul 23 23:00:37.612179 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet124 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:44 master:8
Jul 23 23:00:37.612179 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0004 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:8 master:0 type:team
Jul 23 23:00:37.613111 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet116 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:41 master:6
Jul 23 23:00:37.613111 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0002 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:6 master:0 type:team
Jul 23 23:00:37.638433 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet120 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:43 master:7
Jul 23 23:00:37.638540 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:7 master:0 type:team

And the timestamps almost exactly match when we receive the corresponding events from the kernel here:

Jul 23 23:00:35.563700 str2-7050cx3-acs-01 INFO kernel: [  225.116416] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet12: link becomes ready
Jul 23 23:00:35.563713 str2-7050cx3-acs-01 INFO kernel: [  225.116525] Bridge: port 2(Ethernet12) entered blocking state
Jul 23 23:00:35.563717 str2-7050cx3-acs-01 INFO kernel: [  225.116533] Bridge: port 2(Ethernet12) entered forwarding state
Jul 23 23:00:35.563721 str2-7050cx3-acs-01 INFO kernel: [  225.118674] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet20: link becomes ready
...
Jul 23 23:00:35.576510 str2-7050cx3-acs-01 INFO kernel: [  225.128205] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet112: link becomes ready
Jul 23 23:00:35.576544 str2-7050cx3-acs-01 INFO kernel: [  225.128490] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet120: link becomes ready
Jul 23 23:00:35.576548 str2-7050cx3-acs-01 INFO kernel: [  225.128757] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet124: link becomes ready
Jul 23 23:00:37.603906 str2-7050cx3-acs-01 INFO kernel: [  227.159003] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0001: link becomes ready
Jul 23 23:00:37.611873 str2-7050cx3-acs-01 INFO kernel: [  227.167020] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0004: link becomes ready
Jul 23 23:00:37.615894 str2-7050cx3-acs-01 INFO kernel: [  227.168615] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0002: link becomes ready
Jul 23 23:00:37.635696 str2-7050cx3-acs-01 INFO kernel: [  227.190319] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0003: link becomes ready

It seems like we are waiting for the kernel to bring those ports up. Why? No clue.
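
For context, portsyncd learns oper state from rtnetlink. A minimal standalone sketch (Linux-only, assumptions noted in the comments) that timestamps the same RTM_NEWLINK messages, useful to check whether the kernel itself reports the links late:

```python
#!/usr/bin/env python3
# Sketch: timestamp RTM_NEWLINK messages (nlmsg type 16 in the portsyncd logs
# above) straight from rtnetlink. Only the netlink header is decoded here; a
# real consumer like portsyncd also parses the ifinfomsg payload and IFLA_*
# attributes to get the interface name and oper status.
import socket
import struct
from datetime import datetime

RTMGRP_LINK = 1   # link-state multicast group
RTM_NEWLINK = 16

sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
sock.bind((0, RTMGRP_LINK))

while True:
    data = sock.recv(65535)
    while len(data) >= 16:  # nlmsghdr is 16 bytes
        length, msg_type = struct.unpack_from('=IH', data)
        if length < 16:
            break
        if msg_type == RTM_NEWLINK:
            print(datetime.now().isoformat(), 'RTM_NEWLINK (%d bytes)' % length)
        data = data[length:]
```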

shi-su commented 3 years ago

I tried disabling the flex counter part and running fast reboot in async mode. The buffer pool clear-stats and queue operations are no longer there; however, the LAG still comes up late. So the next step is to investigate why there is a delay in the LAG.

I also noticed that with flex counters disabled, the test passed once with ~20s downtime (while other attempts failed with >200s downtime). So there seems to be a race condition that can stall the LAG configuration.
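
For reference, one way to disable flex counter polling for an experiment like this (a sketch assuming the common FLEX_COUNTER_TABLE schema; how the experiment above actually disabled it is not shown, and `counterpoll <group> disable` is the CLI route where available):

```python
#!/usr/bin/env python3
# Sketch: disable the flex counter polling groups via CONFIG_DB. The group
# names and the FLEX_COUNTER_STATUS field follow the common schema and may
# vary by branch.
from swsssdk import ConfigDBConnector

db = ConfigDBConnector()
db.connect()
for group in ('PORT', 'QUEUE', 'PG_WATERMARK', 'QUEUE_WATERMARK',
              'BUFFER_POOL_WATERMARK'):
    db.mod_entry('FLEX_COUNTER_TABLE', group, {'FLEX_COUNTER_STATUS': 'disable'})
```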

kcudnik commented 3 years ago

But the problem seems to be that the notifications are arriving late from the kernel to portsyncd.

shi-su commented 3 years ago

> But the problem seems to be that the notifications are arriving late from the kernel to portsyncd.

I understand your point. The interesting part is that it works fine in sync mode. The missing piece of the puzzle is how sync/async mode could affect this behavior.

kcudnik commented 3 years ago

Why would sync/async mode affect portsyncd at all?

shi-su commented 3 years ago

> Why would sync/async mode affect portsyncd at all?

I do not see any connection either. However, the results show a clear difference: async mode fails with the issue, while sync mode works fine.

kcudnik commented 3 years ago

Can you reproduce this 100% of the time? If yes, then you can narrow it down.

shi-su commented 3 years ago

> Can you reproduce this 100% of the time? If yes, then you can narrow it down.

It is probably not 100%, but it should be more than 80% of the time.

kcudnik commented 3 years ago

So if it is not 100%, then there is also something else involved.