shi-su opened this issue 3 years ago
can you pass logs what happens during that 250s gap? full sairedis.rec and syslog please
Sairedis recording and syslog attached. sairedis.rec.txt syslog.1.txt syslog.txt
From your logs:
2021-07-23.22:57:27.269997|a|INIT_VIEW
2021-07-23.22:57:31.507361|A|SAI_STATUS_SUCCESS
...
2021-07-23.22:57:40.925386|a|APPLY_VIEW
2021-07-23.22:57:40.926278|A|SAI_STATUS_SUCCESS
the logs say it took 13 seconds from init view to apply view, which is really fast; not sure where you got the 250-second gap. Can you point to timestamps in syslog/sairedis where you observed that?
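The 13-second figure can be checked mechanically from the two record timestamps above; a minimal sketch (timestamps copied from the sairedis.rec excerpt):

```python
from datetime import datetime

# sairedis.rec timestamps look like "2021-07-23.22:57:27.269997"
def parse_ts(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d.%H:%M:%S.%f")

init_view = parse_ts("2021-07-23.22:57:27.269997")   # |a|INIT_VIEW
apply_view = parse_ts("2021-07-23.22:57:40.926278")  # |A|SAI_STATUS_SUCCESS after APPLY_VIEW
print((apply_view - init_view).total_seconds())  # 13.656281
```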
I also see a lot of logs like brcm_sai_create_fdb_entry and brcm_sai_nbr_mac_db_update on different MACs, a couple hundred per second; not sure if that is test design or a Broadcom SAI bug.
you didn't provide the right logs:
Jul 23 22:57:31.506839 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- processNotifySyncd: very first run is TRUE, op = INIT_VIEW
this line shows that this is a first cold boot, not a warm boot
> I also see a lot of logs like brcm_sai_create_fdb_entry and brcm_sai_nbr_mac_db_update on different MACs, a couple hundred per second; not sure if that is test design or a Broadcom SAI bug.
> From your logs:
> 2021-07-23.22:57:27.269997|a|INIT_VIEW
> 2021-07-23.22:57:31.507361|A|SAI_STATUS_SUCCESS
> ...
> 2021-07-23.22:57:40.925386|a|APPLY_VIEW
> 2021-07-23.22:57:40.926278|A|SAI_STATUS_SUCCESS
> the logs say it took 13 seconds from init view to apply view, which is really fast; not sure where you got the 250-second gap. Can you point to timestamps in syslog/sairedis where you observed that?
This is actually fast reboot: the dataplane goes down at create switch and comes back up around when the default route is programmed. This round is a bit faster, around 180s.
2021-07-23.22:57:31.510955|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x5601a3a9ee50|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x5601a3a9ee60|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x5601a3a9ee70|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=94:8E:D3:05:34:E4
...
2021-07-23.23:00:37.692568|S|SAI_OBJECT_TYPE_ROUTE_ENTRY||{"dest":"0.0.0.0/0","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000024"}|SAI_ROUTE_ENTRY_ATTR_PACKET_ACTION=SAI_PACKET_ACTION_FORWARD||{"dest":"0.0.0.0/0","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000024"}|SAI_ROUTE_ENTRY_ATTR_NEXT_HOP_ID=oid:0x40000000008d9
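That "around 180s" window follows directly from the create-switch and default-route timestamps above; a minimal sketch:

```python
from datetime import datetime

# sairedis.rec timestamps look like "2021-07-23.22:57:31.510955"
def parse_ts(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d.%H:%M:%S.%f")

create_switch = parse_ts("2021-07-23.22:57:31.510955")  # SAI_SWITCH_ATTR_INIT_SWITCH=true
default_route = parse_ts("2021-07-23.23:00:37.692568")  # 0.0.0.0/0 set to FORWARD
print(round((default_route - create_switch).total_seconds()))  # 186
```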
> you didn't provide the right logs:
> Jul 23 22:57:31.506839 str2-7050cx3-acs-01 NOTICE syncd#syncd: :- processNotifySyncd: very first run is TRUE, op = INIT_VIEW
> this line shows that this is a first cold boot, not a warm boot
This is fast reboot; it basically behaves like a cold boot with a few things pre-configured.
there are a lot of FDB learned/aged events:
2021-07-23.22:58:51.754371|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"00:06:07:08:09:0A\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_AGED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a00000000065d"}]}]|
2021-07-23.22:58:51.754733|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x2600000000063e\",\"mac\":\"00:06:07:08:09:0A\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a00000000065b"}]}]|
not sure if that's expected, but it should have no impact. This is the biggest gap:
2021-07-23.22:57:45.714704|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd00000000062c|SAI_HOSTIF_ATTR_OPER_STATUS=true
2021-07-23.23:00:37.605004|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063a|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
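One way to find such gaps without eyeballing the file is to scan consecutive sairedis.rec lines for the largest timestamp jump; a sketch, where the two-line `rec` sample (truncated with `...` for brevity) stands in for the full recording:

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # sairedis.rec lines start with a "2021-07-23.22:57:45.714704" timestamp
    return datetime.strptime(ts, "%Y-%m-%d.%H:%M:%S.%f")

def biggest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the largest
    time gap between consecutive sairedis.rec entries."""
    best = (0.0, "", "")
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line.split("|", 1)[0])
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best

rec = [
    "2021-07-23.22:57:45.714704|s|SAI_OBJECT_TYPE_HOSTIF:...",
    "2021-07-23.23:00:37.605004|s|SAI_OBJECT_TYPE_LAG_MEMBER:...",
]
gap, before, after = biggest_gap(rec)
print(round(gap, 1))  # 171.9
```

Run against the whole file, this points at the same hostif-to-lag_member window quoted above.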
i think this is the same issue as in the mlnx2700 boot case; look at where the lag was enabled:
2021-07-23.23:00:37.605004|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063a|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.605104|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063a|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
2021-07-23.23:00:37.616288|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063b|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.616577|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063b|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
2021-07-23.23:00:37.616764|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063d|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.616878|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063d|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
2021-07-23.23:00:37.636698|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063c|SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE=false
2021-07-23.23:00:37.636818|s|SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b00000000063c|SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=false
only after that does the default route start to be in forward state
take a look here where we receive the port up notifications from portsyncd:
Jul 23 23:00:35.563537 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet12 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:16 master:9
...
Jul 23 23:00:37.603230 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet112 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:42 master:5
Jul 23 23:00:37.603407 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0001 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:5 master:0 type:team
Jul 23 23:00:37.612179 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet124 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:44 master:8
Jul 23 23:00:37.612179 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0004 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:8 master:0 type:team
Jul 23 23:00:37.613111 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet116 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:41 master:6
Jul 23 23:00:37.613111 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0002 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:6 master:0 type:team
Jul 23 23:00:37.638433 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet120 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:43 master:7
Jul 23 23:00:37.638540 str2-7050cx3-acs-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel0003 admin:1 oper:1 addr:94:8e:d3:05:34:e4 ifindex:7 master:0 type:team
and the timestamp almost exactly matches when we receive that from the kernel here:
Jul 23 23:00:35.563700 str2-7050cx3-acs-01 INFO kernel: [ 225.116416] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet12: link becomes ready
Jul 23 23:00:35.563713 str2-7050cx3-acs-01 INFO kernel: [ 225.116525] Bridge: port 2(Ethernet12) entered blocking state
Jul 23 23:00:35.563717 str2-7050cx3-acs-01 INFO kernel: [ 225.116533] Bridge: port 2(Ethernet12) entered forwarding state
Jul 23 23:00:35.563721 str2-7050cx3-acs-01 INFO kernel: [ 225.118674] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet20: link becomes ready
...
Jul 23 23:00:35.576510 str2-7050cx3-acs-01 INFO kernel: [ 225.128205] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet112: link becomes ready
Jul 23 23:00:35.576544 str2-7050cx3-acs-01 INFO kernel: [ 225.128490] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet120: link becomes ready
Jul 23 23:00:35.576548 str2-7050cx3-acs-01 INFO kernel: [ 225.128757] IPv6: ADDRCONF(NETDEV_CHANGE): Ethernet124: link becomes ready
Jul 23 23:00:37.603906 str2-7050cx3-acs-01 INFO kernel: [ 227.159003] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0001: link becomes ready
Jul 23 23:00:37.611873 str2-7050cx3-acs-01 INFO kernel: [ 227.167020] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0004: link becomes ready
Jul 23 23:00:37.615894 str2-7050cx3-acs-01 INFO kernel: [ 227.168615] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0002: link becomes ready
Jul 23 23:00:37.635696 str2-7050cx3-acs-01 INFO kernel: [ 227.190319] IPv6: ADDRCONF(NETDEV_CHANGE): PortChannel0003: link becomes ready
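How closely the portsyncd and kernel lines line up can be computed from the syslog timestamps; a sketch using the PortChannel0001 pair above (syslog lines carry no year, so 2021 is assumed):

```python
from datetime import datetime

def parse_syslog_ts(ts: str, year: int = 2021) -> datetime:
    # syslog timestamps look like "Jul 23 23:00:37.603906"
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S.%f")

portsyncd = parse_syslog_ts("Jul 23 23:00:37.603407")  # onMsg ... key:PortChannel0001
kernel = parse_syslog_ts("Jul 23 23:00:37.603906")     # ADDRCONF ... PortChannel0001
delta_ms = (kernel - portsyncd).total_seconds() * 1000
print(f"{delta_ms:.3f} ms")  # sub-millisecond
```

The sub-millisecond delta supports the point that portsyncd reacts immediately; the delay is upstream of it, in when the kernel brings the interfaces up.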
seems like we are waiting for the kernel to bring those ports up; why, no clue
I tried disabling the flex counter part and running fast reboot in async mode. The buffer pool clear-stats and queue operations are no longer there; however, the lag still comes up late. So the next step is to investigate why the lag is delayed.
I also noticed that with the flex counter disabled, the test passed with ~20s downtime once (while other attempts failed with >200s downtime). So there seems to be a race condition which may get the lag configuration stuck.
but problem seems to be that notifications are coming from kernel late, to portsyncd
I understand your point. The interesting part here is that it works fine in sync mode. I think the missing piece of the puzzle is how sync/async mode could affect these behaviors.
why would sync/async mode affect portsyncd at all?
I do not think there is any connection either. However, the results show a difference: async mode fails with the issue, sync mode works fine.
can you reproduce this 100% of the time? if yes, then you can narrow this down
It is probably not 100%, but it should be more than 80% of the time.
so if not, then something else is also involved
Description
Fast reboot experiences a long downtime (~250s) in asynchronous mode.
Steps to reproduce the issue:
Describe the results you received:
Downtime of ~250s exceeds the 30-second limit.
Describe the results you expected:
Downtime within 30 seconds.
Output of show version: 202012.11 or later
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
The issue is currently seen on Broadcom platforms.