sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
185 stars 704 forks source link

[link_flap/test_cont_link_flap.py] system could be not reaching stable status while BGP converged. #10955

Open keboliu opened 9 months ago

keboliu commented 9 months ago

Description

This test case assumes that after all route entries are re-learned and BGP has converged, the system has reached a stable status. To measure this, the test will compare memory and CPU usage before and after the test starts, to determine whether the test passes. However, simply measuring BGP convergence may not be enough, as other protocols and daemons also need to stabilize after port flapping. For example, on one system, even though BGP had converged, it hadn't finished writing route entries to the ASIC, indicating it was not yet stable.

test case log indicates that BGP converged:

07:27:39 link_flap_utils.check_bgp_routes L0233 INFO | IPv4 routes: start 6410 end 6410, summary {'connected': {'routes': 7, 'FIB': 7}, 'ebgp': {'routes': 6403, 'FIB': 6403}, 'ibgp': {'routes': 0, 'FIB': 0}, 'Totals': {'routes': 6410, 'FIB': 6410}}
07:27:39 link_flap_utils.check_bgp_routes L0234 INFO | IPv6 routes: start 6447 end 6447, summary {'kernel': {'routes': 1, 'FIB': 1}, 'connected': {'routes': 42, 'FIB': 42}, 'static': {'routes': 1, 'FIB': 1}, 'ebgp': {'routes': 6403, 'FIB': 6403}, 'ibgp': {'routes': 0, 'FIB': 0}, 'Totals': {'routes': 6447, 'FIB': 6447}}

switch syslog shows that it still working on writing route entries:

Dec  4 07:27:39.099463 sonic INFO python[173623]: ansible-command Invoked with _raw_params=show ipv6 route kernel _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Dec  4 07:27:40.069207 sonic INFO python[173744]: ansible-command Invoked with _raw_params=show system-memory _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Dec  4 07:27:40.080244 sonic NOTICE syncd#SDK: [SAI_UTILS.NOTICE] ./src/mlnx_sai_utils.c[2319]- mlnx_bulk_statuses_print: [BULK_SET] 1000 Routes: 1000 success, 0 not executed, 0 failed
Dec  4 07:27:40.424019 sonic NOTICE syncd#SDK: [SAI_UTILS.NOTICE] ./src/mlnx_sai_utils.c[2319]- mlnx_bulk_statuses_print: [BULK_SET] 1000 Routes: 1000 success, 0 not executed, 0 failed
Dec  4 07:27:40.424019 sonic NOTICE syncd#SDK: :- threadFunction: time span 101 ms for 'bulkset:SAI_OBJECT_TYPE_ROUTE_ENTRY:1000'
Dec  4 07:27:40.518729 sonic NOTICE syncd#SDK: [SAI_UTILS.NOTICE] ./src/mlnx_sai_utils.c[2319]- mlnx_bulk_statuses_print: [BULK_SET] 1000 Routes: 1000 success, 0 not executed, 0 failed
Dec  4 07:27:40.698630 sonic INFO python[173840]: ansible-command Invoked with _raw_params=COLUMNS=512 show processes cpu | grep orchagent | awk '{print $1, $9}' _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Dec  4 07:27:40.834659 sonic INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: PortChannel101
Dec  4 07:27:40.838785 sonic INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: PortChannel102
Dec  4 07:27:40.842978 sonic INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: PortChannel103
Dec  4 07:27:40.847400 sonic INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: PortChannel104
Dec  4 07:27:40.904402 sonic INFO swss#supervisord: arp_update ping6: Warning: source address might be selected on device other than: Vlan1000
Dec  4 07:27:41.180576 sonic NOTICE syncd#SDK: message repeated 2 times: [ [SAI_UTILS.NOTICE] ./src/mlnx_sai_utils.c[2319]- mlnx_bulk_statuses_print: [BULK_SET] 1000 Routes: 1000 success, 0 not executed, 0 failed]
Dec  4 07:27:41.180576 sonic NOTICE syncd#SDK: [SAI_UTILS.NOTICE] ./src/mlnx_sai_utils.c[2319]- mlnx_bulk_statuses_print: [BULK_SET] 950 Routes: 950 success, 0 not executed, 0 failed
Dec  4 07:27:41.509547 sonic INFO python[173979]: ansible-command Invoked with _raw_params=redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/' _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Dec  4 07:27:41.796545 sonic NOTICE swss#orchagent: :- addRoutePost: Update Nexthop Group fc00::72@PortChannel101,fc00::76@PortChannel102,fc00::7a@PortChannel103
Dec  4 07:27:41.797013 sonic NOTICE swss#orchagent: :- removeNextHopGroup: Delete next hop group fc00::72@PortChannel101,fc00::76@PortChannel102,fc00::7a@PortChannel103
Dec  4 07:27:41.797789 sonic NOTICE syncd#SDK: [SAI_NEXT_HOP_GROUP.NOTICE] ./src/mlnx_sai_nexthopgroup.c[4616]- mlnx_remove_next_hop_group_members: Remove NEXT_HOP_GROUP_MEMBER [OID:0x2090000002D] [ID:8]
Dec  4 07:27:41.797800 sonic NOTICE syncd#SDK: [SAI_NEXT_HOP_GROUP.NOTICE] ./src/mlnx_sai_nexthopgroup.c[4616]- mlnx_remove_next_hop_group_members: Remove NEXT_HOP_GROUP_MEMBER [OID:0x1890000002D] [ID:6]
Dec  4 07:27:41.797800 sonic NOTICE syncd#SDK: [SAI_NEXT_HOP_GROUP.NOTICE] ./src/mlnx_sai_nexthopgroup.c[4616]- mlnx_remove_next_hop_group_members: Remove NEXT_HOP_GROUP_MEMBER [OID:0x1490000002D] [ID:5]
Dec  4 07:27:41.799236 sonic NOTICE syncd#SDK: [SAI_NEXT_HOP_GROUP.NOTICE] ./src/mlnx_sai_nexthopgroup.c[1211]- mlnx_remove_next_hop_group: Remove NEXT_HOP_GROUP [OID:0xC800000005] [ID:3]
Dec  4 07:27:41.804102 sonic INFO swss#supervisord: orchagent 

Therefore, the test should check multiple indicators to ensure the system has reached a truly stable state. Alternatively, the test could be toggling the BGP neighbor instead of the ports, as that BGP and route entry convergence could be a valid indicator.

Steps to reproduce the issue: 1. 2. 3.

Describe the results you received:

Describe the results you expected:

Additional information you deem important:

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
bingwang-ms commented 9 months ago

@keboliu I didn't see similar issue in internal nightly test. The test is consistently passing. On which platform did you see the test failure? @StormLiangMS FYI.

bingwang-ms commented 9 months ago

@keboliu What's the failure message if the issue happened?

roy-sror commented 6 months ago

@bingwang-ms - The test would fail with the following exception in case of failure: Failed: Redis Memory Increase more than expected: 43.527430221366686

roy-sror commented 5 months ago

@bingwang-ms - can you please help in prioritizing this issue? the test is currently skipped on this bug.

bingwang-ms commented 3 months ago

Hi @roy-sror, I double checked the testing record of this particular test case. I only see two similar failures in the last week. The failure was because the increment of redis memory usage is slightly above the threshold (5%).

SavchukRomanLv commented 2 months ago

Hi @bingwang-ms we do not see now 43% increase memory, we do see 6-7% as you do. Test is flaky

bingwang-ms commented 2 months ago

@keboliu Is there a way to check if the routes are programmed into ASIC? Changing the threshold is easy, but that's not the best solution.

keboliu commented 2 months ago

@keboliu Is there a way to check if the routes are programmed into ASIC? Changing the threshold is easy, but that's not the best solution.

Can we check all the route entries in the ASIC_DB?

bingwang-ms commented 1 month ago

I don't think it's good enough because even if routes are in ASIC_DB, that doesn't mean the routes are programmed to ASIC. Since we see 6% - 7% memory utilization usage increment, how about changing the threshold to 8% to reduce flakiness? @StormLiangMS Can you please comment?

bingwang-ms commented 1 month ago

@keboliu Will this issue be addressed by https://github.com/sonic-net/sonic-mgmt/pull/13066?