Open keboliu opened 9 months ago
@keboliu I didn't see a similar issue in the internal nightly test; the test passes consistently. On which platform did you see the test failure? @StormLiangMS FYI.
@keboliu What's the failure message if the issue happened?
@bingwang-ms - The test would fail with the following exception in case of failure:
Failed: Redis Memory Increase more than expected: 43.527430221366686
@bingwang-ms - Can you please help prioritize this issue? The test is currently skipped because of this bug.
Hi @roy-sror, I double-checked the test records for this particular test case. I only see two similar failures in the last week. Both failed because the increase in Redis memory usage was slightly above the threshold (5%).
Failed: Redis Memory Increase more than expected: 5.306888233717493
Failed: Redis Memory Increase more than expected: 5.531570426335664
So what's the failure rate on your side? And how large is the memory increase?
Hi @bingwang-ms, we no longer see a 43% memory increase; we see 6-7%, as you do. The test is flaky.
@keboliu Is there a way to check whether the routes are programmed into the ASIC? Changing the threshold is easy, but that's not the best solution.
Can we check all the route entries in the ASIC_DB?
I don't think that's good enough, because even if the routes are in ASIC_DB, that doesn't mean they have been programmed into the ASIC. Since we see a 6-7% increase in memory usage, how about raising the threshold to 8% to reduce flakiness? @StormLiangMS Can you please comment?
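To make the threshold proposal concrete, here is a minimal sketch of how the values reported in this thread would fare against a 5% and an 8% limit. The helper names `percent_increase` and `exceeds_threshold` are hypothetical, and the formula is an assumption about how the test computes the increase, not a copy of the test code.

```python
def percent_increase(before, after):
    """Relative growth of `after` over `before`, in percent (assumed formula)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before * 100.0


def exceeds_threshold(increase_pct, threshold_pct):
    """True when the measured increase is strictly above the allowed limit."""
    return increase_pct > threshold_pct


# Increases taken from the failure messages quoted in this thread.
reported = [5.306888233717493, 5.531570426335664, 43.527430221366686]

for threshold in (5.0, 8.0):
    failing = [v for v in reported if exceeds_threshold(v, threshold)]
    print(f"threshold {threshold}%: {len(failing)} of {len(reported)} samples fail")
```

With these three samples, all of them fail at 5%, and only the 43.5% sample fails at 8%. A higher limit would hide the small overshoots, but it would not tell us whether the system had actually settled when the measurement was taken.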
@keboliu Will this issue be addressed by https://github.com/sonic-net/sonic-mgmt/pull/13066?
Description
This test case assumes that once all route entries are re-learned and BGP has converged, the system has reached a stable state. To check this, the test compares memory and CPU usage before and after the test to determine whether it passes. However, measuring BGP convergence alone may not be enough, as other protocols and daemons also need to stabilize after port flapping. For example, on one system, even though BGP had converged, the system had not finished writing route entries to the ASIC, indicating it was not yet stable.
The test case log indicates that BGP converged:
The switch syslog shows that it is still writing route entries:
Therefore, the test should check multiple indicators to ensure the system has reached a truly stable state. Alternatively, the test could toggle the BGP neighbors instead of the ports, since BGP and route entry convergence would then be a valid indicator.
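One way to check multiple indicators is to poll several probes and only take the "after" measurement once all of them have stopped changing. The sketch below is illustrative only: `wait_until_stable` and the probe names are hypothetical, the probes are plain callables standing in for whatever the test can actually query, and exact-equality comparison assumes the probes return discrete values such as counts.

```python
import time


def wait_until_stable(probes, required_stable_polls=3, interval=5.0,
                      timeout=300.0, sleep=time.sleep, clock=time.monotonic):
    """Poll every probe until all values are unchanged for
    `required_stable_polls` consecutive polls.

    Returns True if that happens before `timeout` seconds, else False.
    """
    deadline = clock() + timeout
    previous = None
    stable_polls = 0
    while clock() < deadline:
        # Take one snapshot of all indicators.
        current = {name: probe() for name, probe in probes.items()}
        if current == previous:
            stable_polls += 1
            if stable_polls >= required_stable_polls:
                return True
        else:
            # Any change in any indicator resets the counter.
            stable_polls = 0
        previous = current
        sleep(interval)
    return False


# Simulated probes: the route entry count keeps growing after BGP has settled.
route_counts = iter([100, 400, 900, 1000, 1000, 1000, 1000, 1000])
probes = {
    "route_entries": lambda: next(route_counts),  # hypothetical indicator
    "bgp_established_neighbors": lambda: 4,       # hypothetical indicator
}
print(wait_until_stable(probes, interval=0, timeout=5, sleep=lambda s: None))
```

In the simulated run, the BGP indicator is constant from the first poll, but the function only returns True after the route count has also stayed the same for three consecutive polls. Which indicators are reliable on a real switch is still the open question in this issue.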