Open ysmanman opened 1 year ago
Add @kenneth-arista @arlakshm for visibility.
Hi @ysmanman, do you still see this issue?
This issue still occurs, but it needs more triage to find a way to reproduce it.
We see intermittent routeCheck failures in various tests, typically reported by LogAnalyzer during test teardown. The cause is difficult to debug in each case because the failures are hard to reproduce on demand, so we cannot inspect the syslog and the switch state when they occur.
In some cases, the test does something that is expected to cause routeCheck failures, such as flapping interfaces. For example, tests/pc/test_lag_2.py explicitly adds a loganalyzer ignore regex for missing routes:
loganalyzer[rand_one_dut_hostname].ignore_regex.extend([".*missed_ROUTE_TABLE_routes.*"])
However, we still see failures in this test due to missed_ROUTE_TABLE_routes logs found by LogAnalyzer. I'm not sure why the ignore regex is not working as intended. Additionally, we see related failures due to Unaccounted_ROUTE_ENTRY_TABLE_entries, which is not currently included in the ignore regex.
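To make the gap concrete, here is a minimal sketch of how the existing pattern behaves against the two message types. It assumes the ignore regexes are applied line by line with search semantics, which I have not verified against the LogAnalyzer implementation. The sample log lines and the helper names (`is_ignored`, `unignored`) are made up for illustration, and the second pattern is only a suggestion.

```python
import re

# Pattern taken from tests/pc/test_lag_2.py, as quoted above.
CURRENT_IGNORE = [r".*missed_ROUTE_TABLE_routes.*"]

# Hypothetical extension that also covers the second message type.
EXTENDED_IGNORE = CURRENT_IGNORE + [r".*Unaccounted_ROUTE_ENTRY_TABLE_entries.*"]

# Made-up log lines for illustration only; real routeCheck output will differ.
SAMPLE_LINES = [
    'ERR monit: routeCheck "missed_ROUTE_TABLE_routes": ["10.0.0.0/31"]',
    'ERR monit: routeCheck "Unaccounted_ROUTE_ENTRY_TABLE_entries": ["10.0.0.2/31"]',
]


def is_ignored(line, patterns):
    """Return True if any ignore pattern matches the line (assumed search semantics)."""
    return any(re.search(pattern, line) for pattern in patterns)


def unignored(lines, patterns):
    """Return the lines that would still be reported as errors."""
    return [line for line in lines if not is_ignored(line, patterns)]


if __name__ == "__main__":
    print("current list leaves:", unignored(SAMPLE_LINES, CURRENT_IGNORE))
    print("extended list leaves:", unignored(SAMPLE_LINES, EXTENDED_IGNORE))
```

In a test, the extended list would be passed through the same call the test already uses, i.e. loganalyzer[rand_one_dut_hostname].ignore_regex.extend(EXTENDED_IGNORE). Note this only addresses the Unaccounted_ROUTE_ENTRY_TABLE_entries case; it does not explain why missed_ROUTE_TABLE_routes is still being reported.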
In other tests, routeCheck failures are much less frequent and are possibly due to bad timing. For example, we observed a routeCheck error in tests/voq/test_voq_chassis_app_db_consistency.py that was caused by routeCheck running right after the reboot. This test does add and delete a temporary portchannel, but it does not currently have any loganalyzer ignore regex for routeCheck failures.
Would it make sense to blanket-ignore these routeCheck failures in a wider set of tests? Alternatively, could we disable routeCheck altogether during testing, since many tests perform reboots and/or interface flapping?
@yxieca, @wangxin could you please review this and comment?
Currently, the monit check for routes generates syslog alerts every 5 cycles (5 minutes) if routeCheck fails for 3 or more cycles (each cycle is 1 minute).
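As a rough illustration of what that cadence means for tests, the sketch below models one possible reading of the behaviour described above: the check runs once per 1-minute cycle, an alert is logged once the failure streak reaches 3 cycles, and it is repeated at most once every 5 cycles while the failure persists. This is my interpretation, not the actual monit configuration, and `alert_cycles` and its parameters are hypothetical names.

```python
def alert_cycles(results, fail_threshold=3, repeat_every=5):
    """Return the 1-based cycle numbers at which an alert would be logged.

    results: iterable of booleans, one per 1-minute cycle (True = routeCheck passed).
    Assumed model only: alert once the failure streak reaches fail_threshold,
    then repeat at most once every repeat_every cycles until the check passes.
    """
    alerts = []
    streak = 0
    last_alert = None
    for cycle, passed in enumerate(results, start=1):
        if passed:
            # A passing check resets both the streak and the repeat timer.
            streak = 0
            last_alert = None
            continue
        streak += 1
        due = last_alert is None or cycle - last_alert >= repeat_every
        if streak >= fail_threshold and due:
            alerts.append(cycle)
            last_alert = cycle
    return alerts


if __name__ == "__main__":
    # Two-minute blip, e.g. a short interface flap: no alert under this model.
    print(alert_cycles([True, False, False, True, True]))
    # Ten minutes of sustained failure.
    print(alert_cycles([False] * 10))
```

Under this model, a disruption that clears within two cycles never reaches syslog, so a test that trips LogAnalyzer would have had routes out of sync for at least about three minutes.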
Description
We noticed this in recent 202205 T2 testing. Tests failed because routeCheck failed. The failure was seen both in the pre-test sanity check and in the loganalyzer check. For example:
Failure in pre-test sanity check:
Failure in loganalyzer:
The above failure was seen in various tests, like pc, platform_tests, tacacs, route, pfcwd, voq, qos.
After all tests were done, we checked the output of monit status on all LCs, and routeCheck passed. So the issue seems to have eventually recovered.
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Additional information you deem important: