sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
200 stars 730 forks source link

[T2] Various Tests failed because of routeCheck during router events such as flapping/reload etc. #10147

Open ysmanman opened 1 year ago

ysmanman commented 1 year ago

Description

We noticed this in recent 202205 T2 testing. Tests failed because routeCheck failed. This failure was seen in pre-test sanity check and also loganalyzer check. For example:

Failure in pre-test sanity check:

         E                       Failed: !!!!!!!!!!!!!!!! Pre-test sanity check after recovery failed: !!!!!!!!!!!!!!!!
         E                       [
         E                           {
         E                               "check_item": "monit",
         E                               "failed": true,
         E                               "host": "cmp227-5",
         E                               "services_status": {
         E                                   "container_memory_telemetry": "Status ok",
         E                                   "cmp227-5": "Running",
         E                                   "dualtorNeighborCheck": "Status ok",
         E                                   "root-overlay": "Accessible",
         E                                   "container_checker": "Status ok",
         E                                   "vnetRouteCheck": "Status ok",
         E                                   "var-log": "Accessible",
         E                                   "rsyslog": "Running",
         E                                   "memory_check": "Status ok",
         E                                   "container_memory_snmp": "Status ok",
         E                                   "diskCheck": "Status ok",
         E                                   "routeCheck": "Status failed"
         E                               }
         E                           }
         E                       ]

Failure in loganalyzer:

         E               Failed: Processes "['analyze_logs--<MultiAsicSonicHost cmp227-5>']" failed with exit code "1"
         E               Exception:
         E               expected_match: 0
         E               expected_missing_match: 0
         E               match: 1
         E
         E               Match Messages:
         E               Sep 24 06:30:08.534487 cmp227-5 ERR monit[286219]: 'routeCheck' status failed (255) -- Failure results: {{#012    "missed_ROUTE_TABLE_routes": [#012        "10.0.0.133/32",#012        "100.1.0.67/32",#012        "2064:100::43/128",#012        "\
fc00::10a/128"#012    ]#012}}#012Failed. Look at reported mismatches above#012add: []#012del: []
         E
         E               Traceback:
         E               Traceback (most recent call last):
         E                 File "/data/tests/common/helpers/parallel.py", line 31, in run
         E                   Process.run(self)
         E                 File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
         E                   self._target(*self._args, **self._kwargs)
         E                 File "/data/tests/common/helpers/parallel.py", line 243, in wrapper
         E                   target(*args, **kwargs)
         E                 File "/data/tests/common/plugins/loganalyzer/__init__.py", line 39, in analyze_logs
         E                   dut_analyzer.analyze(markers[node.hostname])
         E                 File "/data/tests/common/plugins/loganalyzer/loganalyzer.py", line 373, in analyze
         E                   self._verify_log(analyzer_summary)
         E                 File "/data/tests/common/plugins/loganalyzer/loganalyzer.py", line 136, in _verify_log
         E                   raise LogAnalyzerError(result_str)
         E               LogAnalyzerError: expected_match: 0
         E               expected_missing_match: 0
         E               match: 1
         E
         E               Match Messages:
         E               Sep 24 06:30:08.534487 cmp227-5 ERR monit[286219]: 'routeCheck' status failed (255) -- Failure results: {{#012    "missed_ROUTE_TABLE_routes": [#012        "10.0.0.133/32",#012        "100.1.0.67/32",#012        "2064:100::43/128",#012        "\
fc00::10a/128"#012    ]#012}}#012Failed. Look at reported mismatches above#012add: []#012del: []

The above failure was seen in various tests, like pc, platform_tests, tacacs, route, pfcwd, voq, qos.

After all tests were done, we checked the output of monit status on all LCs, and routeCheck passed. So the issue seem eventually got recoverd/fixed.

Steps to reproduce the issue: 1. 2. 3.

Describe the results you received:

Describe the results you expected:

Additional information you deem important:

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
ysmanman commented 1 year ago

Add @kenneth-arista @arlakshm for visibility.

arlakshm commented 1 year ago

Hi @ysmanman, do you still see this issue?

arlakshm commented 12 months ago

This issue is still, but need more triage to find a way to reproduce this issue.

veronica-arista commented 10 months ago

We see routeCheck failures intermittently in various tests, typically from LogAlalyzer during the test teardown. It is difficult to debug the causes in each case, because they are hard to intentionally reproduce so we can check the syslog and the switch state for the cause.



In some cases, the test does something that would cause routeCheck fails, such as flapping interfaces. E.g. tests/pc/test_lag_2.py which explicitly adds a loganalyzer ignore regex for missing routes:

loganalyzer[rand_one_dut_hostname].ignore_regex.extend([".*missed_ROUTE_TABLE_routes.*"])

However, we are still seeing fails in this test due to missed_ROUTE_TABLE_routes logs found by LogAnalyzer. I’m not sure why this is not working as intended. Additionally, we see related fails due to Unaccounted_ROUTE_ENTRY_TABLE_entries which is not currently included in the ignore regex.

In other tests, the routeCheck fails are much less frequent and possibly due to bad timing. E.g. we observed a routeCheck error in tests/voq/test_voq_chassis_app_db_consistency.py which was due to a routeCheck running right after the reboot. This test does add and delete a temp portchannel, but does not currently have any loganalyzer ignore regex for routeCheck fails.

Would it make sense to blanket ignore these routeCheck fails in a wider set of tests? Alternatively, could we disable routeCheck altogether during testing, since many tests perform reboots and/or interface flapping?

rlhui commented 10 months ago

We see routeCheck failures intermittently in various tests, typically from LogAlalyzer during the test teardown. It is difficult to debug the causes in each case, because they are hard to intentionally reproduce so we can check the syslog and the switch state for the cause.



In some cases, the test does something that would cause routeCheck fails, such as flapping interfaces. E.g. tests/pc/test_lag_2.py which explicitly adds a loganalyzer ignore regex for missing routes:

loganalyzer[rand_one_dut_hostname].ignore_regex.extend([".*missed_ROUTE_TABLE_routes.*"])

However, we are still seeing fails in this test due to missed_ROUTE_TABLE_routes logs found by LogAnalyzer. I’m not sure why this is not working as intended. Additionally, we see related fails due to Unaccounted_ROUTE_ENTRY_TABLE_entries which is not currently included in the ignore regex.

In other tests, the routeCheck fails are much less frequent and possibly due to bad timing. E.g. we observed a routeCheck error in tests/voq/test_voq_chassis_app_db_consistency.py which was due to a routeCheck running right after the reboot. This test does add and delete a temp portchannel, but does not currently have any loganalyzer ignore regex for routeCheck fails.

Would it make sense to blanket ignore these routeCheck fails in a wider set of tests? Alternatively, could we disable routeCheck altogether during testing, since many tests perform reboots and/or interface flapping?

@yxieca, @wangxin please review this and comment?

deepak-singhal0408 commented 10 months ago

Currently monit check for route generates syslog/alerts every 5 cycle (5 mins) if the routecheck fails for 3 or more cycles (each cycle is 1 min)