sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
201 stars 728 forks source link

[Arista] crm/test_crm.py test failure #6514

Open yaqiangz opened 2 years ago

yaqiangz commented 2 years ago

Description Test failures seen in 720dt.

Steps to reproduce the issue:

  1. Run test crm/test_crm.py
  2. Manually run command to add route and verify crm counter.

Describe the results you received:

  1. test results 1.1 crm/test_crm.py::test_crm_route failed

              # Verify "crm_stats_ipv[4/6]_route_used" counter was incremented
              if not (new_crm_stats_route_used - crm_stats_route_used == total_routes):
                  for i in range(total_routes):
                      RESTORE_CMDS["test_crm_route"].append(route_del_cmd.format(asichost.ip_cmd, i, nh_ip))
                  pytest.fail("\"crm_stats_ipv{}_route_used\" counter was not incremented".format(ip_ver))
              # Verify "crm_stats_ipv[4/6]_route_available" counter was decremented
              if not (crm_stats_route_available - new_crm_stats_route_available >= 1):
                  for i in range(total_routes):
                      RESTORE_CMDS["test_crm_route"].append(route_del_cmd.format(asichost.ip_cmd, i, nh_ip))
      >           pytest.fail("\"crm_stats_ipv{}_route_available\" counter was not decremented".format(ip_ver))
      E           Failed: "crm_stats_ipv6_route_available" counter was not decremented
    
      add_routes_template = <Template memory:7fbe1fc9e6d0>
      add_template = "\n    \n    {% set ns_prefix = '' %}\n    {% set ns_option = '-n '%}\n    {% if namespace %}\n    {% set ns_prefix = ...s_list}}\n    do\n        ip {{ns_prefix}} route add ${s} dev {{interface}}\n        echo added route - ${s}\n    done"
      asichost   = <SonicAsic 0>
      cmd        = 'sudo ip  -6 neigh show dev PortChannel101 nud reachable nud stale             | grep -v fe80'
      crm_interface = (u'PortChannel101', u'PortChannel103')
      crm_stats_route_available = 1115
      crm_stats_route_used = 954
      del_routes_template = <Template memory:7fbe240dd750>
      del_template = "\n    \n    {% set ns_prefix = '' %}\n    {% set ns_option = '-n '%}\n    {% if namespace %}\n    {% set ns_prefix = ...list}}\n    do\n        ip {{ns_prefix}} route del ${s} dev {{interface}}\n        echo deleted route - ${s}\n    done"
      duthost    = <MultiAsicSonicHost bjw-can-720dt-1>
      duthosts   = [<MultiAsicSonicHost bjw-can-720dt-1>]
      enum_frontend_asic_index = None
      enum_rand_one_per_hwsku_frontend_hostname = 'bjw-can-720dt-1'
      get_route_stats = 'sonic-db-cli  COUNTERS_DB HMGET                             CRM:STATS crm_stats_ipv6_route_used
           crm_stats_ipv6_route_available'
      i          = 0
      ip_ver     = '6'
      item       = 'fc00::72 lladdr 26:b6:d8:18:1e:f0 router REACHABLE'
      new_crm_stats_route_available = 1115
      new_crm_stats_route_used = 955
      nh_ip      = 'fc00::72'
      out        = {'stderr_lines': [], u'cmd': u'sudo ip  -6 neigh show dev PortChannel101 nud r...'become_method', and 'become_user' rather than running sudo"], 'failed': False}
      route_add  = 'sudo ip  -6 route add 2001:0::/126 via fc00::72'
      route_add_cmd = '{} -6 route add 2001:{}::/126 via {}'
      route_del_cmd = '{} -6 route del 2001:{}::/126 via {}'
      total_routes = 1
    
      crm/test_crm.py:504: Failed

    1.2 crm/test_crm.py::test_crm_nexthop_group failed

    
    self = <tests.common.plugins.loganalyzer.loganalyzer.LogAnalyzer instance at 0x7fa005d9bb40>
    result = {'expect_messages': {'/tmp/syslog.bjw-can-720dt-1.2022-10-11-02:37:27': []}, 'match_files': {'/tmp/syslog.bjw-can-720d...jw-can-720dt-1.2022-10-11-02:37:27': []}, 'total': {'expected_match': 0, 'expected_missing_match': 1, 'match': 0}, ...}
    def _verify_log(self, result):
    """
    Verify that total match and expected missing match equals to zero or raise exception otherwise.
    Verify that expected_match is not equal to zero when there is configured expected regexp in self.expect_regex list
    """
    if not result:
            raise LogAnalyzerError("Log analyzer failed - no result.")
        else:
            result_str = self._results_repr(result)
            if result["total"]["match"] != 0 or result["total"]["expected_missing_match"] != 0:
    >               raise LogAnalyzerError(result_str)
    E               LogAnalyzerError: expected_match: 0
    E               expected_missing_match: 1
    E               match: 0
    E
    E               Expected Messages that are missing:
    E               .* THRESHOLD_EXCEEDED .*

result = {'expect_messages': {'/tmp/syslog.bjw-can-720dt-1.2022-10-11-02:37:27': []}, 'match_files': {'/tmp/syslog.bjw-can-720d...jw-can-720dt-1.2022-10-11-02:37:27': []}, 'total': {'expected_match': 0, 'expected_missing_match': 1, 'match': 0}, ...} result_str = 'expected_match: 0\nexpected_missing_match: 1\nmatch: 0\n\nExpected Messages that are missing:\n. THRESHOLD_EXCEEDED .' self = <tests.common.plugins.loganalyzer.loganalyzer.LogAnalyzer instance at 0x7fa005d9bb40>

2. As shown below, after adding IPv6 route, "crm_stats_ipv6_route_used" counter was incremented, but "crm_stats_ipv6_route_available" counter **was not decremented** (even if wait more than 1 minute)
![image](https://user-images.githubusercontent.com/26435361/194991208-8faa3154-ece6-486a-8a73-926b53bbdc29.png)

**Describe the results you expected:**
1. crm/test_crm.py passed
2. after adding IPv6 route, "crm_stats_ipv6_route_used" counter was incremented and "crm_stats_ipv6_route_available" counter was decremented

**Additional information you deem important:**
<!--
software/ASIC/Hardware Flatform version and info
-->
    **Output of `show version`:**
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
Blueve commented 2 years ago

For test_crm_nexthop_group, I observe below log errs:

E               Oct 18 14:30:52.643831 bjw-can-720dt-1 ERR syncd#syncd: [none] SAI_API_NEXT_HOP_GROUP:brcm_sai_xgs_nexthop_group_create:73 ecmp nh group create failed with error Table full (0xfffffffa).
E               
E               Oct 18 14:30:52.643831 bjw-can-720dt-1 ERR syncd#syncd: [none] SAI_API_NEXT_HOP_GROUP:brcm_sai_create_next_hop_group:147 pd nexthop group create failed failed with error -13.
E               
E               Oct 18 14:30:52.643831 bjw-can-720dt-1 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_TABLE_FULL
E               
E               Oct 18 14:30:52.643991 bjw-can-720dt-1 ERR syncd#syncd: :- processQuadEvent: attr: SAI_NEXT_HOP_GROUP_ATTR_TYPE: SAI_NEXT_HOP_GROUP_TYPE_DYNAMIC_UNORDERED_ECMP
E               
E               Oct 18 14:30:52.644412 bjw-can-720dt-1 ERR swss#orchagent: :- create: create status: SAI_STATUS_TABLE_FULL
E               
E               Oct 18 14:30:52.644412 bjw-can-720dt-1 ERR swss#orchagent: :- addNextHopGroup: Failed to create next hop group 2.0.0.1@PortChannel101,2.0.0.130@PortChannel101, rv:-13
E               
E               Oct 18 14:30:52.644412 bjw-can-720dt-1 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_TABLE_FULL
E               
E               Oct 18 14:31:54.200837 bjw-can-720dt-1 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (1.0 minutes).
E               
E               Oct 18 14:32:54.266360 bjw-can-720dt-1 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (2.0 minutes).
Blueve commented 2 years ago

For test_crm_fdb_entry, I observed below logs:

E               Oct 18 14:34:54.391196 bjw-can-720dt-1 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (4.0 minutes).
E               
E               Oct 18 14:35:54.456514 bjw-can-720dt-1 ERR swss#supervisor-proc-exit-listener: Process 'orchagent' is not running in namespace 'host' (5.0 minutes).
andywongarista commented 2 years ago

The test_crm_route is likely a SAI issue as the available counter comes from a SAI call. For test_crm_nexthop_group, it seems that the test will try to create more nexthop groups than the BCM sdk can handle, hence the "Table full" error that causes orchagent to crash.