sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[voq][Chassis] test_hash fails intermittently on a chassis with J2C+ linecards #13308

Open mannytaheri opened 1 year ago

mannytaheri commented 1 year ago

Description

The sonic-mgmt FIB tests test_hash[ipv4] and/or test_hash[ipv6] fail intermittently on a chassis with J2C+ (DNX) linecards. This has been seen on both 100G and 400G Nokia linecards, in both the T2 min and T2 topologies.

Steps to reproduce the issue:

  1. Run sonic-mgmt test tests/fib/test_hash.py against a T2 chassis with J2C+ linecards

Describe the results you received:

The test fails because traffic is not hashed evenly across the ECMP routes or across the member ports of a LAG.

Below is an example of the failure exception seen in PTF:

        "/root/env-python3/bin/ptf:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses", 
        "  import imp", 
        "/root/env-python3/lib/python3.7/site-packages/scapy/layers/ipsec.py:471: CryptographyDeprecationWarning: Blowfish has been deprecated", 
        "  cipher=algorithms.Blowfish,", 
        "/root/env-python3/lib/python3.7/site-packages/scapy/layers/ipsec.py:485: CryptographyDeprecationWarning: CAST5 has been deprecated", 
        "  cipher=algorithms.CAST5,", 
        "hash_test.HashTest ... FAIL", 
        "", 
        "======================================================================", 
        "FAIL: hash_test.HashTest", 
        "----------------------------------------------------------------------", 
        "Traceback (most recent call last):", 
        "  File \"ptftests/py3/hash_test.py\", line 464, in runTest", 
        "    self.check_hash(hash_key)", 
        "  File \"ptftests/py3/hash_test.py\", line 175, in check_hash", 
        "    self.check_balancing(next_hop.get_next_hop(), hit_count_map)", 
        "  File \"ptftests/py3/hash_test.py\", line 454, in check_balancing", 
        "    assert result", 
        "AssertionError", 
        "", 
        "----------------------------------------------------------------------", 
        "Ran 1 test in 181.168s", 
        "", 
        "FAILED (failures=1)"

Below is the distribution of traffic on one of the runs against T2 topology:

15:23:30.582  root      : INFO    : type         port(s)            exp_cnt         act_cnt         diff(%)
15:23:30.582  root      : INFO    : ECMP         [0, 1]                 865             938           8.39%
15:23:30.582  root      : INFO    : LAG          0                      469             352         -24.95%
15:23:30.582  root      : INFO    : LAG          1                      469             586          24.95%
15:23:30.583  root      : INFO    : ECMP         [2, 3]                 865             743         -14.14%
15:23:30.583  root      : INFO    : LAG          2                      371             439          18.17%
15:23:30.583  root      : INFO    : LAG          3                      371             304         -18.17%
15:23:30.583  root      : INFO    : ECMP         [4, 5]                 865             736         -14.95%
15:23:30.583  root      : INFO    : LAG          4                      368             350          -4.89%
15:23:30.583  root      : INFO    : LAG          5                      368             386           4.89%
15:23:30.583  root      : INFO    : ECMP         [6, 7]                 865             887            2.5%
15:23:30.583  root      : INFO    : LAG          6                      443             442      -0.33999999999999997%
15:23:30.583  root      : INFO    : LAG          7                      443             445      0.33999999999999997%
15:23:30.583  root      : INFO    : ECMP         [8, 9]                 865             823           -4.9%
15:23:30.583  root      : INFO    : LAG          8                      411             442           7.41%
15:23:30.583  root      : INFO    : LAG          9                      411             381          -7.41%
15:23:30.583  root      : INFO    : ECMP         [10]                   865             810           -6.4%
15:23:30.583  root      : INFO    : ECMP         [11]                   865             918           6.08%
15:23:30.583  root      : INFO    : ECMP         [12]                   865             838          -3.16%
15:23:30.583  root      : INFO    : ECMP         [13]                   865             843          -2.59%
15:23:30.583  root      : INFO    : ECMP         [14]                   865             819          -5.36%
15:23:30.583  root      : INFO    : ECMP         [15]                   865             893           3.19%
15:23:30.583  root      : INFO    : ECMP         [16]                   865             926      7.000000000000001%
15:23:30.583  root      : INFO    : ECMP         [17]                   865             890      2.8400000000000003%
15:23:30.583  root      : INFO    : ECMP         [18, 19]               865             918           6.08%
15:23:30.583  root      : INFO    : LAG          18                     459             467      1.7399999999999998%
15:23:30.583  root      : INFO    : LAG          19                     459             451      -1.7399999999999998%
15:23:30.583  root      : INFO    : ECMP         [20, 21]               865            1049          21.22%
15:23:30.583  root      : INFO    : LAG          20                     524             478      -8.870000000000001%
15:23:30.583  root      : INFO    : LAG          21                     524             571      8.870000000000001%
15:23:30.583  root      : INFO    : ECMP         [22, 23]               865             939           8.51%
15:23:30.583  root      : INFO    : LAG          22                     469             458          -2.45%
15:23:30.583  root      : INFO    : LAG          23                     469             481           2.45%
15:23:30.583  root      : INFO    : ECMP         [24, 25]               865             912      5.390000000000001%
15:23:30.583  root      : INFO    : LAG          24                     456             377         -17.32%
15:23:30.583  root      : INFO    : LAG          25                     456             535          17.32%
15:23:30.583  root      : INFO    : ECMP         [26, 27]               865             892           3.08%
15:23:30.583  root      : INFO    : LAG          26                     446             439      -1.5699999999999998%
15:23:30.583  root      : INFO    : LAG          27                     446             453      1.5699999999999998%
15:23:30.583  root      : INFO    : ECMP         [28]                   865             988          14.17%
15:23:30.583  root      : INFO    : ECMP         [29]                   865             875           1.11%
15:23:30.583  root      : INFO    : ECMP         [30]                   865             780          -9.87%
15:23:30.583  root      : INFO    : ECMP         [31]                   865             818          -5.48%
15:23:30.583  root      : INFO    : ECMP         [32]                   865             893           3.19%
15:23:30.584  root      : INFO    : ECMP         [33]                   865             671         -22.46%
15:23:30.584  root      : INFO    : ECMP         [34]                   865             960          10.93%
15:23:30.584  root      : INFO    : ECMP         [35]                   865             741         -14.37%
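
To make the table above easier to scan, here is a minimal sketch that parses rows in this log format and lists the entries whose actual count deviates from the expected count by more than a chosen fraction. It is not the logic of check_balancing in hash_test.py; the names ROW, parse_rows, deviation and outliers are hypothetical, and the threshold is a placeholder argument rather than the value the test uses. The sample rows are copied from the log above.

```python
import re

# Matches data rows such as:
#   "... INFO    : LAG          0        469        352      -24.95%"
# The row layout is inferred from the log excerpt in this issue.
ROW = re.compile(
    r"INFO\s*:\s*(?P<type>ECMP|LAG)\s+(?P<ports>\[[\d, ]+\]|\d+)\s+"
    r"(?P<exp>\d+)\s+(?P<act>\d+)\s+-?[\d.]+%"
)


def parse_rows(log_text):
    """Return (type, ports, expected, actual) for every data row found."""
    rows = []
    for line in log_text.splitlines():
        m = ROW.search(line)
        if m:
            rows.append((m["type"], m["ports"], int(m["exp"]), int(m["act"])))
    return rows


def deviation(expected, actual):
    """Relative deviation of the actual count from the expected count."""
    return (actual - expected) / expected


def outliers(rows, threshold):
    """Rows whose absolute deviation exceeds threshold (a hypothetical cutoff)."""
    return [
        (kind, ports, round(deviation(exp, act) * 100, 2))
        for kind, ports, exp, act in rows
        if abs(deviation(exp, act)) > threshold
    ]


SAMPLE = """\
15:23:30.582  root      : INFO    : type         port(s)            exp_cnt         act_cnt         diff(%)
15:23:30.582  root      : INFO    : LAG          0                      469             352         -24.95%
15:23:30.582  root      : INFO    : LAG          1                      469             586          24.95%
15:23:30.583  root      : INFO    : ECMP         [6, 7]                 865             887            2.5%
"""

if __name__ == "__main__":
    # 0.20 is an illustrative cutoff only, not taken from the test.
    for kind, ports, diff_pct in outliers(parse_rows(SAMPLE), threshold=0.20):
        print(f"{kind} {ports}: {diff_pct}%")
```

The deviation is recomputed from the integer counts instead of reusing the logged diff(%) column, so it can differ slightly from the logged value where the logged expected count appears to have been rounded.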

Describe the results you expected:

Test test_hash.py should always pass with traffic distributed evenly across the LAG and ECMP routes.

Output of show version:

Latest sonic-buildimage based off of 202205 branch.

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

rlhui commented 1 year ago

Sandeep, please open a BRCM case to double-check the load balancing in the ASIC. Thanks.

sanmalho-git commented 1 year ago

CS00012280800 opened for tracking this issue with BRCM

sanjair-git commented 1 year ago