sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
196 stars 718 forks source link

[Bug]: Tests taking long time to run, number of testcases multiplied in some cases #13367

Closed vperumal closed 3 months ago

vperumal commented 4 months ago

Issue Description

Tests are taking long time to run in master and in many cases number of testcases have multiplied.

In case of CACL - time has increased from 1:26 mins to 7:14 mins. Number of testcases have jumped from 45 to 321. Similarly for CRM it has jumped from 87 to 284. It looks very similar to fixtures issue that we had hit before during acl run. Notice the DUT combinations between both runs - From Per DUT to DUT1-DUT2, DUT0-DUT1 etc combinations. As the number of DUTs increase these combinations will increase.

For e.g Master run: PASSED cacl/test_cacl_application.py::test_cacl_scale_rules_ipv6[aaa14-lc1-aaa14-lc1] PASSED cacl/test_cacl_application.py::test_cacl_scale_rules_ipv4[aaa14-lc1-aaa14-lc0] PASSED cacl/test_cacl_application.py::test_cacl_scale_rules_ipv6[aaa14-lc1-aaa14-lc0] PASSED cacl/test_cacl_application.py::test_cacl_scale_rules_ipv4[aaa14-lc0-aaa14-lc0] PASSED cacl/test_cacl_application.py::test_cacl_scale_rules_ipv6[aaa14-lc0-aaa14-lc0] ===== 40 failed, 136 passed, 144 skipped, 1 warning in 26047.29s (7:14:07) =====

For 202305 run: PASSED cacl/test_cacl_application.py::test_cacl_application_nondualtor[sfd-t2-lc2-2] PASSED cacl/test_cacl_application.py::test_multiasic_cacl_application[sfd-t2-lc2-2] PASSED cacl/test_cacl_application.py::test_cacl_application_nondualtor[sfd-t2-sup-None] PASSED cacl/test_cacl_application.py::test_multiasic_cacl_application[sfd-t2-sup-None] ============ 26 passed, 18 skipped, 1 warning in 5193.68s (1:26:33) ============

Results you see

Increase time and testcase counts

Results you expected to see

Reduce time and testcase count

Is it platform specific

generic

Relevant log output

No

Output of show version

SONiC Software Version: SONiC.202405.12486-dirty-20240613.151806
SONiC OS Version: 12
Distribution: Debian 12.5
Kernel: 6.1.0-11-2-amd64
Build commit: 2a8362fac
Build date: Fri Jun 14 03:47:06 UTC 2024
Built by: sonicci@sonic-ci-12-lnx.cisco.com

Attach files (if any)

No response

vperumal commented 4 months ago

FYI @abdosi

vperumal commented 4 months ago

@liuh-80 - Issue is because of https://github.com/sonic-net/sonic-mgmt/pull/12433. If I comment out from tests.common.fixtures.tacacs import tacacs_creds, setup_tacacs # noqa F401 from the cacl

The behavior reverts back to the original state - cacl/test_cacl_application.py::test_cacl_application_nondualtor[aaa14-lc0-0] instead of PASSED cacl/test_cacl_application.py::test_cacl_scale_rules_ipv6[aaa14-lc1-aaa14-lc1]

Can you please take a look

abdosi commented 4 months ago

@yejianquan for viz

abdosi commented 4 months ago

@arlakshm @judyjoseph for viz.

yejianquan commented 4 months ago

@yejianquan for viz

Thanks for the notice. I did some quick check, this Tacacs change has been cherry-picked into internal-202405, in our last nightly run, we were with it. Bue it doesn't seem to affect our internal test. @cyw233 to double confirm, since he's working on the test module long running issue. image

I suspect it's because in Perumal's env, the tacacs server is not ready so that it keeps waiting? @liuh-80 to double confirm

vperumal commented 4 months ago

Hi @yejianquan , I thought the same so had setup a tacacs server and tried it on our end. But it is not due to tacacs server. The tacacs server fixture is pretty simple, if it is not setup the fixture setups a tacacs server on ptf and continues forward. The issue is seen only on a setup with multiple duts, Not sure the run that you have above has multiple duts or not. When you have multiple duts, it creates combinations for one dut with the other and then the testcases multiplies hence increasing the time it takes for running the test.

If you notice above, with the PR #12433 - For t2 profile tests are repeated with lc0--lc0, lc0_lc1, lc0_lc2 .... so on and so forth, instead of just running on lc0, lc1, lc2. hence number of testcases jump from 45 to 321 for CACL.

cyw233 commented 4 months ago

Hi @yejianquan , I thought the same so had setup a tacacs server and tried it on our end. But it is not due to tacacs server. The tacacs server fixture is pretty simple, if it is not setup the fixture setups a tacacs server on ptf and continues forward. The issue is seen only on a setup with multiple duts, Not sure the run that you have above has multiple duts or not. When you have multiple duts, it creates combinations for one dut with the other and then the testcases multiplies hence increasing the time it takes for running the test.

If you notice above, with the PR #12433 - For t2 profile tests are repeated with lc0--lc0, lc0_lc1, lc0_lc2 .... so on and so forth, instead of just running on lc0, lc1, lc2. hence number of testcases jump from 45 to 321 for CACL.

Checking

cyw233 commented 4 months ago

Hi @vperumal, I did some quick investigation and found that we might be running metafunc.parametrize(..., duts_selected, ...) twice for each test case in cacl/test_cacl_application.py. The first parametrization comes from this line and the second parametrization comes from this line. This is because in our case

I will work with @liuh-80 and the team next week to fix this. Cheers!

vperumal commented 4 months ago

Thanks for taking a look @cyw233. Just to clarify, this affects cacl and many other scripts like arp, crm etc

cyw233 commented 3 months ago

HI @vperumal, we have a PR to fix this: https://github.com/sonic-net/sonic-mgmt/pull/13422

yejianquan commented 3 months ago

Reverted the tacacs change, closing...