sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
194 stars 713 forks source link

ipfwd/test_nhop_count.py fails on T1-64-lag #4739

Open rraguraj opened 2 years ago

rraguraj commented 2 years ago

In test script nhop_group_limit is set tp 1024

In cisco platform, available nexthop_group is more that 1024 as shown below

Resource Name           Used Count    Available Count

--------------------  ------------  -----------------

nexthop_group                    4              **16380**

After utilizing the routes

Resource Name           Used Count    Available Count

--------------------  ------------  -----------------

nexthop_group                 1038              **15346**

Test expects crm available next hop routes to 0(to utillize all nhop available routes). However test uses only 1034 nhop routes. so test failed.

verify the test used up all the NHOP group resources

    # skip this check on Mellanox as ASIC resources are shared

    if not is_mellanox_device(duthost):

        pytest_assert(

            crm_after["available"] == 0,

            "Unused NHOP group resource: {}, used:{}".format(
              crm_after["available"], crm_after["used"]
            )

E Failed: Unused NHOP group resource: 15346, used:1038

Please let us know Why the nhop_group_limit is restricted to 1024, Since other platforms can handle much more nexthop_groups

sachinv-msft commented 2 years ago

@yxieca any help on who can take this issue for review?

yxieca commented 2 years ago

@prsunny can you help?

sachinv-msft commented 2 years ago

@prsunny could you please help on this issue?

sachinv-msft commented 2 years ago

@prsunny could you please help on this issue?

prsunny commented 2 years ago

agree that this would fail for platforms that doesn't return MAX_COUNT

max_nhop = switch_capability.get("MAX_NEXTHOP_GROUP_COUNT")

Even if this returns, i dont think we can test by such high number. Request @smaheshm, original author to comment on this.

smaheshm commented 2 years ago

Ignore my previous comment, misunderstood the issue. Some platforms don't have the 'MAX_NEXTHOP_GROUP_COUNT', so the limit is set to 1k as we are aware they do not support more than that.

the script can create > 16k nhop groups. Have you tried with 16k NHOP groups.

If orchagent can handle it I don't see any issue, but the script is run for older releases too where the code is different.

I think we can add platform specific check and set the max nhop group accordingly.

Once all platfomrs support "MAX_NEXTHOP_GROUP_COUNT", this check can be removed.

smaheshm commented 2 years ago

https://github.com/Azure/sonic-mgmt/blob/c3bcea879a5e06543ac7fdb18642f3a1c3ce3069/tests/ipfwd/test_nhop_count.py#L289

we can get rid of this variable: nhop_group_count and use "max_nhop + extra_nhops"

balajib-cisco commented 2 years ago

Maybe, we are mixing things a bit here. switch_capability.get("MAX_NEXTHOP_GROUP_COUNT") querying SAI attribute SAI_SWITCH_ATTR_NUMBER_OF_ECMP_GROUPS and CRM nexthop_group querying SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_ENTRY. Can't we just get the nexthop_group available count directly from CRM show CLI output and use it?

smaheshm commented 2 years ago

Maybe, we are mixing things a bit here. switch_capability.get("MAX_NEXTHOP_GROUP_COUNT") querying SAI attribute SAI_SWITCH_ATTR_NUMBER_OF_ECMP_GROUPS and CRM nexthop_group querying SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_ENTRY. Can't we just get the nexthop_group available count directly from CRM show CLI output and use it?

It doesn't work with all platforms since some platforms share resources. e.g. Mellanox for which the check for CRM NHOP resource exhaustion is skipped.

balajib-cisco commented 2 years ago

Maybe, we are mixing things a bit here. switch_capability.get("MAX_NEXTHOP_GROUP_COUNT") querying SAI attribute SAI_SWITCH_ATTR_NUMBER_OF_ECMP_GROUPS and CRM nexthop_group querying SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_ENTRY. Can't we just get the nexthop_group available count directly from CRM show CLI output and use it?

It doesn't work with all platforms since some platforms share resources. e.g. Mellanox for which the check for CRM NHOP resource exhaustion is skipped.

This test case is semantically wrong as it is trying to cause resource exhaustion for route nexthop group based on what is returned for ecmp group, i.e.) reading from one SAI counter to cause resource exhaustion based on different SAI resource counter. I am not sure of other non-mellanox platform but in Cisco-8000 platform, switch_capability.get("MAX_NEXTHOP_GROUP_COUNT") going to return 8192 (number of ECMP groups supported) and CRM resource counter shown will be 16380(number of nexthop group supported.

smaheshm commented 2 years ago

There is no switch capability for next hop resource count, only nexthop group count. Using CRM resources is not reliable. In your case removing line '289' will work. Add platform specific checks if required.