sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[DNX] Orchagent/Syncd crash due to `ECMP hash offset set failed with error -2` #19059

Closed arista-nwolfe closed 3 months ago

arista-nwolfe commented 4 months ago

On the latest master build on DNX platforms, we're seeing the orchagent and syncd containers crash due to an unsupported SAI call.

ERR syncd1#syncd: [07:00.0] SAI_API_SWITCH:brcm_sai_set_switch_attribute:5078 ECMP hash offset set failed with error -2.
ERR syncd1#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
ERR syncd1#syncd: :- processQuadEvent: VID: oid:0x121000000000001 RID: oid:0x8850012100000000
ERR syncd1#syncd: :- processQuadEvent: attr: SAI_SWITCH_ATTR_ECMP_DEFAULT_HASH_OFFSET: 10
ERR swss1#orchagent: :- set: set status: SAI_STATUS_NOT_SUPPORTED
ERR swss1#orchagent: :- doAppSwitchTableTask: Failed to set switch attribute ecmp_hash_offset to 10, rv:-2
ERR swss1#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_SWITCH, status: SAI_STATUS_NOT_SUPPORTED
NOTICE swss1#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
NOTICE syncd1#syncd: :- processNotifySyncd: Invoking SAI failure dump
NOTICE swss1#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded
WARNING syncd0#syncd: message repeated 59 times: [ [06:00.0] SAI_API_UNSPECIFIED:sai_bulk_object_get_stats:748 Unsupported object type type 1]
NOTICE syncd0#syncd: :- threadFunction: time span 81 ms for 'start_poll:FABRIC_PORT_STAT_COUNTER:oid:0x10000000000b8'
INFO swss1#supervisord 2024-05-23 20:20:15,295 WARN exited: orchagent (terminated by SIGABRT (core dumped); not expected)

The current DNX SAI on master is 10.1.15, and the implementation of this SAI call shows that it is not supported on DNX:

_brcm_sai_loadbalance_ecmp_hash_offset_set(unsigned int val)
{
    int rv;
    unsigned int hashf_offset;

    BRCM_SAI_LOG_SWITCH(SAI_LOG_LEVEL_DEBUG, "Ecmp hash offset set %u", val);
    if (DEV_IS_DNX())
    {
        return SAI_STATUS_NOT_SUPPORTED;
    }

The SAI call was added to orchagent by https://github.com/sonic-net/sonic-buildimage/pull/18912

kenneth-arista commented 4 months ago

@arlakshm @ysmanman for awareness

lguohan commented 4 months ago

@kperumalbfn, can you check this one? I thought the orchagent change was not merged, so why is the buildimage change causing the breakage?

kperumalbfn commented 4 months ago

@arista-nwolfe @kenneth-arista The sonic-swss PR https://github.com/sonic-net/sonic-swss/pull/3138/files checks the SAI attribute capability before invoking the set_switch_attribute API. This PR is already merged.

Based on the above code snippet from the SDK, the BCM SDK returns 'true' for the attribute capability query but then returns failure for the set_switch_attribute API, and that is incorrect. Could you update the SDK so that the capability query returns unsupported or not_implemented on DNX platforms for the 2 SAI attributes? That should avoid this switch initialization crash.
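To make the mismatch described here concrete, the following is a minimal, self-contained C sketch of a capability query that disagrees with the set call, and of a caller that checks the capability before setting. Every name in it (`mock_switch_t`, `mock_query_set_capability`, `mock_set_ecmp_hash_offset`, `mock_apply_ecmp_hash_offset`, the `MOCK_STATUS_*` values) is hypothetical and stands in for the real SAI and orchagent code, which is not reproduced in this thread. The `capability_is_fixed` flag models one way the inconsistency could be resolved, as requested above; it is not a description of Broadcom's actual fix.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real SAI or Broadcom definitions. */
#define MOCK_STATUS_SUCCESS        0
#define MOCK_STATUS_NOT_SUPPORTED (-2)  /* same value as "rv:-2" in the log */

typedef struct {
    bool is_dnx;              /* true on a DNX platform */
    bool capability_is_fixed; /* true: capability query agrees with the set call */
} mock_switch_t;

/* Capability query. The buggy variant reports "settable" even on DNX. */
static bool mock_query_set_capability(const mock_switch_t *sw)
{
    if (sw->is_dnx && sw->capability_is_fixed) {
        return false;
    }
    return true;
}

/* Set call. It always rejects the attribute on DNX, like the quoted snippet. */
static int mock_set_ecmp_hash_offset(const mock_switch_t *sw, unsigned int val)
{
    (void)val; /* the value does not matter for this sketch */
    return sw->is_dnx ? MOCK_STATUS_NOT_SUPPORTED : MOCK_STATUS_SUCCESS;
}

/* Caller-side guard: issue the set only when the capability query allows it.
 * Returns true if the caller can continue, false if the failure would be
 * treated as fatal (the orchagent exit seen in the log). */
static bool mock_apply_ecmp_hash_offset(const mock_switch_t *sw, unsigned int val)
{
    if (!mock_query_set_capability(sw)) {
        printf("ecmp_hash_offset not supported, skipping set\n");
        return true;
    }

    int rv = mock_set_ecmp_hash_offset(sw, val);
    if (rv != MOCK_STATUS_SUCCESS) {
        printf("set ecmp_hash_offset to %u failed, rv:%d\n", val, rv);
        return false;
    }
    return true;
}
```

With `is_dnx = true` and `capability_is_fixed = false`, the guard passes but the set fails, which corresponds to the crash path in this issue. Once the capability query and the set call agree, the caller skips the attribute and initialization continues.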

lguohan commented 4 months ago

@kenneth-arista, can you help create a CSP and ask brcm to fix it?

arista-nwolfe commented 4 months ago

Created CS00012352219 to track this

arlakshm commented 4 months ago

@mlok-nokia @saksarav-nokia for viz...

arista-nwolfe commented 4 months ago

Broadcom has a fix that Arista has confirmed works. Broadcom will add this fix to the next 10.x SAI.

kenneth-arista commented 3 months ago

DNX SAI 10.1.20 has the fix.

rlhui commented 3 months ago

Arista confirmed the issue is fixed.