sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

orchagent crash after uplink lag flapping #9097

Open yxieca opened 2 years ago

yxieca commented 2 years ago

Description

orchagent crashed after uplink lag flapping

Steps to reproduce the issue:

  1. Run the dhcp_relay test.
  2. orchagent crashes after the uplink flap test case.

Describe the results you received:

Describe the results you expected:

Output of show version:

Output of show techsupport:

```
Oct 26 11:10:50.217289 str2-7215-acs-1 NOTICE swss#orchagent: :- addNextHop: Created next hop 10.0.0.63 on PortChannel0004
Oct 26 11:10:50.223345 str2-7215-acs-1 NOTICE swss#orchagent: :- updatePortOperStatus: Port PortChannel0002 oper state set from down to up
Oct 26 11:10:50.231231 str2-7215-acs-1 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_INVALID_PARAMETER
Oct 26 11:10:50.231587 str2-7215-acs-1 ERR swss#orchagent: :- create: create status: SAI_STATUS_INVALID_PARAMETER
Oct 26 11:10:50.231678 str2-7215-acs-1 ERR swss#orchagent: :- validnexthopinNextHopGroup: Failed to add next hop member to group 50000000006e2: -5
Oct 26 11:10:50.231759 str2-7215-acs-1 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_INVALID_PARAMETER
Oct 26 11:10:50.232396 str2-7215-acs-1 INFO syncd#/supervisord: syncd 11:10:50 SAI: ERROR NEXT_HOP_GROUP xpSaiNextHopGroup.c:1091 : Nexthop group (1407374883553285) already has NH id (1125899906842632). Configuration not supported
Oct 26 11:10:50.232792 str2-7215-acs-1 NOTICE swss#orchagent: :- uninitialize: begin
Oct 26 11:10:50.233262 str2-7215-acs-1 NOTICE swss#orchagent: :- uninitialize: begin
Oct 26 11:10:50.233362 str2-7215-acs-1 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin
Oct 26 11:10:50.233491 str2-7215-acs-1 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread end
Oct 26 11:10:50.233598 str2-7215-acs-1 NOTICE swss#orchagent: :- clear_local_state: clearing local state
Oct 26 11:10:50.233710 str2-7215-acs-1 NOTICE swss#orchagent: :- meta_init_db: begin
Oct 26 11:10:50.254825 str2-7215-acs-1 NOTICE swss#orchagent: :- meta_init_db: end
Oct 26 11:10:50.254825 str2-7215-acs-1 NOTICE swss#orchagent: :- uninitialize: end
Oct 26 11:10:50.254825 str2-7215-acs-1 NOTICE swss#orchagent: :- stopRecording: stopped recording
Oct 26 11:10:50.254825 str2-7215-acs-1 NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec
Oct 26 11:10:50.254825 str2-7215-acs-1 NOTICE swss#orchagent: :- uninitialize: end```
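
To triage a log like the one above, it can help to pull out only the ERR lines together with any SAI status they carry. The following is a minimal sketch, not part of SONiC: the line layout is inferred from the excerpt above, and the names `LINE_RE`, `STATUS_RE`, `parse_line`, and `sai_failures` are hypothetical helpers introduced for illustration.

```python
import re

# Assumed syslog layout, inferred from the excerpt in this issue:
# "<Mon DD HH:MM:SS.ffffff> <host> <LEVEL> <container>#<process>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:.]+) (?P<host>\S+) (?P<level>[A-Z]+) "
    r"(?P<source>[^:\s]+): (?P<msg>.*)$"
)

# Matches SAI status codes such as SAI_STATUS_INVALID_PARAMETER.
STATUS_RE = re.compile(r"SAI_STATUS_[A-Z_]+")


def parse_line(line):
    """Split one syslog line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def sai_failures(lines):
    """Return (timestamp, source, sai_status) for every ERR line.

    sai_status is None when the ERR message carries no SAI_STATUS_* code.
    """
    failures = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["level"] != "ERR":
            continue
        status = STATUS_RE.search(entry["msg"])
        failures.append(
            (entry["ts"], entry["source"], status.group(0) if status else None)
        )
    return failures
```

Feeding the syslog lines from the dump to `sai_failures` yields the ERR entries in order, which makes it easier to see that the first failure is reported by syncd and that orchagent exits shortly afterwards.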

#### Additional information you deem important (e.g. issue happens only occasionally):

[sonic_dump_str2-7215-acs-1_20211026_180502.tar.gz](https://github.com/Azure/sonic-buildimage/files/7434987/sonic_dump_str2-7215-acs-1_20211026_180502.tar.gz)

yxieca commented 2 years ago

The nightly test was running with the following patch to debug the DHCP relay test. The patch performs a config reload before the DHCP relay test, so the failure we encountered here should not be affected by any test that ran before dhcp relay.

```diff
diff --git a/tests/dhcp_relay/test_dhcp_relay.py b/tests/dhcp_relay/test_dhcp_relay.py
index 788cc1e1c..0280a359d 100644
--- a/tests/dhcp_relay/test_dhcp_relay.py
+++ b/tests/dhcp_relay/test_dhcp_relay.py
@@ -11,6 +11,8 @@ from tests.ptf_runner import ptf_runner
 from tests.common.utilities import wait_until
 from tests.common.helpers.dut_utils import check_link_status
 from tests.common.helpers.assertions import pytest_assert
+from tests.common import config_reload
+from tests.common.platform.processes_utils import wait_critical_processes

 pytestmark = [
@@ -25,6 +27,37 @@ DUAL_TOR_MODE = 'dual'

 logger = logging.getLogger(__name__)

+@pytest.fixture(autouse=True, scope="module")
+def debug_dhcp_relay_issue_clean_start(duthosts, rand_one_dut_hostname):
```

yxieca commented 2 years ago

sonic_dump_str2-7215-acs-1_20211026_180502.tar.gz

radha-danda commented 2 years ago

@yxieca, we are not seeing this issue. Can you please share details on how to reproduce it?