sonic-net / sonic-swss

SONiC Switch State Service (SwSS)
https://azure.github.io/SONiC

[202405][dualtor] Orchagent is going down during switchover #3298

Open vkjammala-arista opened 2 days ago

vkjammala-arista commented 2 days ago

Description

When performing a switchover (say active to standby or vice versa), we observe the orchagent process going down, which leaves the mux status in an inconsistent state.

Based on observations from the debug logs, we suspected that using the bulker to program routes/neighbors during switchover (introduced by PR https://github.com/sonic-net/sonic-swss/pull/3148) is the problem, and we confirmed this by re-running the tests after reverting the PR changes.
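For readers unfamiliar with the bulker approach, the sketch below illustrates the general pattern visible in the logs further down (create_entry per route, a single bulk create, then waiting for per-entry statuses) in plain Python. This is only an illustration of the pattern under suspicion, not the actual C++ EntityBulker code from orchagent; all names in it are made up.

```python
# Simplified, illustrative sketch of the bulk-programming pattern (not orchagent code):
# entries are queued with create_entry() and programmed in one flush(), after which
# the caller has to wait for and inspect per-entry statuses (cf. waitForBulkResponse).
class Bulker:
    def __init__(self, program_batch):
        self._program_batch = program_batch  # callable that programs a whole batch
        self._pending = []

    def create_entry(self, entry):
        # Queue the entry instead of programming it immediately.
        self._pending.append(entry)

    def flush(self):
        # One bulk call for all queued entries.
        statuses = self._program_batch(self._pending)
        self._pending.clear()
        return statuses

# Usage sketch: queue the two routes seen in the logs below, then flush once.
bulker = Bulker(lambda entries: ["SUCCESS"] * len(entries))
bulker.create_entry(("192.168.0.44", "nh 400000000167a"))
bulker.create_entry(("fc02:1000::2c", "nh 400000000167a"))
print(bulker.flush())
```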

Steps to reproduce the issue:

  1. Run any sonic-mgmt test (e.g. tests/dualtor_io/test_link_failure.py) that performs a switchover, for example via the toggle_all_simulator_ports_to_rand_selected_tor fixture or a similar fixture that performs the switchover during test setup (a minimal illustrative test skeleton is sketched below).
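A minimal test skeleton along these lines might look like the sketch below. Only the toggle_all_simulator_ports_to_rand_selected_tor fixture name comes from the step above; the test name, the other fixture names and the body are assumptions for illustration, not code from sonic-mgmt.

```python
# Hypothetical minimal sonic-mgmt style test: the toggle fixture performs the
# switchover during setup, before the test body runs.
def test_switchover_sketch(toggle_all_simulator_ports_to_rand_selected_tor,
                           duthosts, rand_one_dut_hostname):
    # At this point the mux simulator has already been asked to toggle all ports
    # to the selected ToR; a real test would verify traffic and mux state here.
    duthost = duthosts[rand_one_dut_hostname]
    assert duthost is not None
```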

Describe the results you received:

  1. Tests will fail with "Failed to toggle all ports to <tor_device> from mux simulator", as the mux status is left in an inconsistent state.
    def _toggle_all_simulator_ports_to_target_dut(target_dut_hostname, duthosts, mux_server_url, tbinfo):
        """Helper function to toggle all ports to active on the target DUT."""
        ...
        if not is_toggle_done and \
                not utilities.wait_until(120, 10, 0, _check_toggle_done, duthosts, target_dut_hostname, probe=True):
    >           pytest_assert(False, "Failed to toggle all ports to {} from mux simulator".format(target_dut_hostname))
    E           Failed: Failed to toggle all ports to ld301 from mux simulator
  2. The orchagent process in the swss docker container will be down (this can be verified with ps aux inside the swss container; a small scripted check is sketched below).
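The check in point 2 can be scripted; the helper below is a small sketch that wraps the same "ps aux inside the swss container" check, assuming docker exec is available on the DUT. It is illustrative only and not part of any existing tooling.

```python
# Hypothetical helper: returns True if orchagent shows up in "ps aux" inside the
# swss container, i.e. the process is still running.
import subprocess

def orchagent_running(container: str = "swss") -> bool:
    out = subprocess.run(
        ["docker", "exec", container, "ps", "aux"],
        capture_output=True, text=True, check=True,
    ).stdout
    return any("orchagent" in line for line in out.splitlines())

if __name__ == "__main__":
    print("orchagent running:", orchagent_running())
```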

Describe the results you expected:

Switchover should have completed without any failures.

Additional information you deem important:

Some of the debug logs captured during the switchover:

2024 Sep 18 17:47:12.847339 gd377 NOTICE swss#orchagent: :- nbrHandler: Processing neighbors for mux Ethernet200, enable 0, state 2
2024 Sep 18 17:47:12.847339 gd377 INFO swss#orchagent: :- updateRoutes: Updating routes pointing to multiple mux nexthops
...
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- addRoutes: Adding route entry 192.168.0.44, nh 400000000167a to bulker
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- create_entry: EntityBulker.create_entry 1, 2, 1
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- addRoutes: Adding route entry fc02:1000::2c, nh 400000000167a to bulker
2024 Sep 18 17:47:12.851834 gd377 INFO swss#orchagent: :- create_entry: EntityBulker.create_entry 2, 2, 1
2024 Sep 18 17:47:12.851834 gd377 DEBUG swss#orchagent: :> redis_bulk_create_route_entry: enter
2024 Sep 18 17:47:12.851834 gd377 DEBUG swss#orchagent: :> bulkCreate: enter
...
...
2024 Sep 18 17:47:12.881418 gd377 DEBUG swss#orchagent: :> waitForBulkResponse: enter
...
...
2024 Sep 18 17:47:12.886416 gd377 DEBUG swss#orchagent: :- processReply: got message: ["switch_shutdown_request","{\"switch_id\":\"oid:0x21000000000000\"}"]
...
...
2024 Sep 18 17:48:12.935572 gd377 DEBUG swss#orchagent: :> on_switch_shutdown_request: enter
2024 Sep 18 17:48:12.935597 gd377 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
2024 Sep 18 17:48:12.946670 gd377 INFO swss#supervisord 2024-09-18 17:48:12,945 WARN exited: orchagent (exit status 1; not expected)

Based on the debug logs captured during multiple test runs, we suspected that the use of the entity bulker is causing orchagent to go down for some reason. We then re-ran the tests with PR https://github.com/sonic-net/sonic-swss/pull/3148 ([muxorch] Using bulker to program routes/neighbors during switchover) reverted, and the tests pass.

yxieca commented 1 day ago

@prsunny @Ndancejic can you assess this issue?

yxieca commented 1 day ago

@bingwang-ms FYI