sonic-net / sonic-platform-daemons

Platform module daemons for SONiC
Other
23 stars 152 forks source link

Xcvrd crash and restart should not cause link flap on platforms needing custom NPU SI settings #541

Closed mihirpat1 closed 1 week ago

mihirpat1 commented 3 weeks ago

Description

Ensure Xcvrd crash and restart should be handled gracefully on platforms that need custom SI settings. The goal is to avoid link flap on platforms that need custom SI settings during xcvrd restart.

Motivation and Context

Changeset 1 The changeset is partial implementation of the changes required for https://github.com/sonic-net/SONiC/pull/1432/. Specifically, the current change adds NPU_SI_SETTINGS_SYNC_STATUS key to the PORT_TABLE of STATE_DB. Following is the usage of NPU_SI_SETTINGS_SYNC_STATUS key

Value Modifier thread and event Consumer thread and purpose
NPU_SI_SETTINGS_DEFAULT 1. XCVRD main thread during cold start of XCVRD 2. SfpStateUpdateTask during transceiver removal XCVRD main thread during boot-up for deciding to notify NPU SI settings
NPU_SI_SETTINGS_NOTIFIED 1. SfpStateUpdateTask while updating and notifying the NPU SI settings Not being used currently

Please note that the above table needs to be further modified while implementing complete OA crash HLD. The table assumes NPU SI settings are applied from SfpStateUpdateTask thread always (unlike the HLD wherein CMIS transceivers have the NPU SI settings being sent from CmisManagerTask).

Changeset 2 In addition to the above change, TRANSCEIVER_INFO table will not be deleted as part of xcvrd deinit. On some platforms, TRANSCEIVER_INFO table is used to detect transceiver presence to handle host_tx_ready behavior and hence, deleting TRANSCEIVR_INFO table will cause host_tx_ready to change to false during xcvrd shutdown.

Impact due to skipping TRANSCEIVER_INFO table deletion

Scenario
In case if xcvrd crashes permanently after initializing 6 out of 32 ports (i.e. xcvrd never respawns), then "show int status" CLI will show 6 transceiver present and other ports would not show xcvr present unless we use sfputil. 

Changeset 3 PHYSICAL_PORT_NOT_EXIST has been defined in media_settings_parser.py to prevent crash due to undefined value. When the media_settings_parser.py file was created, the below line was moved from xcvrd.py. However, the definition of PHYSICAL_PORT_NOT_EXIST was never ported. Hence, adding the definition now. https://github.com/sonic-net/sonic-platform-daemons/blob/8c89f6ba2975a7699861f2c6d77083cebb62e97c/sonic-xcvrd/xcvrd/xcvrd_utilities/media_settings_parser.py#L280

How Has This Been Tested?

On platforms which do not restart xcvrd upon swss/syncd crash/restart, following behavior is observed The first xcvrd crash after swss/syncd restarts will cause a link flap for ports which require NPU SI settings. This is expected since as part of swss crash handling, the entire PORT_TABLE of STATE_DB is deleted. Since xcvrd is not restarted as part of swss/syncd crash handling, the NPU_SI_SETTINGS_SYNC_STATUS field is never populated. Hence, when xcvrd crashes/restarts first time after swss/syncd crash is triggered, xcvrd will update the NPU SI settings in the APPL_DB which in turn forces the port to go through a link flap while OA configures the NPU SI settings

Test summary

  1. Platforms with xcvrd not restarting during swss/syncd crash
Event Xcvrd restarted NPU SI settings renotify NPU_SI_SETTINGS_SYNC_STATUS value after action (for optics which require NPU SI settings) Link flap
Config shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Config no shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Xcvrd restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Xcvrd kill Yes No NPU_SI_SETTINGS_NOTIFIED No
Crash xcvrd just before updating NPU_SI_SETTINGS_SYNC_STATUS Yes Yes NPU_SI_SETTINGS_NOTIFIED N/A
Pmon restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Swss restart No No Key is deleted Yes
Syncd restart No No Key is deleted Yes
config reload Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
Cold reboot N/A No Inline with expected value N/A
  1. Platforms with xcvrd restarting during swss/syncd crash
Event Xcvrd restarted NPU SI settings renotify NPU_SI_SETTINGS_SYNC_STATUS value after action (for optics which require NPU SI settings) Link flap
Config shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Config no shut N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Xcvrd restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Xcvrd kill Yes No NPU_SI_SETTINGS_NOTIFIED No
Crash xcvrd just before updating NPU_SI_SETTINGS_SYNC_STATUS Yes Yes NPU_SI_SETTINGS_NOTIFIED N/A
Pmon restart Yes No NPU_SI_SETTINGS_NOTIFIED No
Swss restart Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
Syncd restart Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
config reload Yes Yes NPU_SI_SETTINGS_NOTIFIED Yes
Cold reboot N/A No NPU_SI_SETTINGS_NOTIFIED N/A
Warm reboot N/A No NPU_SI_SETTINGS_NOTIFIED No
  1. Verified transceiver OIR

    Additional Information (Optional)

    MSFT ADO - 29278409

mihirpat1 commented 1 week ago

@bingwang-ms Can you please help to cherry-pick this PR to 202405 branch? MSFT ADO - 29278409

mssonicbld commented 1 week ago

Cherry-pick PR to 202405: https://github.com/sonic-net/sonic-platform-daemons/pull/547