Ensure Xcvrd crash and restart should be handled gracefully on platforms that need custom SI settings.
The goal is to avoid link flap on platforms that need custom SI settings during xcvrd restart.
Motivation and Context
Changeset 1
The changeset is partial implementation of the changes required for https://github.com/sonic-net/SONiC/pull/1432/.
Specifically, the current change adds NPU_SI_SETTINGS_SYNC_STATUS key to the PORT_TABLE of STATE_DB. Following is the usage of NPU_SI_SETTINGS_SYNC_STATUS key
Value
Modifier thread and event
Consumer thread and purpose
NPU_SI_SETTINGS_DEFAULT
1. XCVRD main thread during cold start of XCVRD 2. SfpStateUpdateTask during transceiver removal
XCVRD main thread during boot-up for deciding to notify NPU SI settings
NPU_SI_SETTINGS_NOTIFIED
1. SfpStateUpdateTask while updating and notifying the NPU SI settings
Not being used currently
Please note that the above table needs to be further modified while implementing complete OA crash HLD. The table assumes NPU SI settings are applied from SfpStateUpdateTask thread always (unlike the HLD wherein CMIS transceivers have the NPU SI settings being sent from CmisManagerTask).
Changeset 2
In addition to the above change, TRANSCEIVER_INFO table will not be deleted as part of xcvrd deinit.
On some platforms, TRANSCEIVER_INFO table is used to detect transceiver presence to handle host_tx_ready behavior and hence, deleting TRANSCEIVR_INFO table will cause host_tx_ready to change to false during xcvrd shutdown.
Impact due to skipping TRANSCEIVER_INFO table deletion
Scenario
In case if xcvrd crashes permanently after initializing 6 out of 32 ports (i.e. xcvrd never respawns), then "show int status" CLI will show 6 transceiver present and other ports would not show xcvr present unless we use sfputil.
On platforms which do not restart xcvrd upon swss/syncd crash/restart, following behavior is observed
The first xcvrd crash after swss/syncd restarts will cause a link flap for ports which require NPU SI settings.
This is expected since as part of swss crash handling, the entire PORT_TABLE of STATE_DB is deleted.
Since xcvrd is not restarted as part of swss/syncd crash handling, the NPU_SI_SETTINGS_SYNC_STATUS field is never populated. Hence, when xcvrd crashes/restarts first time after swss/syncd crash is triggered, xcvrd will update the NPU SI settings in the APPL_DB which in turn forces the port to go through a link flap while OA configures the NPU SI settings
Test summary
Platforms with xcvrd not restarting during swss/syncd crash
Event
Xcvrd restarted
NPU SI settings renotify
NPU_SI_SETTINGS_SYNC_STATUS value after action (for optics which require NPU SI settings)
Link flap
Config shut
N/A
No
NPU_SI_SETTINGS_NOTIFIED
N/A
Config no shut
N/A
No
NPU_SI_SETTINGS_NOTIFIED
N/A
Xcvrd restart
Yes
No
NPU_SI_SETTINGS_NOTIFIED
No
Xcvrd kill
Yes
No
NPU_SI_SETTINGS_NOTIFIED
No
Crash xcvrd just before updating NPU_SI_SETTINGS_SYNC_STATUS
Yes
Yes
NPU_SI_SETTINGS_NOTIFIED
N/A
Pmon restart
Yes
No
NPU_SI_SETTINGS_NOTIFIED
No
Swss restart
No
No
Key is deleted
Yes
Syncd restart
No
No
Key is deleted
Yes
config reload
Yes
Yes
NPU_SI_SETTINGS_NOTIFIED
Yes
Cold reboot
N/A
No
Inline with expected value
N/A
Platforms with xcvrd restarting during swss/syncd crash
Event
Xcvrd restarted
NPU SI settings renotify
NPU_SI_SETTINGS_SYNC_STATUS value after action (for optics which require NPU SI settings)
Link flap
Config shut
N/A
No
NPU_SI_SETTINGS_NOTIFIED
N/A
Config no shut
N/A
No
NPU_SI_SETTINGS_NOTIFIED
N/A
Xcvrd restart
Yes
No
NPU_SI_SETTINGS_NOTIFIED
No
Xcvrd kill
Yes
No
NPU_SI_SETTINGS_NOTIFIED
No
Crash xcvrd just before updating NPU_SI_SETTINGS_SYNC_STATUS
Description
Ensure Xcvrd crash and restart should be handled gracefully on platforms that need custom SI settings. The goal is to avoid link flap on platforms that need custom SI settings during xcvrd restart.
Motivation and Context
Changeset 1 The changeset is partial implementation of the changes required for https://github.com/sonic-net/SONiC/pull/1432/. Specifically, the current change adds
NPU_SI_SETTINGS_SYNC_STATUS
key to the PORT_TABLE of STATE_DB. Following is the usage ofNPU_SI_SETTINGS_SYNC_STATUS
keyPlease note that the above table needs to be further modified while implementing complete OA crash HLD. The table assumes NPU SI settings are applied from SfpStateUpdateTask thread always (unlike the HLD wherein CMIS transceivers have the NPU SI settings being sent from CmisManagerTask).
Changeset 2 In addition to the above change,
TRANSCEIVER_INFO
table will not be deleted as part of xcvrd deinit. On some platforms,TRANSCEIVER_INFO
table is used to detect transceiver presence to handlehost_tx_ready
behavior and hence, deletingTRANSCEIVR_INFO
table will causehost_tx_ready
to change tofalse
during xcvrd shutdown.Impact due to skipping TRANSCEIVER_INFO table deletion
Changeset 3
PHYSICAL_PORT_NOT_EXIST
has been defined in media_settings_parser.py to prevent crash due to undefined value. When the media_settings_parser.py file was created, the below line was moved from xcvrd.py. However, the definition ofPHYSICAL_PORT_NOT_EXIST
was never ported. Hence, adding the definition now. https://github.com/sonic-net/sonic-platform-daemons/blob/8c89f6ba2975a7699861f2c6d77083cebb62e97c/sonic-xcvrd/xcvrd/xcvrd_utilities/media_settings_parser.py#L280How Has This Been Tested?
On platforms which do not restart xcvrd upon swss/syncd crash/restart, following behavior is observed The first xcvrd crash after swss/syncd restarts will cause a link flap for ports which require NPU SI settings. This is expected since as part of swss crash handling, the entire PORT_TABLE of STATE_DB is deleted. Since xcvrd is not restarted as part of swss/syncd crash handling, the NPU_SI_SETTINGS_SYNC_STATUS field is never populated. Hence, when xcvrd crashes/restarts first time after swss/syncd crash is triggered, xcvrd will update the NPU SI settings in the APPL_DB which in turn forces the port to go through a link flap while OA configures the NPU SI settings
Test summary
Additional Information (Optional)
MSFT ADO - 29278409