sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
718 stars 1.38k forks source link

[sonic-platform-daemons] - Xcvrd CMIS Manager state machine is getting improperly reset during progression #11953

Open snider-nokia opened 2 years ago

snider-nokia commented 2 years ago

Description

Xcvrd CMIS Manager state machine is getting reset to CMIS_STATE_INSERTED from within on_port_update_event() method unnecessarily/improperly.

PR sonic-platform-daemons/#285 should be applied before attempting to reproduce and fix this issue.

Steps to reproduce the issue:

  1. Using a minimally CMIS 4.0 compliant transceiver module and associated port that is admin enabled, connect to an admin enabled remote peer port so that link (eventually) comes up.
  2. Save the configuration.
  3. Reboot local switch (line card), while leaving remote peer switch up and remote peer switch port admin enabled, and then monitor syslog PMON output during bringup.

Describe the results you received:

CMIS Manager state machine begins operating on the protagonist port transceiver and when it reaches CMIS_STATE_DP_INIT the module has its datapath initialized/configured. When this happens, if the remote peer interface is admin up then the link will come up. This will cause syncd originated Redis table update of port to oper 'up' state. The associated port update message that gets promulgated to Xcvrd will then cause Xcvrd to improperly reset the CMIS Manager state machine to CMIS_STATE_INSERTED.

Also of note here is a secondary unexpected behavior where, somehow, a subsequent port update message is promulgated to Xcvrd that shows the port as admin 'down' even though the actual admin state is 'up' and should never have changed. The genesis of this admin 'down' port update message is not currently understood and should be traced backward to its source for reconciliation.

Describe the results you expected:

CMIS Manager should ignore the oper state attribute/field in any arriving port update message, and should only reset the associated module to CMIS_STATE_INSERTED when admin state has toggled either from up-to-down or from down-to-up. Consecutive up-to-up and down-to-down admin status indications should be ignored/debounced.

Note that the above comment is made in a vacuum as relates to only oper_status and admin_status attributes/fields, and does not consider the situation where other germane transceiver related configuration attributes/fields (for example, 'laser_freq') are changed and thus may distinctly require reset of the state machine to CMIS_STATE_INSERTED (irrespective of admin status state).

Germane PMON logging is in black and my annotations in red.

image

Improper state machine reset happens here at line 1007

image

judyjoseph commented 2 years ago

@prgeor is taking a look

judyjoseph commented 1 year ago

@snider-nokia, please update any issues/results you notice in the tests with this PR - https://github.com/sonic-net/sonic-platform-daemons/pull/285

snider-nokia commented 1 year ago

The situation is much better with the latest changes made on Sept. 7th at [sonic-net/sonic-platform-daemons/pull/285]. I'd suggest those changes should be merged. The only problem remaining is that those latest changes at [sonic-net/sonic-platform-daemons/pull/285] do not build under 2205 (build target target/python-wheels/bullseye/sonic_xcvrd-1.0-py3-none-any.whl fails). Building of the aforementioned target needs to be fixed.

snider-nokia commented 1 year ago

Latest Sept. 13th changes at [sonic-net/sonic-platform-daemons/pull/285] resolve the 2205 build problem.
Target sonic_xcvrd-1.0-py3-none-any.whl now builds properly.