Description
Currently, the CmisManagerTask thread crashes when it encounters an unhandled exception, which causes the entire XCVRD process to restart. These crashes are seen most often when a read of a transceiver's EEPROM fails.
Crash snippet
Apr 1 13:39:29.652330 STG01-0101-0200-02T2-lc01 ERR pmon#: Exception occured at CmisManagerTask thread due to TypeError("'NoneType' object is not subscriptable")
Apr 1 13:39:29.654469 STG01-0101-0200-02T2-lc01 ERR pmon#: Traceback (most recent call last):
Apr 1 13:39:29.654498 STG01-0101-0200-02T2-lc01 ERR pmon#: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1693, in run
Apr 1 13:39:29.654498 STG01-0101-0200-02T2-lc01 ERR pmon#: self.task_worker()
Apr 1 13:39:29.654498 STG01-0101-0200-02T2-lc01 ERR pmon#: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1655, in task_worker
Apr 1 13:39:29.654518 STG01-0101-0200-02T2-lc01 ERR pmon#: if not self.check_datapath_state(api, host_lanes_mask, ['DataPathInitialized']):
Apr 1 13:39:29.654531 STG01-0101-0200-02T2-lc01 ERR pmon#: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1263, in check_datapath_state
Apr 1 13:39:29.654531 STG01-0101-0200-02T2-lc01 ERR pmon#: if dpstate[key] not in states:
Apr 1 13:39:29.654569 STG01-0101-0200-02T2-lc01 ERR pmon#: TypeError: 'NoneType' object is not subscriptable
Apr 1 13:39:29.654752 STG01-0101-0200-02T2-lc01 ERR pmon#: Xcvrd: exception found at child thread CmisManagerTask due to TypeError("'NoneType' object is not subscriptable")
Apr 1 13:39:29.654752 STG01-0101-0200-02T2-lc01 ERR pmon#: Exiting main loop as child thread raised exception!
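The traceback above ends in a subscript on a `None` value. A minimal sketch of that failure mode follows, assuming the datapath-state read returns `None` when EEPROM access fails; the function signature and the key format here are illustrative stand-ins and are not copied from xcvrd.py.

```python
def check_datapath_state(dpstate, lanes, states):
    """Return True if every listed lane is in one of the expected states.

    `dpstate` stands in for the dictionary read from the module; it is
    assumed to be None when the EEPROM read fails.
    """
    for lane in lanes:
        key = "DP{}State".format(lane + 1)  # illustrative key format
        if dpstate[key] not in states:      # raises TypeError when dpstate is None
            return False
    return True


# Healthy read: every lane reports the expected state.
ok = check_datapath_state({"DP1State": "DataPathInitialized"}, [0], ["DataPathInitialized"])

# Failed read: the subscript on None raises the TypeError seen in the log.
try:
    check_datapath_state(None, [0], ["DataPathInitialized"])
    raised = False
except TypeError:
    raised = True
```

Without a handler around the call, the exception propagates out of the worker loop and ends the thread, which is what triggers the XCVRD restart.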
Motivation and Context
To avoid restarting XCVRD when the CmisManagerTask thread hits an exception, this PR moves the CMIS state machine (SM) to CMIS_STATE_FAILED for each port that raised an exception. As a result, if module EEPROM access fails for one or more ports, only those ports transition to CMIS_STATE_FAILED and the rest of XCVRD keeps running.
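The per-port isolation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual change in xcvrd.py: `process_port`, `flaky_step`, the `ports` dictionary, and the `DP_INIT` value are hypothetical names chosen for the example.

```python
import traceback

CMIS_STATE_INSERTED = "INSERTED"
CMIS_STATE_FAILED = "FAILED"


def process_port(lport, port_dict, step):
    """Run one state-machine step for a port.

    Any exception is logged and confined to this port, which is moved
    to the FAILED state instead of letting the worker thread die.
    """
    try:
        step(lport, port_dict[lport])
    except Exception as e:
        print("CMIS: {}: internal errors due to {!r}".format(lport, e))
        traceback.print_exc()
        port_dict[lport]["cmis_state"] = CMIS_STATE_FAILED


def flaky_step(lport, info):
    """Hypothetical step that fails for one port, like a bad EEPROM read."""
    if lport == "Ethernet0":
        raise KeyError("simulated EEPROM read failure")
    info["cmis_state"] = "DP_INIT"  # illustrative next state


ports = {name: {"cmis_state": CMIS_STATE_INSERTED} for name in ("Ethernet0", "Ethernet8")}
for lport in ports:
    process_port(lport, ports, flaky_step)
```

In this sketch the loop continues past the failing port, so a fault on one port does not block the others from advancing.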
How Has This Been Tested?
An exception was raised manually while the CMIS SM was in CMIS_STATE_INSERTED, and XCVRD was confirmed not to crash. The port's cmis_state in the state DB read FAILED, as shown below.
Apr 30 08:58:01.283582 sonic NOTICE pmon#xcvrd[16173]: CMIS: Ethernet0: 400G, lanemask=0xff, state=INSERTED, appl 1 host_lane_count 8 retries=0
Apr 30 08:58:01.283582 sonic ERR pmon#xcvrd[16173]: CMIS: Ethernet0: internal errors due to 'PATELMI: Simulated KeyError!!!'
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]: Traceback (most recent call last):
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]: File "/usr/local/lib/python3.11/dist-packages/xcvrd/xcvrd.py", line 1404, in task_worker
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]: raise KeyError("PATELMI: Simulated KeyError!!!")
Apr 30 08:58:01.285296 sonic ERR pmon#xcvrd[16173]: KeyError: 'PATELMI: Simulated KeyError!!!'
root@sonic:/home/admin# redis-cli -n 6 hget "TRANSCEIVER_STATUS|Ethernet0" cmis_state
"FAILED"
root@sonic:/home/admin#
CMIS initialization also succeeded on the same port once the simulated exception was no longer raised.
Additional Information (Optional)
MSFT ADO - 27441561