sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
736 stars 1.42k forks source link

[xcvrd] xcvrd crashes during breakout port #18893

Open HaiHuang008 opened 6 months ago

HaiHuang008 commented 6 months ago

Description

The ports of the device are all connected to OSFP cables. The first time a port is breakout, xcvrd crashes out and then it initializes all ports again.

xcvrd crashes:

May  7 18:48:08.371367 sonic WARNING pmon#xcvrd: $$$ Ethernet0 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/1', 'index': '1', 'lanes': '9', 'speed': '100000'}
May  7 18:48:08.371500 sonic WARNING pmon#xcvrd: *** Ethernet0CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/1', 'index': '1', 'lanes': '9', 'speed': '100000', 'key': 'Ethernet0', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.371322 sonic INFO ConfigMgmt: Writing in Config DB
May  7 18:48:08.371505 sonic INFO ConfigMgmt: Write in DB: {'PORT': {'Ethernet0': {'alias': 'Eth1/1', 'lanes': '9', 'speed': '100000', 'index': '1'}, 'Ethernet1': {'alias': 'Eth1/2', 'lanes': '10', 'speed': '100000', 'index': '1'}, 'Ethernet2': {'alias': 'Eth1/3', 'lanes': '11', 'speed': '100000', 'index': '1'}, 'Ethernet3': {'alias': 'Eth1/4', 'lanes': '12', 'speed': '100000', 'index': '1'}, 'Ethernet4': {'alias': 'Eth1/5', 'lanes': '13', 'speed': '100000', 'index': '1'}, 'Ethernet5': {'alias': 'Eth1/6', 'lanes': '14', 'speed': '100000', 'index': '1'}, 'Ethernet6': {'alias': 'Eth1/7', 'lanes': '15', 'speed': '100000', 'index': '1'}, 'Ethernet7': {'alias': 'Eth1/8', 'lanes': '16', 'speed': '100000', 'index': '1'}}}
May  7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet1 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/2', 'index': '1', 'lanes': '10', 'speed': '100000'}
May  7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet2 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/3', 'index': '1', 'lanes': '11', 'speed': '100000'}
May  7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet3 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/4', 'index': '1', 'lanes': '12', 'speed': '100000'}
May  7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet4 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/5', 'index': '1', 'lanes': '13', 'speed': '100000'}
May  7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet5 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/6', 'index': '1', 'lanes': '14', 'speed': '100000'}
May  7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet6 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/7', 'index': '1', 'lanes': '15', 'speed': '100000'}
May  7 18:48:08.395630 sonic WARNING pmon#xcvrd: $$$ Ethernet7 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/8', 'index': '1', 'lanes': '16', 'speed': '100000'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet1CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/2', 'index': '1', 'lanes': '10', 'speed': '100000', 'key': 'Ethernet1', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet2CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/3', 'index': '1', 'lanes': '11', 'speed': '100000', 'key': 'Ethernet2', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet3CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/4', 'index': '1', 'lanes': '12', 'speed': '100000', 'key': 'Ethernet3', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet4CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/5', 'index': '1', 'lanes': '13', 'speed': '100000', 'key': 'Ethernet4', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet5CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/6', 'index': '1', 'lanes': '14', 'speed': '100000', 'key': 'Ethernet5', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet6CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/7', 'index': '1', 'lanes': '15', 'speed': '100000', 'key': 'Ethernet6', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.395892 sonic WARNING pmon#xcvrd: *** Ethernet7CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/8', 'index': '1', 'lanes': '16', 'speed': '100000', 'key': 'Ethernet7', 'asic_id': 0, 'op': 'SET'}
May  7 18:48:08.403956 sonic ERR pmon#xcvrd: Exception occured at CmisManagerTask thread due to KeyError(None)
May  7 18:48:08.404851 sonic ERR pmon#xcvrd: Traceback (most recent call last):
May  7 18:48:08.404851 sonic ERR pmon#xcvrd:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1718, in run
May  7 18:48:08.404851 sonic ERR pmon#xcvrd:     self.task_worker()
May  7 18:48:08.404851 sonic ERR pmon#xcvrd:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1428, in task_worker
May  7 18:48:08.404888 sonic ERR pmon#xcvrd:     self.port_dict[lport]['host_tx_ready'] = self.get_host_tx_status(lport)
May  7 18:48:08.404888 sonic ERR pmon#xcvrd:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1300, in get_host_tx_status
May  7 18:48:08.405015 sonic ERR pmon#xcvrd:     state_port_tbl = self.xcvr_table_helper.get_state_port_tbl(asic_index)
May  7 18:48:08.405015 sonic ERR pmon#xcvrd:   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 2632, in get_state_port_tbl
May  7 18:48:08.405015 sonic ERR pmon#xcvrd:     return self.state_port_tbl[asic_id]
May  7 18:48:08.405015 sonic ERR pmon#xcvrd: KeyError: None
May  7 18:48:08.405015 sonic ERR pmon#xcvrd: Xcvrd: exception found at child thread CmisManagerTask due to KeyError(None)
May  7 18:48:08.405033 sonic ERR pmon#xcvrd: Exiting main loop as child thread raised exception!
May  7 18:48:08.414110 sonic INFO pmon#supervisord 2024-05-07 18:48:08,413 INFO exited: xcvrd (terminated by SIGKILL; not expected)

Due to xcvrd crashes, xcvrd is initialized again, and TRANSCEIVER INFO will be traversed and filled again. As a result, the Media type display of the current port is very slow (about two minutes) in show interface status, and it will also affect other ports.

Steps to reproduce the issue:

  1. OSFP cables connected to all ports
  2. From the current port mode of 2x400GBreakout to 8x100G sudo config interface breakout Ethernet0 8x100G -f -y
  3. show interfaces status
    
    admin@sonic:~$ show interfaces status
    Interface            Lanes    Speed    MTU    FEC    Alias    Vlan    Oper    Admin                           Type    Asym PFC
    -----------  ---------------  -------  -----  -----  -------  ------  ------  -------  -----------------------------  ----------
    Ethernet0                9     100G   9100    N/A   Eth1/1  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet1               10     100G   9100    N/A   Eth1/2  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet2               11     100G   9100    N/A   Eth1/3  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet3               12     100G   9100    N/A   Eth1/4  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet4               13     100G   9100    N/A   Eth1/5  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet5               14     100G   9100    N/A   Eth1/6  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet6               15     100G   9100    N/A   Eth1/7  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet7               16     100G   9100    N/A   Eth1/8  routed    down     down  OSFP 8X Pluggable Transceiver         N/A
    Ethernet8          1,2,3,4     400G   9100    N/A   Eth2/1  routed      up       up  OSFP 8X Pluggable Transceiver         N/A
    Ethernet12          5,6,7,8     400G   9100    N/A   Eth2/2  routed      up       up  OSFP 8X Pluggable Transceiver         N/A

admin@sonic:~$ show interfaces breakout current-mode +-------------+-------------------------+ | Interface | Current Breakout Mode | +=============+=========================+ | Ethernet0 | 8x100G | +-------------+-------------------------+ | Ethernet8 | 2x400G | +-------------+-------------------------+ | Ethernet16 | 8x100G | +-------------+-------------------------+ | Ethernet24 | 2x400G | +-------------+-------------------------+ | Ethernet32 | 2x400G | +-------------+-------------------------+

4. Check logs

#### Describe the results you received:

May 7 18:48:08.371367 sonic WARNING pmon#xcvrd: $$$ Ethernet0 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/1', 'index': '1', 'lanes': '9', 'speed': '100000'} May 7 18:48:08.371500 sonic WARNING pmon#xcvrd: Ethernet0CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/1', 'index': '1', 'lanes': '9', 'speed': '100000', 'key': 'Ethernet0', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.371322 sonic INFO ConfigMgmt: Writing in Config DB May 7 18:48:08.371505 sonic INFO ConfigMgmt: Write in DB: {'PORT': {'Ethernet0': {'alias': 'Eth1/1', 'lanes': '9', 'speed': '100000', 'index': '1'}, 'Ethernet1': {'alias': 'Eth1/2', 'lanes': '10', 'speed': '100000', 'index': '1'}, 'Ethernet2': {'alias': 'Eth1/3', 'lanes': '11', 'speed': '100000', 'index': '1'}, 'Ethernet3': {'alias': 'Eth1/4', 'lanes': '12', 'speed': '100000', 'index': '1'}, 'Ethernet4': {'alias': 'Eth1/5', 'lanes': '13', 'speed': '100000', 'index': '1'}, 'Ethernet5': {'alias': 'Eth1/6', 'lanes': '14', 'speed': '100000', 'index': '1'}, 'Ethernet6': {'alias': 'Eth1/7', 'lanes': '15', 'speed': '100000', 'index': '1'}, 'Ethernet7': {'alias': 'Eth1/8', 'lanes': '16', 'speed': '100000', 'index': '1'}}} May 7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet1 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/2', 'index': '1', 'lanes': '10', 'speed': '100000'} May 7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet2 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/3', 'index': '1', 'lanes': '11', 'speed': '100000'} May 7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet3 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/4', 'index': '1', 'lanes': '12', 'speed': '100000'} May 7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet4 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/5', 'index': '1', 'lanes': '13', 'speed': '100000'} May 7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet5 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/6', 'index': '1', 'lanes': '14', 'speed': '100000'} May 7 18:48:08.395578 sonic WARNING pmon#xcvrd: $$$ Ethernet6 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/7', 'index': '1', 'lanes': '15', 'speed': '100000'} May 7 18:48:08.395630 sonic WARNING pmon#xcvrd: $$$ Ethernet7 handle_port_update_event() : op=SET DB:CONFIG_DB Table:PORT fvp {'alias': 'Eth1/8', 'index': '1', 'lanes': '16', 'speed': '100000'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet1CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/2', 'index': '1', 'lanes': '10', 'speed': '100000', 'key': 'Ethernet1', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet2CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/3', 'index': '1', 'lanes': '11', 'speed': '100000', 'key': 'Ethernet2', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet3CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/4', 'index': '1', 'lanes': '12', 'speed': '100000', 'key': 'Ethernet3', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet4CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/5', 'index': '1', 'lanes': '13', 'speed': '100000', 'key': 'Ethernet4', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet5CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/6', 'index': '1', 'lanes': '14', 'speed': '100000', 'key': 'Ethernet5', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet6CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/7', 'index': '1', 'lanes': '15', 'speed': '100000', 'key': 'Ethernet6', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.395892 sonic WARNING pmon#xcvrd: Ethernet7CONFIG_DBPORT handle_port_update_event() fvp {'alias': 'Eth1/8', 'index': '1', 'lanes': '16', 'speed': '100000', 'key': 'Ethernet7', 'asic_id': 0, 'op': 'SET'} May 7 18:48:08.403956 sonic ERR pmon#xcvrd: Exception occured at CmisManagerTask thread due to KeyError(None) May 7 18:48:08.404851 sonic ERR pmon#xcvrd: Traceback (most recent call last): May 7 18:48:08.404851 sonic ERR pmon#xcvrd: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1718, in run May 7 18:48:08.404851 sonic ERR pmon#xcvrd: self.task_worker() May 7 18:48:08.404851 sonic ERR pmon#xcvrd: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1428, in task_worker May 7 18:48:08.404888 sonic ERR pmon#xcvrd: self.port_dict[lport]['host_tx_ready'] = self.get_host_tx_status(lport) May 7 18:48:08.404888 sonic ERR pmon#xcvrd: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1300, in get_host_tx_status May 7 18:48:08.405015 sonic ERR pmon#xcvrd: state_port_tbl = self.xcvr_table_helper.get_state_port_tbl(asic_index) May 7 18:48:08.405015 sonic ERR pmon#xcvrd: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 2632, in get_state_port_tbl May 7 18:48:08.405015 sonic ERR pmon#xcvrd: return self.state_port_tbl[asic_id] May 7 18:48:08.405015 sonic ERR pmon#xcvrd: KeyError: None May 7 18:48:08.405015 sonic ERR pmon#xcvrd: Xcvrd: exception found at child thread CmisManagerTask due to KeyError(None) May 7 18:48:08.405033 sonic ERR pmon#xcvrd: Exiting main loop as child thread raised exception! May 7 18:48:08.414110 sonic INFO pmon#supervisord 2024-05-07 18:48:08,413 INFO exited: xcvrd (terminated by SIGKILL; not expected)

May 7 18:48:09.540293 sonic NOTICE pmon#xcvrd[4227]: Starting up... May 7 18:48:09.540293 sonic NOTICE pmon#xcvrd[4227]: XCVRD INIT: Start daemon init... May 7 18:48:09.670233 sonic NOTICE pmon#xcvrd[4227]: XCVRD INIT: Wait for port config is done May 7 18:48:09.673361 sonic NOTICE pmon#xcvrd[4227]: XCVRD INIT: After port config is done May 7 18:48:09.679986 sonic NOTICE pmon#xcvrd[4227]: Start daemon main loop with thread count 3 May 7 18:48:09.680067 sonic NOTICE pmon#xcvrd[4227]: Started thread CmisManagerTask May 7 18:48:09.680067 sonic NOTICE pmon#xcvrd[4227]: Started thread DomInfoUpdateTask May 7 18:48:09.680077 sonic NOTICE pmon#xcvrd[4227]: Started thread SfpStateUpdateTask

#### Describe the results you expected:

#### Output of `show version`:

admin@sonic:~$ show version

SONiC Software Version: SONiC.DS4101_V2.0.0_1 SONiC OS Version: 11 Distribution: Debian 11.7 Kernel: 5.10.0-23-2-amd64 Build commit: 67ede2e82 Build date: Wed Apr 10 07:05:46 UTC 2024 Built by: autobuild@AZUHPSP10

Platform: x86_64-cel_ds4101-r0 HwSKU: cel_ds4101 ASIC: broadcom ASIC Count: 1 Serial Number: R4030F2B063A25GD200026 Model Number: R4030-F9001-02 Hardware Revision: 6 Uptime: 19:26:13 up 1:38, 2 users, load average: 1.45, 1.52, 1.45 Date: Tue 07 May 2024 19:26:13

Docker images: REPOSITORY TAG IMAGE ID SIZE docker-dhcp-relay latest 43f5d92d0d5f 309MB docker-macsec latest 063cb43bd118 328MB docker-orchagent DS4101_V2.0.0_1 88f1d11531ea 338MB docker-orchagent latest 88f1d11531ea 338MB docker-fpm-frr DS4101_V2.0.0_1 25fef5a4a745 357MB docker-fpm-frr latest 25fef5a4a745 357MB docker-sflow DS4101_V2.0.0_1 7f2b88118acd 327MB docker-sflow latest 7f2b88118acd 327MB docker-nat DS4101_V2.0.0_1 da64867dabff 329MB docker-nat latest da64867dabff 329MB docker-eventd DS4101_V2.0.0_1 1a953e81a85b 300MB docker-eventd latest 1a953e81a85b 300MB docker-snmp DS4101_V2.0.0_1 9d41791169a6 340MB docker-snmp latest 9d41791169a6 340MB docker-platform-monitor DS4101_V2.0.0_1 b1fc016974ff 421MB docker-platform-monitor latest b1fc016974ff 421MB docker-teamd DS4101_V2.0.0_1 d97ecda1ae11 326MB docker-teamd latest d97ecda1ae11 326MB docker-lldp DS4101_V2.0.0_1 0e05f15eea2b 343MB docker-lldp latest 0e05f15eea2b 343MB docker-mux DS4101_V2.0.0_1 252b2ba2def7 349MB docker-mux latest 252b2ba2def7 349MB docker-sonic-gnmi DS4101_V2.0.0_1 2a40110cc077 388MB docker-sonic-gnmi latest 2a40110cc077 388MB docker-database DS4101_V2.0.0_1 79d18c8282dd 301MB docker-database latest 79d18c8282dd 301MB docker-gbsyncd-broncos DS4101_V2.0.0_1 5cea29cd8028 351MB docker-gbsyncd-broncos latest 5cea29cd8028 351MB docker-syncd-brcm DS4101_V2.0.0_1 7e5381b11d8f 675MB docker-syncd-brcm latest 7e5381b11d8f 675MB docker-gbsyncd-credo DS4101_V2.0.0_1 c27a150fd640 323MB docker-gbsyncd-credo latest c27a150fd640 323MB docker-router-advertiser DS4101_V2.0.0_1 28bbd623ce8a 301MB docker-router-advertiser latest 28bbd623ce8a 301MB docker-sonic-mgmt-framework DS4101_V2.0.0_1 303355e5ea69 417MB docker-sonic-mgmt-framework latest 303355e5ea69 417MB

admin@sonic:~$

#### Output of `show techsupport`:

(paste your output here or download and attach the file here )



#### Additional information you deem important (e.g. issue happens only occasionally):

<!--
     Also attach debug file produced by `sudo generate_dump`
-->
neethajohn commented 6 months ago

@prgeor , can you please take a look?

ishidawataru commented 4 months ago

Hi, I reproduced this issue on the VS environment.

The issue was that xcvrd's CmitManagerTask was not taking care of port addition and removal events, that occur from DPB configuration change.

https://github.com/sonic-net/sonic-platform-daemons/pull/500 fixes the issue on my environment.

Since I don't have access to the real hardware that supports DPB and CMIS transceivers, I enhanced the VS environment to support DPB and run xcvrd for reproduction and testing.

@HaiHuang008 It would be great if you could check whether this fix resolves the issue on your platform.

HaiHuang008 commented 4 months ago

@ishidawataru Thanks for your fix, this error no longer appears. But another error appears, causing xcvrd to crash.

image

ishidawataru commented 4 months ago

@HaiHuang008 Thanks for testing and I'm sorry that the original fix didn't work. I think I could fix the problem of the original fix. I updated the PR Could you try again?

HaiHuang008 commented 4 months ago

@ishidawataru Thanks again. This issue is resolved. No xcvrd crash when port breakout. Check_port_breakout.txt

ishidawataru commented 4 months ago

@HaiHuang008 Thanks for testing. I'll try to get the PR merged.