sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[xcvrd] [cmis manager] CMIS manager cannot automatically select correct host lane count when selecting module application #19336

Open Junchao-Mellanox opened 1 month ago

Junchao-Mellanox commented 1 month ago

Description

When the CMIS manager is enabled, the following configuration causes the port link to go down:

speed: 100G
lanes: 0,1,2,3,4,5,6,7
module supported applications: 100GAUI-2, 400GAUI-8

The CMIS manager deduces a host lane count of 8 from "0,1,2,3,4,5,6,7" and then looks for an application that matches speed 100G and host lane count 8. It cannot find one, because the only 100G application the module supports is 100GAUI-2, which uses 2 host lanes.
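
The matching described above can be illustrated with a minimal sketch. This is not the xcvrd code: the `MODULE_APPS` table and the `find_app_strict` function are hypothetical names made up for illustration, and speeds are written in Mb/s to match the `host_speed 100000` value in the error log.

```python
# Hypothetical application table: name -> (host speed in Mb/s, host lane count).
MODULE_APPS = {
    "100GAUI-2": (100000, 2),
    "400GAUI-8": (400000, 8),
}


def find_app_strict(apps, speed, lanes):
    """Return the first application matching both the speed and the full lane count."""
    host_lane_count = len(lanes.split(","))
    for name, (app_speed, app_lanes) in apps.items():
        if app_speed == speed and app_lanes == host_lane_count:
            return name
    return None


print(find_app_strict(MODULE_APPS, 400000, "0,1,2,3,4,5,6,7"))  # 400GAUI-8
print(find_app_strict(MODULE_APPS, 100000, "0,1,2,3,4,5,6,7"))  # None
print(find_app_strict(MODULE_APPS, 100000, "0,1"))              # 100GAUI-2
```

The second call reproduces the failure in this issue; the third corresponds to the `lanes: 0,1` workaround.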

The CMIS manager should be smart enough to choose 100GAUI-2 automatically by matching 100G with 2 host lanes.

A workaround for this issue is to set lanes to:

lanes: 0,1

But there is no CLI to set port lanes.

Steps to reproduce the issue:

  1. Start with a port configured with speed=400G and lanes="0,1,2,3,4,5,6,7"; the link is up
  2. Change the port speed to 100G: config interface speed EthernetX 100G
  3. The link goes down

Describe the results you received:

The link is down, and xcvrd logs the following error:

Jun 18 11:06:25.350031 sonic ERR pmon#xcvrd: CMIS: Ethernet240: no suitable app for the port appl None host_lane_count 8 host_speed 100000

Describe the results you expected:

xcvrd should be able to automatically choose the best application possible by using the current speed and a subset of the lanes.

ishidawataru commented 1 month ago

If we have a port and a corresponding transceiver with the following configuration and capability, which app should xcvrd choose for the port?

speed: 100G
lanes: 0,1,2,3,4,5,6,7
module supported applications: 100GAUI-2, 100GAUI-4

Junchao-Mellanox commented 4 weeks ago

This is a good question. From my POV, we should try lane counts in the order 8->4->2->1, so I would prefer 100GAUI-4 in this case. @prgeor, @mihirpat1, what do you think?
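
As a rough illustration of the 8->4->2->1 order proposed here, the following sketch halves the candidate lane count until an application matches. The function name, the application tables, and the halving rule for lane counts that are not powers of two are all assumptions made for this example; it is not the implementation in xcvrd or in any PR.

```python
# Hypothetical application tables: name -> (host speed in Mb/s, host lane count).
APPS_ISSUE = {"100GAUI-2": (100000, 2), "400GAUI-8": (400000, 8)}
APPS_QUESTION = {"100GAUI-2": (100000, 2), "100GAUI-4": (100000, 4)}


def find_app_with_fallback(apps, speed, lanes):
    """Try the configured lane count first, then halve it (8 -> 4 -> 2 -> 1)."""
    count = len(lanes.split(","))
    while count >= 1:
        for name, (app_speed, app_lanes) in apps.items():
            if app_speed == speed and app_lanes == count:
                return name, count
        count //= 2  # assumption: simple halving, as in the proposed order
    return None, 0


lanes = "0,1,2,3,4,5,6,7"
print(find_app_with_fallback(APPS_ISSUE, 100000, lanes))     # ('100GAUI-2', 2)
print(find_app_with_fallback(APPS_QUESTION, 100000, lanes))  # ('100GAUI-4', 4)
```

With this order the widest matching application wins, which is why 100GAUI-4 is preferred over 100GAUI-2 in the second call.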

ishidawataru commented 4 weeks ago

Does configuring the breakout setting solve the problem?

As for the example you showed,

speed: 100G
lanes: 0,1,2,3,4,5,6,7
module supported applications: 100GAUI-2, 400GAUI-8

How about setting the breakout configuration like below?

$ config interface breakout Ethernet240 "4x100G"

prgeor commented 3 weeks ago

@ishidawataru The CMIS manager selects the application based on the contents of the CONFIG_DB PORT table. Please share your CONFIG_DB PORT table dump for the 100G speed here.

ishidawataru commented 3 weeks ago

@prgeor The current CMIS manager implementation searches for a module application that matches both the speed and the host lane count. @Junchao-Mellanox is pointing out that this behavior causes the port link to go down with the following configuration and needs improvement.

When CMIS manager is enabled, following configuration will cause port link down:

speed: 100G
lanes: 0,1,2,3,4,5,6,7
module supported applications: 100GAUI-2, 400GAUI-8

I initially agreed and implemented https://github.com/sonic-net/sonic-platform-daemons/pull/507 as a draft PR. However, I later realized that we can change the host lane count by configuring dynamic port breakout (DPB), which should also fix the problem without any modification to the current implementation.

Currently, I'm waiting for @Junchao-Mellanox's response.

Junchao-Mellanox commented 3 weeks ago

Hi @ishidawataru, DPB is not a perfect solution for this. As far as I know, DPB has many limitations; for example, it cannot automatically adjust other port-related configuration when a breakout is applied. Also, it is not user friendly to ask a SONiC user to perform an extra DPB configuration step when they hit this issue.

ishidawataru commented 2 weeks ago

@Junchao-Mellanox Does the SAI require any modification to support this? What happens when the port is configured as 100G with lanes 0,1,2,3,4,5,6,7 on a platform with 50G per lane, for example? Will lanes 0 and 1 be used in this case?

Junchao-Mellanox commented 2 weeks ago

Hi @ishidawataru, it depends on how each vendor implements this. Currently, I don't see a problem on the NVIDIA platform regarding SAI.

ishidawataru commented 2 weeks ago

@Junchao-Mellanox How does the NVIDIA SAI choose the lanes to use for a speed configuration if there are multiple choices? When the switch ASIC supports multiple lane speeds, I think we have the same problem that I mentioned above for the switch ASIC.

If we have a port and a corresponding transceiver with the following configuration and capability, which app should xcvrd choose for the port?

speed: 100G
lanes: 0,1,2,3,4,5,6,7
module supported applications: 100GAUI-2, 100GAUI-4

Does the NVIDIA SAI choose the lane counts as you mentioned?

From my POV, we should try lane number 8->4->2->1.

If that is the case, does it make sense to spec this behavior in SAI so that the xcvrd implementation can work with non-NVIDIA SAI?

Junchao-Mellanox commented 2 weeks ago

Hi @ishidawataru, this is not a problem for the ASIC-side configuration. The user can choose how many lanes the ASIC uses. Here is the SONiC command:

config interface type Ethernet0 CR2 (2 means 2 lanes)

ishidawataru commented 2 weeks ago

@Junchao-Mellanox I see. In that case, can xcvrd use that configuration as a hint when choosing the module application?
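
One way to picture this suggestion is the sketch below, which reads the trailing digit of the interface type (for example `CR2`) as a lane-count hint and falls back to the configured lane count when no hint is available. The helper names, the application table, and the rule of taking the trailing digits as the lane count are assumptions for illustration only; nothing here reflects how xcvrd actually reads the port configuration.

```python
import re

# Hypothetical application table: name -> (host speed in Mb/s, host lane count).
APPS = {"100GAUI-2": (100000, 2), "100GAUI-4": (100000, 4)}


def lane_hint_from_type(interface_type):
    """Return the trailing number of an interface type such as 'CR2', or None."""
    match = re.search(r"(\d+)$", interface_type or "")
    return int(match.group(1)) if match else None


def find_app_with_hint(apps, speed, lanes, interface_type=None):
    """Match on speed and the hinted lane count, else the configured lane count."""
    configured = len(lanes.split(","))
    hint = lane_hint_from_type(interface_type)
    # Ignore a hint that asks for more lanes than the port has.
    wanted = hint if hint is not None and hint <= configured else configured
    for name, (app_speed, app_lanes) in apps.items():
        if app_speed == speed and app_lanes == wanted:
            return name
    return None


lanes = "0,1,2,3,4,5,6,7"
print(find_app_with_hint(APPS, 100000, lanes, "CR2"))  # 100GAUI-2
print(find_app_with_hint(APPS, 100000, lanes, "CR4"))  # 100GAUI-4
print(find_app_with_hint(APPS, 100000, lanes))         # None
```

With a hint, the ambiguous 100GAUI-2 versus 100GAUI-4 case is resolved by the user's own configuration instead of a fixed preference order.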