sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Port status not reflected in SONiC #4646

Closed ciju-juniper closed 3 years ago

ciju-juniper commented 4 years ago

Description

This issue is observed in the latest SONiC images. When cables are connected to or removed from the switch ports, the interface status is not correctly reflected in 'show interfaces status'. The link status is updated correctly on the Broadcom side, but the link-state changes are not propagated to the kernel: the files '/sys/class/net/Ethernet*/carrier' are not updated when an OIR is done. After a system reboot, the link status is updated in the kernel and SONiC reports the status correctly.

We tried to narrow this problem down to a specific kernel version, but SONiC builds from a few weeks or months back are broken because certain package versions are missing. The last commit on which we did not observe this issue was from Feb-26, and quite significant changes have happened in the kernel's 'drivers/net' directory in the span of 1.5 months.

Please have a look at this issue and suggest how to debug it further.

Testing environment

Switch: QFX5200-32C-S
ASIC: TH1
Branch: master

The link is configured at 100G, and the DAC cable is connected back to back between ports on the same switch. No platform-specific drivers are loaded apart from the ASIC configuration files in the 'device' directory.

Here are the logs from the problematic image (May 20th Jenkins image):

Trial 1

100G DAC Cable connected to port 0 and port 1 after the box is rebooted.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed      up       up     N/A         N/A
  Ethernet8      57,58,59,60     100G   9100   hundredGigE1/3  routed    down       up     N/A         N/A
  ....

bcmcmd also shows the link up in the BCM ASIC for these two ports:

root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
root@sonic:/home/admin# bcmcmd ps | grep ce13
      ce13( 54)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

The "carrier" parameter is also updated in the "sys" directory for the ports connected with cables:

root@sonic:/home/admin# cat /sys/class/net/Ethernet4/carrier
1
root@sonic:/home/admin# cat /sys/class/net/Ethernet0/carrier
1

Trial 2

100G DAC cable connected between port 0 and port 31.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed      up       up     N/A         N/A
  ...
  Ethernet120       9,10,11,12     100G   9100  hundredGigE1/31  routed    down       up     N/A         N/A
  Ethernet124      13,14,15,16     100G   9100  hundredGigE1/32  routed    down       up     N/A         N/A

bcmcmd also shows the link up in the BCM ASIC for these two ports:

root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
root@sonic:/home/admin# bcmcmd ps | grep ce3
       ce3( 13)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
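
For anyone repeating this check across many ports, the `bcmcmd ps` rows above can be reduced to a port-name to link-state map. This is only a sketch: the regular expression is inferred from the rows quoted in this issue, not from Broadcom documentation, and `parse_ps` is a hypothetical helper name.

```python
import re

# Matches the leading "name( port) state" part of a `bcmcmd ps` row,
# e.g. "      ce12( 50)  up     4  100G ...". The pattern is an assumption
# based only on the rows shown in this issue.
PS_ROW = re.compile(r"^\s*(?P<name>[a-z]+\d+)\(\s*(?P<port>\d+)\)\s+(?P<link>\S+)")


def parse_ps(text):
    """Return {port_name: link_state} for every line that looks like a ps row."""
    states = {}
    for line in text.splitlines():
        match = PS_ROW.match(line)
        if match:
            states[match.group("name")] = match.group("link")
    return states


SAMPLE_PS = """\
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
       ce3( 13)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
      xe36( 51)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No
"""

print(parse_ps(SAMPLE_PS))
```

The resulting map can then be compared by hand against the Oper column of 'show interfaces status' for the corresponding Ethernet ports.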

"carrier" parameter is not updated in "sys" directory for the ports connected with cables, still showing carrier up for port 1

root@sonic:/home/admin# cat /sys/class/net/Ethernet124/carrier
0
root@sonic:/home/admin# cat /sys/class/net/Ethernet4/carrier
1

Here is the dump from the problematic image:

sonic_dump_sonic_20200526_071154.tar.gz

Last working commit from master branch: 1ef740361c4ee4e00cadc29fe2228ac5712afefd

Here are the logs from the working kernel image:

Trial 1

DAC cable is connected between physical port-0 & port-1 and system is rebooted.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed      up       up     N/A         N/A

root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No
root@sonic:/home/admin# bcmcmd ps | grep ce13
      ce13( 54)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

root@sonic:/home/admin# cat /sys/class/net/Ethernet4/carrier
1
root@sonic:/home/admin# cat /sys/class/net/Ethernet0/carrier
1

Trial 2

100G DAC cable connected between port 0 and port 31.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU            Alias    Vlan    Oper    Admin    Type    Asym PFC
-----------  ---------------  -------  -----  ---------------  ------  ------  -------  ------  ----------
  Ethernet0      49,50,51,52     100G   9100   hundredGigE1/1  routed      up       up     N/A         N/A
  Ethernet4      53,54,55,56     100G   9100   hundredGigE1/2  routed    down       up     N/A         N/A
  .....
  Ethernet120    9,10,11,12      100G   9100  hundredGigE1/31  routed    down       up     N/A         N/A
  Ethernet124    13,14,15,16     100G   9100  hundredGigE1/32  routed      up       up     N/A         N/A

root@sonic:/home/admin# bcmcmd ps | grep ce13
      ce13( 54)  down   4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

root@sonic:/home/admin# bcmcmd ps | grep ce12
      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

root@sonic:/home/admin# bcmcmd ps | grep ce3
       ce3( 13)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No

root@sonic:/home/admin# cat /sys/class/net/Ethernet124/carrier
1
root@sonic:/home/admin# cat /sys/class/net/Ethernet0/carrier
1

Here is the dump from the working image:

sonic_dump_sonic_20200522_143911.tar.gz

ciju-juniper commented 4 years ago

@lguohan @jleveque Please have a look and let me know how to proceed further. We are also trying the 201911 branch and will update with the test results shortly.

ciju-juniper commented 4 years ago

In the 201911 branch, the issue is not seen.

rlhui commented 4 years ago

@ciju-juniper, it is still not quite clear what the issue is. Please clarify which behavior is unexpected, and clean up the formatting. Thanks.

ciju-juniper commented 4 years ago

@rlhui When cables are connected to or removed from the switch ports, the link status is not updated in the kernel. The files '/sys/class/net/Ethernet*/carrier' are not updated when the OIR is done. Because of this, 'show interfaces status' does not reflect the link status.
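
A small sketch of how the carrier files could be snapshotted before and after an OIR, so that the stale entries stand out. The helper names `read_carrier` and `changed` are hypothetical, and the sysfs root is a parameter only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_carrier(sysfs_net="/sys/class/net", prefix="Ethernet"):
    """Return {interface: "1", "0" or None} read from <sysfs_net>/<iface>/carrier."""
    result = {}
    for iface_dir in sorted(Path(sysfs_net).glob(prefix + "*")):
        try:
            result[iface_dir.name] = (iface_dir / "carrier").read_text().strip()
        except OSError:
            # The kernel rejects the read while the interface is not running,
            # and the file may be missing in a test directory.
            result[iface_dir.name] = None
    return result


def changed(before, after):
    """Return {interface: (old, new)} for every interface whose carrier differs."""
    return {
        name: (before.get(name), after[name])
        for name in after
        if before.get(name) != after[name]
    }
```

Calling `read_carrier()` once before moving the cable and once after, then passing both snapshots to `changed`, should list the ports whose carrier actually flipped. On the problematic image that result would be expected to come back empty.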

lguohan commented 4 years ago

@ciju-juniper, can you provide the dump file taken after trial 2 on the bad image? I need to look at the sairedis log to see whether a port status notification is generated. Your current dump seems to be from trial 1.

As you can see, it only has up notifications. When you unplug Ethernet4, we should receive a port down notification from SAI, which I cannot find in the sairedis log.

lgh@gulv-vm3:~/sonic_dump_sonic_20200526_071154/log$ grep OPER_STATUS sairedis.rec 
2020-05-26.06:53:11.592558|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-26.06:53:11.592925|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ee|SAI_HOSTIF_ATTR_OPER_STATUS=true
2020-05-26.06:53:11.595581|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-26.06:53:11.595802|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ef|SAI_HOSTIF_ATTR_OPER_STATUS=true

ciju-juniper commented 4 years ago

@lguohan The dump was captured after Trial-2. As you can see, '/sys/class/net/Ethernet4/carrier' still reports '1' even after the cable was pulled out.

We can try recreating the problem and capturing the dump again. That needs a day or two, as we don't have physical access to the box these days.

Could you tell me which kernel driver detects the link status change and updates '/sys/class/net/Ethernet4/carrier'?

lguohan commented 4 years ago

The mismatch here is that the broadcom.ps file says ce12 and ce13 are up. In trial 2, I think ce3 should be up, according to your good trial log. So it seems that after you swapped the cable, ce3 is not up in the broadcom.ps file.

That is why I am not sure you have provided the right capture after trial 2.

      ce12( 50)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No      
      xe36( 51)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No      
      xe37( 52)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No      
      xe38( 53)  !ena   1   25G  FD None  No   Disable          None   FA  XGMII  9412    No      
      ce13( 54)  up     4  100G  FD   SW  No   Forward          None    F    KR4  9122    No      

Ethernet4 is handled by the Broadcom knet driver. When a link goes down, SAI sends a notification to orchagent, and orchagent calls another SAI API to set the carrier down. In your log, I do not see SAI sending the notification.
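
The notification-then-set sequence described here can be checked mechanically against a sairedis.rec file. The sketch below is an assumption-laden illustration: the record layout is inferred only from the lines quoted in this thread, the helper names are hypothetical, and the pairing rule (each notification should be followed by a hostif OPER_STATUS set before the next notification) is a simplification of what orchagent does.

```python
import json

SAMPLE_REC = [
    '2020-05-26.06:53:11.592558|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '2020-05-26.06:53:11.592925|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ee|SAI_HOSTIF_ATTR_OPER_STATUS=true',
    '2020-05-26.06:53:11.595581|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '2020-05-26.06:53:11.595802|s|SAI_OBJECT_TYPE_HOSTIF:oid:0xd0000000005ef|SAI_HOSTIF_ATTR_OPER_STATUS=true',
]


def parse_events(lines):
    """Return ('notify', port_oid, state) and ('hostif_set', hostif_oid, value) tuples."""
    events = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 4:
            continue
        op, key, payload = fields[1], fields[2], fields[3]
        if op == "n" and key == "port_state_change":
            for entry in json.loads(payload):
                events.append(("notify", entry["port_id"], entry["port_state"]))
        elif (op == "s" and key.startswith("SAI_OBJECT_TYPE_HOSTIF:")
              and payload.startswith("SAI_HOSTIF_ATTR_OPER_STATUS=")):
            events.append(("hostif_set", key.split(":", 1)[1], payload.split("=", 1)[1]))
    return events


def unanswered_notifications(events):
    """Count notifications with no hostif OPER_STATUS set before the next notification."""
    unanswered = 0
    pending = False
    for kind, _oid, _value in events:
        if kind == "notify":
            if pending:
                unanswered += 1
            pending = True
        else:
            pending = False
    return unanswered + (1 if pending else 0)


events = parse_events(SAMPLE_REC)
print(len(events), "events,", unanswered_notifications(events), "unanswered notifications")
```

Note that this only works on the unfiltered file. Output that was already filtered with grep for SAI_PORT_OPER_STATUS_ drops the hostif lines, so every notification would look unanswered.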

ciju-juniper commented 4 years ago

@lguohan Please see the attached test logs and the dump from the bad image. Test_logs_28_05_2020.txt sonic_dump_sonic_20200528_055104.tar.gz

ciju-juniper commented 4 years ago

@lguohan Looks like there are no SAI notifications after Trial-2:

crajank-mbp:log crajank$ grep -i "SAI_PORT_OPER_STATUS_" sairedis.rec 
2020-05-28.05:38:10.297727|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-28.05:38:10.300046|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-28.05:40:23.912578|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-05-28.05:40:23.914670|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2020-05-28.05:40:28.943032|n|port_state_change|[{"port_id":"oid:0x1000000000005","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2020-05-28.05:40:28.945893|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
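
To make a capture like this easier to read, the notifications can be grouped into a per-port history, whose last entry is the state SAI most recently reported for that OID. This is a sketch under the same assumption as before, namely that the record layout is exactly what the lines above show; `port_history` is a hypothetical name, and the mapping from port OID to Ethernet interface is not available in this thread, so it is left out.

```python
import json

PREFIX = "SAI_PORT_OPER_STATUS_"

SAMPLE_NOTIFICATIONS = [
    '2020-05-28.05:38:10.297727|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '2020-05-28.05:38:10.300046|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '2020-05-28.05:40:23.912578|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|',
    '2020-05-28.05:40:23.914670|n|port_state_change|[{"port_id":"oid:0x100000000000f","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|',
    '2020-05-28.05:40:28.943032|n|port_state_change|[{"port_id":"oid:0x1000000000005","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    '2020-05-28.05:40:28.945893|n|port_state_change|[{"port_id":"oid:0x100000000000e","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
]


def port_history(lines):
    """Return {port_oid: [state, ...]} in the order the notifications were recorded."""
    history = {}
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 4 or fields[1] != "n" or fields[2] != "port_state_change":
            continue
        for entry in json.loads(fields[3]):
            state = entry["port_state"]
            if state.startswith(PREFIX):
                state = state[len(PREFIX):]  # keep only UP / DOWN
            history.setdefault(entry["port_id"], []).append(state)
    return history


for oid, states in port_history(SAMPLE_NOTIFICATIONS).items():
    print(oid, "->", " ".join(states), "(last:", states[-1] + ")")
```

The last state per OID is the value to compare against the bcmcmd ps output taken at the same time.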

lguohan commented 4 years ago

So it looks like a Broadcom SAI issue. If you bring down the port on the other side, will SAI generate the notification?

ciju-juniper commented 4 years ago

@lguohan We tested by connecting DAC cables in loopback within the same box.

Are you asking us to do an 'ifconfig Ethernet0 down' in this loopback setup, or to connect the cables between two boxes and check?

lguohan commented 4 years ago

Basically, I am asking you to simulate a port oper status down and check whether SAI sends a notification to the upper layer.

BaluAlluru commented 4 years ago

We simulated the port oper status down by executing the command below to bring port 1 down: bcmcmd "port ce13 Enable=False".

We are seeing the issue in the latest SONiC image. The attached jenkins290_logs file has the logs captured with the latest SONiC Jenkins 290 image on the box. In that log, the issue is seen after Trial 5: SAI stops sending notifications to the upper layer after a few iterations. We repeated the same exercise of disabling and enabling the port multiple times and checked the sairedis logs each time; port state change events are not seen there.

We also loaded the "201911" release image on the box and repeated the same exercise. We don't see the issue with this image: SAI sends notifications to the upper layer. We did many iterations of simulating link up/down and didn't observe the issue. The attached 201911_logs file has the logs for reference.

201911_logs.txt jenkins290_logs.txt

ciju-juniper commented 4 years ago

@lguohan What could be the next steps to debug this issue further?

ciju-juniper commented 4 years ago

@lguohan How do we take this issue forward? Could we talk to BRCM to see what is going wrong with SAI?

ciju-juniper commented 4 years ago

@rlhui I see a mail from you regarding SAI 1.6.2 release. Would you know if this problem is fixed in that?

ciju-juniper commented 4 years ago

@smaheshm I see your commit for SAI 1.6.1. Are you aware of this issue?

smaheshm commented 4 years ago

> @smaheshm I see your commit for SAI 1.6.1. Are you aware of this issue?

No, not aware of this issue. Let me check the status of 1.6.2 release. Will open a case with BRCM if required.

ciju-juniper commented 4 years ago

@smaheshm We verified this issue with SAI 1.6.1. The problem is still seen. Please open a case with BRCM. This is very easy to recreate and very basic.

smaheshm commented 4 years ago

> @smaheshm We verified this issue with SAI 1.6.1. Problem is still seen. Please open a case with BRCM. This is very easy to re-create and very basic.

@ciju-juniper Stay tuned for the updated BRCM SAI Debian package. We have to check whether the issue persists with the updated package and then open a case.

rlhui commented 4 years ago

@ciju-juniper, would you please confirm whether this issue is seen ONLY with cables connecting ports within the same device? Have you tried the same steps with cables connecting ports of two separate boxes? Thanks.

ciju-juniper commented 4 years ago

@rlhui We tried with a different switch at the other end. The issue is seen when the remote is down or the cable is pulled out from the other end: the port status does not change in SONiC and no SAI events are detected.

gechiang commented 4 years ago

@ciju-juniper I have tried to reproduce this issue using the master.320 image and flapped the link over 30 times, but was not able to hit the issue you reported. May I ask for your help with the following:

  1. If possible, please use the latest master branch image (320, for example) and try to reproduce this issue. If it is still reproducible, please gather the following information for me:
     a). Drop into the BCM shell by executing the following cmd: bcmsh
     b). Issue the cmd "bsv" to collect the SAI version output.
     c). Issue the cmd "show unit" to collect the ASIC version output.
     d). Issue the cmd "ver" to collect the BRCM SDK version output.

  2. Please try using a cable other than the one with which you were able to reproduce the issue, and see whether the issue still persists.

  3. Go back to the 201911 image that you used and collect the SONiC image version by doing "show version".

  4. Drop down to the BCM shell and collect the same info as I stated above under a) through d).

Please perform the above and collect the info that I requested so that I can see whether I can reproduce it and work with the BRCM SAI team to solve this issue. Thanks!

ciju-juniper commented 4 years ago

@gechiang The problem is seen on the master.320 image. We tried to recreate the issue with different boxes and different cables, with the same behaviour.

Our test is very simple. Initially, a 100G DAC cable was connected between Ethernet8 & Ethernet12.

root@sonic:/home/admin# cat /var/log/swss/sairedis.rec | grep -i "SAI_PORT_OPER_STATUS_"
2019-02-14.10:12:43.713628|n|port_state_change|[{"port_id":"oid:0x1000000000016","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2019-02-14.10:12:43.716446|n|port_state_change|[{"port_id":"oid:0x1000000000018","port_state":"SAI_PORT_OPER_STATUS_UP"}]|

After that, the cable was moved to Ethernet16. Now it is connected between Ethernet8 & Ethernet16.

root@sonic:/home/admin# show interfaces status
  Interface            Lanes    Speed    MTU    FEC          Alias    Vlan    Oper    Admin             Type    Asym PFC
-----------  ---------------  -------  -----  -----  -------------  ------  ------  -------  ---------------  ----------
  Ethernet0      73,74,75,76     100G   9100    N/A   hundredGigE1  routed    down       up              N/A         N/A
  Ethernet4      65,66,67,68     100G   9100    N/A   hundredGigE2  routed    down       up              N/A         N/A
  Ethernet8      81,82,83,84     100G   9100    N/A   hundredGigE3  routed      up       up  QSFP28 or later         N/A
 Ethernet12      89,90,91,92     100G   9100    N/A   hundredGigE4  routed      up       up              N/A         N/A
 Ethernet16  105,106,107,108     100G   9100    N/A   hundredGigE5  routed    down       up  QSFP28 or later         N/A

There are no SAI event messages for the port down and up events. 'show interfaces status' still shows ports Ethernet8 & Ethernet12 as up, even though the cable no longer connects them.
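
As a hedged illustration, the table above can be parsed so that the Oper column is compared programmatically instead of by eye. The sketch assumes that columns are always separated by at least two spaces (true for the output pasted in this issue, where only "Asym PFC" and "QSFP28 or later" contain single spaces); `parse_interfaces_status` is a hypothetical helper, not part of sonic-utilities.

```python
import re

SAMPLE_STATUS = """\
  Interface            Lanes    Speed    MTU    FEC          Alias    Vlan    Oper    Admin             Type    Asym PFC
-----------  ---------------  -------  -----  -----  -------------  ------  ------  -------  ---------------  ----------
  Ethernet0      73,74,75,76     100G   9100    N/A   hundredGigE1  routed    down       up              N/A         N/A
  Ethernet4      65,66,67,68     100G   9100    N/A   hundredGigE2  routed    down       up              N/A         N/A
  Ethernet8      81,82,83,84     100G   9100    N/A   hundredGigE3  routed      up       up  QSFP28 or later         N/A
 Ethernet12      89,90,91,92     100G   9100    N/A   hundredGigE4  routed      up       up              N/A         N/A
 Ethernet16  105,106,107,108     100G   9100    N/A   hundredGigE5  routed    down       up  QSFP28 or later         N/A
"""


def parse_interfaces_status(text):
    """Return {interface: {column: value}} from 'show interfaces status' output."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = {}
    for line in lines[1:]:
        if set(line.strip()) <= {"-", " "}:
            continue  # separator row made of dashes
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != len(header):
            continue  # elided rows such as "...."
        row = dict(zip(header, fields))
        rows[row["Interface"]] = row
    return rows


rows = parse_interfaces_status(SAMPLE_STATUS)
print([name for name, row in rows.items() if row["Oper"] == "up"])
```

With the cable between Ethernet8 and Ethernet16, the expected set of oper-up ports would be those two, which is what the BCM shell reports.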

We can see that the ports (Ethernet8 & Ethernet16) are up in the BCM shell:

ce20( 38)  up     4  100G  FD   SW  No   Forward          None    F  CAUI4  9122    No
ce26( 44)  up     4  100G  FD   SW  No   Forward          None    F  CAUI4  9122    No

ciju-juniper commented 4 years ago

@gechiang Here is the output that you asked for, from the master.320 image. This is from a TH2-based platform.

admin@sonic:~$ show version

SONiC Software Version: SONiC.master.320-0d809d0d
Distribution: Debian 10.4
Kernel: 4.19.0-6-2-amd64
Build commit: 0d809d0d
Build date: Tue Jun 23 19:03:25 UTC 2020
Built by: johnar@jenkins-worker-4

Platform: x86_64-juniper_qfx5210-r0
HwSKU: Juniper-QFX5210-64C
ASIC: broadcom
Serial Number: YJ0219370007
Uptime: 11:06:03 up 54 min,  2 users,  load average: 1.93, 1.82, 1.73

1.b). Issue the cmd "bsv" to collect the SAI version output.

drivshell>bsv
bsv
BRCM SAI ver: [3.7.4.2], OCP SAI ver: [1.6.0], SDK ver: [6.5.16]

1.c). Issue the cmd "show unit" to collect the ASIC version output.

drivshell>show unit
show unit
Unit 0 chip BCM56970_B0 (current)

1.d). Issue the cmd "ver" to collect the BRCM SDK version output.

drivshell>ver
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200604 (Thu Jun  4 18:14:20 2020)
From sonicbld@eaa190965a84:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7.4.2/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:

       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,

       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,

PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

2). Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists. [Ciju] We had tried to recreate the issue with different boxes and different cables. Same behaviour. Problem is seen.

ciju-juniper commented 4 years ago

@gechiang Here is the output that you asked for, from the 201911-based image. This is from a TH2-based platform.

root@sonic:/home/admin# show version

SONiC Software Version: SONiC.201911.0-dirty-20200611.093730
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: d9848197
Build date: Thu Jun 11 02:17:54 UTC 2020
Built by: ciju@sonic-server

1.b). Issue the cmd "bsv" to collect the SAI version output.

root@sonic:/home/admin# bcmcmd bsv
bsv
BRCM SAI ver: [3.7.3.3], OCP SAI ver: [1.5], SDK ver: [6.5.16]
drivshell>

1.c). Issue the cmd "show unit" to collect the ASIC version output.

root@sonic:/home/admin# bcmcmd "show unit"
show unit
Unit 0 chip BCM56970_B0 (current)
drivshell>
root@sonic:/home/admin#

1.d). Issue the cmd "ver" to collect the BRCM SDK version output.

root@sonic:/home/admin# bcmcmd "ver"
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200417 (Fri Apr 17 02:10:18 2020)
From sonicbld@9cacac0fd10c:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:

       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,

       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,

PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

2). Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists. [Ciju] We had tried to recreate the issue with different boxes and different cables. Same behaviour. Problem is seen.

ciju-juniper commented 4 years ago

@gechiang Here is the output that you asked for, from the master.320 image. This is from a TH1-based platform. The issue is seen on the TH2 platform as well.

root@sonic:~# show version

SONiC Software Version: SONiC.master.320-0d809d0d
Distribution: Debian 10.4
Kernel: 4.19.0-6-2-amd64
Build commit: 0d809d0d
Build date: Tue Jun 23 19:03:25 UTC 2020
Built by: johnar@jenkins-worker-4

Platform: x86_64-juniper_qfx5200-r0
HwSKU: Juniper-QFX5200-32C-S
ASIC: broadcom
Serial Number: WD0218170442
Uptime: 07:49:29 up  1:17,  3 users,  load average: 1.78, 1.91, 1.81

1.b). Issue the cmd "bsv" to collect the SAI version output.

drivshell>bsv
bsv
BRCM SAI ver: [3.7.4.2], OCP SAI ver: [1.6.0], SDK ver: [6.5.16]

1.c). Issue the cmd "show unit" to collect the ASIC version output.

drivshell>show unit
show unit
Unit 0 chip BCM56960_B1 (current)

1.d). Issue the cmd "ver" to collect the BRCM SDK version output.

drivshell>ver
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200604 (Thu Jun  4 18:14:20 2020)
From sonicbld@eaa190965a84:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7.4.2/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:

       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,

       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,

PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

gechiang commented 4 years ago

@ciju-juniper Thank you for providing the detailed version information. I was able to reproduce the same issue on our lab DUT, which uses the TH1 (BCM56960_B1) chip. I have gathered the necessary information and have contacted BRCM to investigate this issue. I will update this thread when I have more information. Thanks!

BaluAlluru commented 4 years ago

The issue is also seen on the Broadcom TH2 chipset.

ciju-juniper commented 4 years ago

@gechiang @rlhui @smaheshm Could we list this as a must-fix for the next SAI release, as it is very basic and affects Broadcom ASICs?

gechiang commented 4 years ago

@ciju-juniper We are already working on the next SAI release for the master branch. I was told that it will be pulled into the master branch in a few more days. We have internal 20191130 builds that use the new SAI release for qualification testing, and I did not observe the missing link-bounce notification issue on the TH1 ASIC. I will let you know when the new SAI becomes available on the master branch, if that happens before BRCM shares any information on the particular issue you reported. Thanks!

ciju-juniper commented 4 years ago

@gechiang Are you referring to SONiC 201911 branch? If yes, we have "BRCM SAI ver: [3.7.3.3], OCP SAI ver: [1.5], SDK ver: [6.5.16]" now. Is this version going to be upgraded?

gechiang commented 4 years ago

@ciju-juniper The one that we are qualifying is a derivative of the following version: BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.5.2], SDK ver: [6.5.16]. This may change for the master branch, so please don't quote me on it if the actual released version differs from the one shown here. We are qualifying this new SAI internally, so it is not yet visible in any of the public branches.

ciju-juniper commented 4 years ago

@gechiang Thanks! Version doesn't matter as long as this issue gets fixed.

gechiang commented 4 years ago

@ciju-juniper Yes, it is fixed. I just loaded a private build based on the master branch with the new version of BRCM SAI installed, repeated the same link flap exercise, and everything is working fine! BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.6.3], SDK ver: [6.5.16]

Please wait for the following PR to be approved and merged to master branch. https://github.com/Azure/sonic-buildimage/pull/4847 Thanks!

ciju-juniper commented 4 years ago

@gechiang Can I take the sonic-broadcom.bin image from #4847 and give it a try?

gechiang commented 4 years ago

@ciju-juniper Sure. But please note that this is an internal image, so use it to validate this particular issue only. Thanks!

ciju-juniper commented 4 years ago

@gechiang We tested the sonic-broadcom.bin built as part of #4847 on a TH2 platform. The issue is still seen.

drivshell>bsv
bsv
BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.6.3], SDK ver: [6.5.16]
drivshell>show unit
show unit
Unit 0 chip BCM56970_B0 (current)
drivshell>ver
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200624 (Wed Jun 24 20:05:41 2020)
From sonicbld@360cbe419987:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7.5.1/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:

       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,

       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,

PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

root@sonic:~# show version

SONiC Software Version: SONiC.HEAD.3708-004146f03
Distribution: Debian 10.4
Kernel: 4.19.0-6-2-amd64
Build commit: 004146f03
Build date: Thu Jun 25 02:19:18 UTC 2020
Built by: johnar@jenkins-worker-1

Platform: x86_64-juniper_qfx5210-r0
HwSKU: Juniper-QFX5210-64C
ASIC: broadcom
Serial Number: YJ0219370007
Uptime: 12:18:08 up  2:06,  2 users,  load average: 1.56, 1.65, 1.56

ciju-juniper commented 4 years ago

@gechiang The issue is seen on the TH1 platform as well with the image built in #4847.

gechiang commented 4 years ago

@ciju-juniper That is quite surprising to me. I am not sure why the result is different for you; in my case I clearly see that the behavior changed. Let's wait for #4847 to be merged into the master branch and then try again. If you have access to the latest 201911 branch image, can you try build 105 or above and see whether you also see the issue with that version?

ciju-juniper commented 4 years ago

@gechiang We verified this issue with Jenkins build-107 of 201911 branch on a TH2 platform. No issues observed while doing OIR. Link status & SAI port up/down events are updated correctly.

root@sonic:/home/admin# show version

SONiC Software Version: SONiC.HEAD.107-d32beffe
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: d32beffe
Build date: Sun Jun 28 22:54:17 UTC 2020
Built by: johnar@jenkins-worker-7
root@sonic:/home/admin# bcmcmd "bsv"
bsv
BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.5.2], SDK ver: [6.5.16]
drivshell>
root@sonic:/home/admin# bcmcmd "show unit"
show unit
Unit 0 chip BCM56970_B0 (current)
drivshell>
root@sonic:/home/admin# bcmcmd "ver"
ver
Broadcom Command Monitor: Copyright (c) 1998-2020 Broadcom
Release: sdk-6.5.16 built 20200626 (Fri Jun 26 22:57:10 2020)
From sonicbld@bcdff79d8525:/var/sonicbld/workspace/Build/broadcom/broadcom_sai/20-sai-build-brcm-3.7/output/x86-xgs5-deb80//sdk/bcmsdk
Platform: X86
OS: Unix (Posix)
Chips:

       BCM56640_A0,
       BCM56850_A0,
       BCM56340_A0,
       BCM56960_A0, BCM56860_A0,

       BCM56970_A0, BCM56870_A0,
       BCM56980_A0, BCM56980_B0,

PHYs:  BCM5400, BCM54182, BCM54185, BCM54180,
    BCM54140, BCM54192, BCM54195, BCM54190,
    BCM54194, BCM54210, BCM54220, BCM54280,
    BCM54282, BCM54240, BCM54285, BCM5428X,
    BCM54290, BCM54292, BCM54294, BCM54295,
    BCM54296, BCM8750, BCM8752, BCM8754,
    BCM84740, BCM84164, BCM84758, BCM84780,
    BCM84784, BCM84318, BCM84328, Sesto,
    copper sfp

drivshell>

Here is the complete test log with build 107 on 201911 branch: QFX5210_20911_Jenkins107_logs.txt

gechiang commented 4 years ago

@ciju-juniper Thanks for trying it and confirming that the issue is not observed with 201911.107, which uses the SAI version BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.5.2], SDK ver: [6.5.16]. Please note that the master branch is also moving towards a similar version of this SAI. Unfortunately, build #330 was supposed to have this change, but the build failed. Please wait for the next good build on the master branch and try it out. I will do the same as soon as a good master branch build becomes available and will update you on my findings.
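
Since several `bsv` outputs are being compared across images in this thread, a short sketch for parsing and diffing them may help keep the versions straight. The strings below are copied from the outputs posted above; the helper names and the dictionary labels are hypothetical, and the regular expression assumes the `bsv` line always has the three bracketed fields seen here.

```python
import re

BSV_LINE = re.compile(
    r"BRCM SAI ver: \[(?P<brcm_sai>[^\]]+)\], "
    r"OCP SAI ver: \[(?P<ocp_sai>[^\]]+)\], "
    r"SDK ver: \[(?P<sdk>[^\]]+)\]"
)

# bsv outputs quoted earlier in this thread, keyed by illustrative labels.
BSV_OUTPUTS = {
    "master.320": "BRCM SAI ver: [3.7.4.2], OCP SAI ver: [1.6.0], SDK ver: [6.5.16]",
    "201911.107": "BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.5.2], SDK ver: [6.5.16]",
    "pr4847": "BRCM SAI ver: [3.7.5.1], OCP SAI ver: [1.6.3], SDK ver: [6.5.16]",
}


def parse_bsv(text):
    """Return {'brcm_sai': ..., 'ocp_sai': ..., 'sdk': ...} or None if no match."""
    match = BSV_LINE.search(text)
    return match.groupdict() if match else None


def version_tuple(version):
    """Turn '3.7.5.1' into (3, 7, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def differing_fields(a, b):
    """Return the sorted field names whose values differ between two parsed outputs."""
    return sorted(key for key in a if a[key] != b[key])


versions = {label: parse_bsv(text) for label, text in BSV_OUTPUTS.items()}
print(differing_fields(versions["201911.107"], versions["pr4847"]))
```

Comparing the working 201911.107 image with the #4847 image this way shows which version fields the two have in common and which differ, which narrows down where to look.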

ciju-juniper commented 4 years ago

@gechiang @lguohan @rlhui @smaheshm Instead of trying out various images, can we try to root-cause the problem in SAI? Have you opened a Broadcom issue for this? The risk is that even if this issue gets resolved in a certain version, it may reappear in later releases. So we need to understand the bug in the code and fix it the right way.

Since we have already confirmed that #4847 doesn't solve this, how can we expect the latest master to work? It has already been 4 months, and we have had three new SAI releases without this basic OIR working.

lguohan commented 4 years ago

@gechiang, are we able to repro this issue internally? I heard this issue only applies to 100G ports; is that correct? If yes, can we open an issue with Broadcom to track this?

gechiang commented 4 years ago

@ciju-juniper I did open a case with BRCM for this previously, but because the reported SAI version was old (3.7.4.2) and I was not able to reproduce the issue after moving to SAI version 3.7.5.1, I closed the case. The reason I asked you to wait for the next master branch release is that the new SAI checked in and merged into the master branch has an additional fix for TH/TH2 ASICs on top of the one you tried (using #4847). If that one still fails for you, I can reopen the case and ask for a resolution from BRCM. I am also eagerly waiting for that version to come out in an official master branch image. In order to work with BRCM, I need to be able to reproduce the issue and gather the debug info that they may need. I think we are close to getting a good build out of the master branch...

gechiang commented 4 years ago

@lguohan I was not able to reproduce the issue on a 40G-port platform using the TH ASIC with SAI ver 3.7.4.2. Only on platforms that use 100G ports with SAI version 3.7.4.2 was I able to see SAI hang and stop notifying the application of port flap events. Internally, SAI maintained the port state correctly for each link bounce, similar to what Juniper reported. I then moved to the higher SAI version 3.7.5.1 and was not able to reproduce this issue on 100G-port platforms, and so determined that the issue is fixed in the newer SAI version. Once we have a good master branch image I will retest. The version that Juniper used, although labelled "3.7.5.1", was not a complete version, as it did not have the BUSY fix for TH/TH2 ASICs. We should have that in the next good master branch image. For some reason the master build has been highly unstable these past few days, and there has been no good image since the new SAI PR was pulled in. The sooner we get a good master branch build, the sooner we can focus on this issue again.

ciju-juniper commented 4 years ago

@gechiang @lguohan I doubt the busy fix (for syncd running at more than 100% CPU) will have any impact on the OIR events. The reason is simple: even with syncd running at >100% CPU, the 201911 branch correctly reports the port up/down events, whereas the master branch doesn't.

gechiang commented 4 years ago

@ciju-juniper That is why it is very important to have a good master branch image with the latest SAI so that we can determine if the issue is still seen and whether it is really with SAI or somewhere else... Thanks!

lguohan commented 4 years ago

@gechiang, I just triggered two Broadcom master branch builds. By the way, you should be able to trigger them yourself on the public Jenkins. Can you check?

ciju-juniper commented 4 years ago

@gechiang Please have a look at the bug history and the debugging steps that we did with @lguohan. It was very clear that SAI is not sending the port up/down events. If you have any further suggestions, we can try those as well.

The master branch is highly unstable now. We just found a crash in xcvrd and will open an issue shortly. It is caused by the xcvrd PI code not being updated properly. That means the latest Jenkins image will not work cleanly.

We don't want to miss the fix for this issue in the 202006 release. Our release schedule (and maybe that of other vendor platforms too) is impacted by this issue.