open-switch / opx-nas-l3

https://openswitch.net

Around 10% traffic loss observed after running for few hours #26

Closed waliulislam closed 6 years ago

waliulislam commented 6 years ago

Setup: I have a 5-node topology. CORE (CR1) is connected to two AGGREGATION devices (AG1 and AG2). Two TORs (TR1 and TR2) are connected to those AG switches. TR1 is the DUT here, and it is an S5148 platform. The rest of the devices are different platforms. Each AG device has one 16-port LAG to CR1 and two 8-port LAGs, one to each TR device. There are 64 L3 VLANs between CR1 and each AG, and 32 L3 VLANs between each AG and each TR. FRR is used to provide L3 routing. The Ixia connected to CR1 sends 16k BGP routes, and the Ixia ports connected to both TRs have 4200 hosts across 10 VLANs. There is bidirectional traffic from the hosts on both TRs to hosts in those 16k prefixes. There is also bidirectional traffic between hosts on the TRs.

All Ixia ports are 10G, and I am running them at 80% capacity. When the traffic is first started, the ports receive traffic as expected. But after running for a while, around 10% of the total traffic that the S5148 receives from the Ixia gets lost. Traffic loss is observed in both directions, towards the core and towards TR2. Here are the port stats after sending 100000 packets (1000 pkt/s) towards TR2:

Ingress:

(xpShell):linkMgr)get_stats 0 3 | grep RxOk
RxOk 0x186a1 TxOK 0x91
(xpShell):linkMgr)get_stats 0 3

Input Arguments are devId=0
-------------
Port#: 3
-------------
RxOk 0x186a1 TxOK 0x9c
RxAll 0x186a1 TxAll 0x9c
RxOctGood 0x4772b18 TxOctGood 0x3bd7
RxOct 0x4772b18 TxOct 0x3bd7
RxUC 0x186a1 TxUC 0x0
RxMC 0x0 TxMC 0x9b
RxBC 0x0 TxBC 0x1
RxPause 0x0 TxPause 0x0
RxPriPause 0x0 TxPriPause 0x0
RxLen<64 0x0 TxLen<64 0x0
RxLen64 0x1 TxLen64 0x1
Rx65-127 0x881 Tx65-127 0x96
Rx128-255 0x2698 Tx128-255 0x0
Rx256-511 0x4ce4 Tx256-511 0x5
Rx512-1023 0x9948 Tx512-1023 0x0
Rx1024-1518 0x715b Tx1024-1518 0x0
Rx1519-2047 0x0 Tx1519-2047 0x0
Rx2048-4095 0x0 Tx2048-4095 0x0
Rx4096-8191 0x0 Tx4096-8191 0x0
Rx8192-9215 0x0 Tx8192-9215 0x0
Rx9216-Up 0x0 Tx9216-Up 0x0
RxPriNum0 0x0 TxPriNum0 0x0
RxPriNum1 0x0 TxPriNum1 0x0
RxPriNum2 0x0 TxPriNum2 0x0
RxPriNum3 0x0 TxPriNum3 0x0
RxPriNum4 0x0 TxPriNum4 0x0
RxPriNum5 0x0 TxPriNum5 0x0
RxPriNum6 0x0 TxPriNum6 0x0
RxPriNum7 0x0 TxPriNum7 0x0
RxPr0Pau1us 0x0 TxPr0Pau1us 0x0
RxPr1Pau1us 0x0 TxPr1Pau1us 0x0
RxPr2Pau1us 0x0 TxPr2Pau1us 0x0
RxPr3Pau1us 0x0 TxPr3Pau1us 0x0
RxPr4Pau1us 0x0 TxPr4Pau1us 0x0
RxPr5Pau1us 0x0 TxPr5Pau1us 0x0
RxPr6Pau1us 0x0 TxPr6Pau1us 0x0
RxPr7Pau1us 0x0 TxPr7Pau1us 0x0
RxFrmAnyEr 0x0 TxErr 0x0
RxFCSEr 0x0 TxVlan 0x97
rxtoolong 0x0 FsigCrcErr 0x0
RxCrcErrInv 0x0 FrmTruncat 0x0
RxFifoFull 0x0 RxStdPau1us 0x0
RxLenEr 0x0 RxUndSize 0x0
RxOverSize 0x0 RxFragment 0x0
RxJabFram 0x0 RxInvPream 0x0
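
As a quick sanity check on the ingress dump above, the Rx size-bucket counters can be summed and compared with RxOk. The following is a minimal Python sketch with the values copied from the port 3 output; the variable and function names are illustrative and not part of xpShell.

```python
# Rx size-bucket counters for port 3, copied from the get_stats output above.
rx_size_buckets = {
    "RxLen<64": "0x0",
    "RxLen64": "0x1",
    "Rx65-127": "0x881",
    "Rx128-255": "0x2698",
    "Rx256-511": "0x4ce4",
    "Rx512-1023": "0x9948",
    "Rx1024-1518": "0x715b",
    "Rx1519-2047": "0x0",
}

# A subset of the Rx error counters from the same dump (all reported as 0x0).
rx_error_counters = {
    "RxFrmAnyEr": "0x0",
    "RxFCSEr": "0x0",
    "RxFifoFull": "0x0",
    "RxLenEr": "0x0",
    "RxOverSize": "0x0",
    "RxFragment": "0x0",
}

rx_ok = int("0x186a1", 16)


def total(counters):
    """Sum a dict of hexadecimal counter strings as integers."""
    return sum(int(value, 16) for value in counters.values())


print("RxOk:", rx_ok)
print("sum of Rx size buckets:", total(rx_size_buckets))
print("sum of Rx error counters:", total(rx_error_counters))
```

If the bucket sum matches RxOk and the error counters are zero, the port accepted every frame it was sent, which would suggest the loss happens after reception rather than on the ingress link.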

Egress:
(xpShell):linkMgr)get_stats 0 13 | grep TxOK
RxOk 0x5a TxOK 0x188b
(xpShell):linkMgr)get_stats 0 15-19 | grep TxOK
RxOk 0x68 TxOK 0x1896
RxOk 0x68 TxOK 0x17a4
RxOk 0x68 TxOK 0x1982
RxOk 0x67 TxOK 0x1897
RxOk 0x67 TxOK 0x1897
(xpShell):linkMgr)get_stats 0 23-24 | grep TxOK
RxOk 0x6d TxOK 0x1898
RxOk 0x6d TxOK 0x1987
(xpShell):linkMgr)get_stats 0 44-47 | grep TxOK
RxOk 0x2c4 TxOK 0x1565
RxOk 0xf5 TxOK 0x138e
RxOk 0xf4 TxOK 0x165a
RxOk 0xf8 TxOK 0x156c
(xpShell):linkMgr)get_stats 0 60-63 | grep TxOK
RxOk 0x105 TxOK 0x170a
RxOk 0x2a6 TxOK 0x1667
RxOk 0x105 TxOK 0x148a
RxOk 0x106 TxOK 0x139d
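
To put a rough number on the loss in this sample, the egress TxOK counters above can be summed and compared with the ingress RxOk on port 3. This is only a sketch: it assumes the ports listed above are the complete set of egress ports for this stream and that all counters were cleared before the run. TxOK also counts any other frames sent on those ports (for example routing protocol packets), so the result is an estimate. The names are illustrative.

```python
# Ingress RxOk on port 3, from the dump above.
ingress_rx_ok = int("0x186a1", 16)

# Egress TxOK values, in the order they appear in the output above.
egress_tx_ok = [
    # get_stats 0 13
    "0x188b",
    # get_stats 0 15-19
    "0x1896", "0x17a4", "0x1982", "0x1897", "0x1897",
    # get_stats 0 23-24
    "0x1898", "0x1987",
    # get_stats 0 44-47
    "0x1565", "0x138e", "0x165a", "0x156c",
    # get_stats 0 60-63
    "0x170a", "0x1667", "0x148a", "0x139d",
]

egress_total = sum(int(value, 16) for value in egress_tx_ok)
missing = ingress_rx_ok - egress_total
loss_percent = 100.0 * missing / ingress_rx_ok

print("ingress RxOk:", ingress_rx_ok)
print("egress TxOK total:", egress_total)
print("frames unaccounted for:", missing)
print("estimated loss: %.2f%%" % loss_percent)
```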
atanu-mandal commented 6 years ago

The issue cannot currently be reproduced anywhere, so I am closing this bug for now. We can reopen it if the issue is seen again.