Bandwidth problem with sja1105 port

Meng0527 commented 5 years ago

Hi, I found a problem when I tried the Qbv demo. I connected two hosts through an sja1105 and found that the bandwidth between them is unstable. It is sometimes close to 1000 Mbps, but sometimes only about 500 Mbps. The switch is set to the default TSN configuration. When I connected the two hosts via two sja1105 switches, the situation got worse and the bandwidth was only about 10 Mbps. Do you know the reason for this problem?

vladimiroltean commented 5 years ago

The default configuration for Qbv does not permit 1000 Mbps bandwidth for best-effort traffic anyway. However, per the IEEE spec, when the Qbv engine is not running, all gates should be open, so all bandwidth is available for regular traffic. And the ptp4l program only starts the Qbv engine when the PTP offset is small enough. So in this case, perhaps the issue is when the best-effort bandwidth is full, not when it isn't? May I suggest that your situation might be restated as "PTP time sync is not stable and the Qbv engine is stopping"? If you think this is not the case, can you please provide port counters (sja1105-tool status port) so we can investigate possible frame drops?

Meng0527 commented 5 years ago

My description may have been confusing. I did not enable Qbv when testing bandwidth; the test was a baseline for comparison with the Qbv demo. The bandwidth is back to normal now, so I will report back the next time this problem happens.

Meng0527 commented 5 years ago

I think I know the reason for the sja1105 bandwidth change. When this happens, there are errors on the ingress port of the sja1105 that cause frame drops, so the tester cannot get the correct bandwidth value. The dropped frames account for about one thousandth of the total. The errors are mainly CRCERR, with some SOFERR and MIIERR. When testing only one sja1105, this happens occasionally; when connecting it to another sja1105 or a normal switch, the problem always occurs. Here are the port counters of the ingress port:

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 123
N_ALIGNERR 0
N_MIIERR 255

MAC-Level Diagnostic Flags
TYPEERR 0
SIZEERR 0
TCTIMEOUT 0
PRIORERR 0
NOMASTER 0
MEMOV 0
MEMERR 0
INVTYP 0
INTCYOV 0
DOMERR 0
PCFBAGDROP 0
SPCPRIOR 0
AGEPRIOR 0
PORTDROP 0
LENDROP 0
BAGDROP 0
POLICEERR 0
DRPNON664ERR 0
SPCERR 0
AGEDRP 0

High-Level Diagnostic Counters
N_N664ERR 0
N_VLANERR 0
N_UNRELEASED 0
N_SIZERR 0
N_CRCERR 6735
N_VLNOTFOUND 0
N_BEPOLERR 0
N_POLERR 0
N_RXFRM 353977
N_RXBYTE 181234516
N_TXFRM 15
N_TXBYTE 1404
N_QFULL 0
N_PART_DROP 0
N_EGR_DISABLED 0
N_NOT_REACH 0
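
To put the counters above in proportion, the error count can be expressed as a percentage of received frames. This is only a minimal sketch: `error_ratio` is a hypothetical helper, and it assumes that N_RXFRM is a reasonable denominator for N_CRCERR, which the thread does not confirm.

```shell
# Hypothetical helper: errored frames as a percentage of received frames.
# Assumption: N_RXFRM is a suitable denominator for N_CRCERR.
error_ratio() {
    # $1 = error counter, $2 = received-frame counter
    awk -v err="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * err / total }'
}

# N_CRCERR and N_RXFRM from the counter dump above.
error_ratio 6735 353977
```

For this particular dump the ratio comes out closer to 2% than to one in a thousand, so the loss rate appears to vary between runs.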

vladimiroltean commented 5 years ago

Thank you for the investigation done so far. Do you still have a system running with this packet loss issue? Could you please tell me the output of the following:

source /etc/init.d/S46sja1105-link-speed-fixup
etsec_mdio write 3 0x1c $((3 << 10)) && etsec_mdio read 3 0x1c
etsec_mdio write 3 0x18 $(((7 << 12) | 7)) && etsec_mdio read 3 0x18

May I also know which port number these errors are seen on? I need this info for some further commands.
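
For readers following along, the shell arithmetic in the two write commands above expands to fixed values. The sketch below only prints those values; `to_hex` is a hypothetical helper, and what the bits mean inside the PHY is not stated in the thread.

```shell
# Hypothetical helper: print a number the way the MDIO reads in this thread are shown.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex "$(( 3 << 10 ))"         # value written to register 0x1c
to_hex "$(( (7 << 12) | 7 ))"   # value written to register 0x18
```

The two lines print 0xc00 and 0x7007 respectively, which makes it easier to see which bits of a later read-back were written by the command and which were returned by the PHY.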

Meng0527 commented 5 years ago

This is the result of a recent test. These errors are seen on port 0 (eth5).

[root@OpenIL:~]# sja1105-tool status port 0
Port 0

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 6
N_ALIGNERR 0
N_MIIERR 21

MAC-Level Diagnostic Flags
TYPEERR 0
SIZEERR 0
TCTIMEOUT 0
PRIORERR 0
NOMASTER 0
MEMOV 0
MEMERR 0
INVTYP 0
INTCYOV 0
DOMERR 0
PCFBAGDROP 0
SPCPRIOR 0
AGEPRIOR 0
PORTDROP 0
LENDROP 0
BAGDROP 0
POLICEERR 0
DRPNON664ERR 0
SPCERR 0
AGEDRP 0

High-Level Diagnostic Counters
N_N664ERR 0
N_VLANERR 0
N_UNRELEASED 0
N_SIZERR 0
N_CRCERR 657
N_VLNOTFOUND 0
N_BEPOLERR 0
N_POLERR 0
N_RXFRM 667782
N_RXBYTE 341902336
N_TXFRM 55
N_TXBYTE 5498
N_QFULL 0
N_PART_DROP 0
N_EGR_DISABLED 0
N_NOT_REACH 0

Meng0527 commented 5 years ago

[root@OpenIL:init.d]# ./S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:init.d]# etsec_mdio write 3 0x1c $((3 << 10)) && etsec_mdio read 3 0x1c
0xe00
[root@OpenIL:init.d]# etsec_mdio write 3 0x18 $(((7 << 12) | 7)) && etsec_mdio read 3 0x18
0x71e7

vladimiroltean commented 5 years ago

Can you please also run the following commands? Run them once before the frame errors occur and once after you observe the RGMII errors, because the counters are cleared on read:

source /etc/init.d/S46sja1105-link-speed-fixup
etsec_mdio read 6 0x11
etsec_mdio read 6 0x12
etsec_mdio read 6 0x13
etsec_mdio read 6 0x1A
etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15

Also, what would it take for me to try to reproduce this? How many cables do you have connected to the switch? Is the temperature higher than usual? It happens even when the link partner is another LS1021A-TSN switch port, right? Are both boards connected to the same ground reference? Are the PHY LEDs still on when this issue happens? Does it happen on a single board/single port?

Meng0527 commented 5 years ago

Before sending the test stream:

Port 0

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 0
N_ALIGNERR 0
N_MIIERR 0

High-Level Diagnostic Counters
N_CRCERR 0
N_RXFRM 0
N_RXBYTE 0
N_TXFRM 91
N_TXBYTE 17874

[root@OpenIL:]# /etc/init.d/S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:]# etsec_mdio read 6 0x11
0x321
[root@OpenIL:]# etsec_mdio read 6 0x12
0x0
[root@OpenIL:]# etsec_mdio read 6 0x13
0xff
[root@OpenIL:]# etsec_mdio read 6 0x1A
0xc3e
[root@OpenIL:]# etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15
0x0

Meng0527 commented 5 years ago

After sending the test stream:

Port 0

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 255
N_ALIGNERR 0
N_MIIERR 255

High-Level Diagnostic Counters
N_CRCERR 15789
N_RXFRM 3232540
N_RXBYTE 1655060480
N_TXFRM 3
N_TXBYTE 222

[root@OpenIL:etc]# /etc/init.d/S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:etc]# etsec_mdio read 6 0x11
0x2321
[root@OpenIL:etc]# etsec_mdio read 6 0x12
0x0
[root@OpenIL:etc]# etsec_mdio read 6 0x13
0xff
[root@OpenIL:etc]# etsec_mdio read 6 0x1A
0x2c3e
[root@OpenIL:etc]# etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15
0x0
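
Comparing the register reads taken before and after the test stream is easier with a bitwise XOR. The sketch below uses a hypothetical `changed_bits` helper that lists the bit positions that differ between two reads. It makes no assumption about what those bits mean, since the thread does not document the PHY register layout.

```shell
# Hypothetical helper: list the bit positions that differ between two register reads.
changed_bits() {
    diff=$(( $1 ^ $2 ))
    bit=0
    out=""
    while [ "$diff" -ne 0 ]; do
        if [ $(( diff & 1 )) -eq 1 ]; then
            out="$out $bit"
        fi
        diff=$(( diff >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "changed bits:${out:- none}"
}

changed_bits 0x321 0x2321   # register 0x11, before vs. after
changed_bits 0xc3e 0x2c3e   # register 0x1A, before vs. after
```

Both comparisons report bit 13 as the only difference, while registers 0x12, 0x13 and 0x15 read back the same values in both runs.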

Meng0527 commented 5 years ago

The following is a simplified topology:

1. Usual temperature.
2. Yes.
3. Yes.
4. The PHY LEDs are still on.
5. It sometimes happens on a single board.

vladimiroltean commented 5 years ago

Are you performing any SPI transactions to any of the switches when this is happening? Or are the systems simply idling and passing traffic?

Meng0527 commented 5 years ago

No, I did nothing with it while sending the test stream.

vladimiroltean commented 5 years ago

The PHY counters I asked you to read are indicating that bad start-of-stream delimiters have been found in received frames since the last readout. So whatever the SJA1105 port is seeing, the PHY is seeing too. You have shown two diagrams above. In both of them, the tester is connected to ETH4 and ETH5. However, the ETH4/ETH5 pair is also used in the second diagram to interconnect two LS1021A-TSN boards. Then you are showing a list of counters for SJA1105 port 0, which is confusingly ETH5. What is the link partner of the port that's seeing bad SSD frames? Always the tester, always the LS1021A-TSN, or both?

May I know what the tester is testing for? Frame preemption, by any chance? Does the tester have the ability to decode raw Ethernet code words? Do you have a capture of the frames that trigger the bad SSD error? What is the structure of the test stream?

Meng0527 commented 5 years ago

The ETH5 port connected to the tester (LS1021ATSN in Figure 1 and LS1021ATSN-1 in Figure 2) sometimes sees packet loss; the ETH5 port connected to another LS1021ATSN (LS1021ATSN-2 in Figure 2) always sees it. The counter values of the ports that lose packets are very close in the different diagrams, so I only show one. The tester only performs basic parameter testing (bandwidth, delay, etc.) and does not involve any TSN functions. The test stream consists of 512-byte broadcast Ethernet frames. Test frames captured.zip

vladimiroltean commented 5 years ago

Have you made any progress with this? I am not able to confirm the behavior with traffic based on your PCAP, or provide other debugging hints. Is your switch configuration XML different from the standard?

jihe123 commented 4 years ago

Hello, I am running the demo with one TSN board (LS1021ATSN), following the PDF (Open Industrial Linux User Guide, Release v0.2), but something went wrong when I did the schedule configuration (6.8.6):

[root@OpenIL:~]# sja1105-tool conf mod schedule-table entry-count 2
[root@OpenIL:sja1105]# for i in 0 1; do sja1105-tool conf mod schedule-table[$i] \
destports 0b00100; done
Index out of bounds! Please adjust the entry count of the table:
config modify entry-count ) modify failed!
Index out of bounds! Please adjust the entry count of the table:
config modify entry-count ) modify failed!

I am new to this. Do you know what is wrong? Thank you.

elsinkior commented 4 years ago

I am demonstrating the TSN functionality of the LS1021ATSN-PA board running Open-IL v1.7 (Xenomai/Cobalt v3.1-devel). I followed the procedure for this hardware in the "Open Industrial Linux User Guide, Rev 1.6, 08/2019" (chapter 7.2), after setting up the network topology presented in chapter 7.2.2, page 114 (three LS1021ATSN-PA boards linked together).

I encountered a problem at the very first step, when setting up the standard configuration (expected results are covered in chapter 7.2.8.5.4).

The bandwidth obtained from board 2 to board 3 and from board 1 to board 3 is erratic and very far from 950 Mbit/s.

I ran "sja1105-tool status port" on each board and observed the N_MIIERR counters incrementing (port 1 on board 1, ports 1 and 2 on board 2, and port 2 on board 3) while iPerf3 was running (source 172.15.0.1, destination 172.15.0.3). Bandwidth drops rapidly (by over 90%) and oscillates around 10 Mbit/s. The same issue occurs in the board 2 to board 3 test.

This test was performed over TCP. With UDP the problem is less obvious, but there is still a loss of 50%.

In addition, in the "Rate-Limiting - Prioritizing configuration" scenario with priorities in place (flow 1 to 3 has priority over flow 2 to 3), I saw (testing over UDP) the bandwidths alternate: 1 to 3 at around 100 Mbit/s and 2 to 3 at around 500 Mbit/s for 5 s, then the reverse, and so on.

As for the "Synchronized Qbv" demonstration, despite the bandwidth problem I could observe the expected result for the "3-HOP" scenario (stable latency of 30 ms). The "1 HOP" scenario, on the other hand, gives an inconsistent result: unstable latency around 15 ms.

Do you have any suggestions for further investigation, or an idea of the origin of the problem, to help us set up a representative demonstration?

In addition to the description above, here is a test that reads the control register of PHY3 and PHY2 of the BCM5464R while iPerf3 is running.

The collision test bit appears to have been set about twenty times on the board where I monitored it, for PORT 2 (ETH3, connected to board 1) and for PORT 1 (ETH2, connected to board 3).

The test steps are described below.

Could you give me more information about the registers read in https://github.com/openil/sja1105-tool/issues/47#issuecomment-491216774? I could not find a register specification for the BCM5464R.

1) B3 : Start iPerf3 server

[root@OpenIL:~]# iperf3 -1 -f m -i 0.5 -s -p 5202

2) B1 : Start iPerf3 client

[root@OpenIL:~]# iperf3 -t 86400 -p 5202 -c 172.15.0.3
...
[ 4] 6.00-6.46 sec 512 KBytes 9.09 Mbits/sec 28 1.41 KBytes

3) B2 : Start register read loops

while true; do etsec_mdio read 3 0x0; done | tee /tmp/outP1_ETH2
while true; do etsec_mdio read 4 0x0; done | tee /tmp/outP2_ETH3

4) B1 : Stop iPerf3 client

5) B2 : Stop register read loops

6) B2 : Read port status

PORT 2

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 3
N_ALIGNERR 0
N_MIIERR 5

PORT 3

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 2
N_ALIGNERR 0
N_MIIERR 117

7) B2 : Count the number of collision test bit rising edges

[root@OpenIL:~]# grep 11e1 /tmp/outP1_ETH2 | wc -l
17

[root@OpenIL:~]# grep 11e1 /tmp/outP2_ETH3 | wc -l
19
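
Note that `grep ... | wc -l` counts every read that returned the value, not only the transitions into it, so a bit that stays set across several consecutive reads is counted several times. A possible way to count rising edges only is sketched below. `count_rising_edges` is a hypothetical helper and assumes the log holds one register value per line, as the etsec_mdio output elsewhere in this thread suggests.

```shell
# Hypothetical helper: count transitions into a value in a log of register reads.
# Assumption: the log file contains one read result per line.
count_rising_edges() {
    # $1 = value to look for (e.g. 11e1), $2 = log file
    awk -v pat="$1" '
        { hit = (index($0, pat) > 0) }   # does this read show the value?
        hit && !prev { n++ }             # count only when the previous read did not
        { prev = hit }
        END { print n + 0 }
    ' "$2"
}
```

It could be called as `count_rising_edges 11e1 /tmp/outP1_ETH2`. If the result matches the grep count, every occurrence was an isolated single read.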

Thank you for your feedback,

vladimiroltean commented 4 years ago

Hi there,

I'm sorry for the trouble and I'm also aware of the NXP support ticket you have opened. The N_MIIERR counter is described in UM10944 PDF as:

This field counts the number of frames that started with a valid start sequence (preamble plus SOF delimiter byte) but terminated with the MII error input being asserted.

So it is perhaps indicative of a hardware issue (misconfiguration or otherwise): the PHY has either asserted the RX_ER signal, or deasserted the RX_DV signal of the switch's MAC. We have not seen this manifest during development or testing.

The unfortunate part is that the default LS1021A-TSN image is not equipped with software for properly debugging this kind of issue. Because sja1105-tool is a user space driver, it does not register net devices in the kernel, so it cannot register with the PHY library to get a driver in control of the BCM5464R, and it cannot even perform any sort of MDIO access towards the PHY. This cannot be changed given that sja1105-tool is what it is (a user space driver).

What the etsec_mdio script does is more of a hack: it copies what the kernel driver does (drivers/net/ethernet/freescale/fsl_pq_mdio.c) and does that from a shell script with raw access to the MDIO controller registers, via devmem. But since the kernel MDIO driver is also running, and the PHY library is polling the 2 AR8031 PHYs for eth0 and eth1 once per second, the results are not completely defined, since MDIO access is not atomic with respect to memory writes in the controller's register map. So the devmem commands might (and will) interfere with the kernel driver doing its work, and vice versa. So I would not blindly trust the 0x11e1 value that you got 17 times in MII_BMCR.
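
To see what a value such as 0x11e1 would claim if taken at face value, the standard control register (register 0, MII_BMCR) can be decoded bit by bit. The sketch below uses the bit names from the Linux kernel's `include/uapi/linux/mii.h`. `decode_bmcr` is a hypothetical helper, not part of sja1105-tool or etsec_mdio.

```shell
# Hypothetical helper: decode the clause 22 control register (MII_BMCR).
# Bit masks and names follow include/uapi/linux/mii.h in the Linux kernel.
decode_bmcr() {
    val=$(( $1 ))
    out=""
    for entry in 0x8000:RESET 0x4000:LOOPBACK 0x2000:SPEED100 0x1000:ANENABLE \
                 0x0800:PDOWN 0x0400:ISOLATE 0x0200:ANRESTART 0x0100:FULLDPLX \
                 0x0080:CTST 0x0040:SPEED1000; do
        mask=${entry%%:*}
        name=${entry#*:}
        if [ $(( $val & $mask )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    # Bits 5:0 are labelled reserved in mii.h; report them raw if any are set.
    rest=$(( $val & 0x3f ))
    if [ "$rest" -ne 0 ]; then
        out="$out other=$(printf '0x%x' "$rest")"
    fi
    echo "${out# }"
}

decode_bmcr 0x11e1
```

For 0x11e1 this reports ANENABLE, FULLDPLX, CTST and SPEED1000 plus low bits 0x21 that mii.h marks as reserved. Stray bits in a reserved field would be consistent with a read that was disturbed, though that alone does not prove it.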

I think that even unbinding eth0 and eth1 would be enough to get a more reliable read:
```
echo soc:ethernet@2d10000 > /sys/bus/platform/drivers/fsl-gianfar/unbind
echo soc:ethernet@2d50000 > /sys/bus/platform/drivers/fsl-gianfar/unbind
```
But if there is a PHY configuration issue, that would still be difficult to spot with raw MDIO accesses. The BCM5464R PHY, in the default OpenIL setup, is left with mostly the defaults configured via pin strapping, with the exception of link speeds which are forced to 1000 in the /etc/init.d/S46sja1105-link-speed-fixup init script. That file is provided in case you need to change the PHY fixed speed if you have a link partner that runs at 100 Mbps. Since you are not in that situation, I would disable the init script altogether, since in theory it is possible that that, too, interferes with the kernel MDIO driver and, as a result, writes something else to the PHY than what is expected.

The lack of a PHY driver was one of the main reasons for moving sja1105 to a kernel driver, and if you are willing to spend some time, then it would be helpful if you could give the mainline kernel a try. There, the switch ports are registered as swp2, swp3, swp4, swp5, and the MAC statistics can be retrieved with ethtool -S swp2 (it's the same information, but has the advantage that the drivers/net/phy/broadcom.c file gets engaged in configuring the BCM5464R). There is even a fork of OpenIL that enables the mainline kernel for this board. The steps for compiling the image (make nxp_ls1021atsn_defconfig && make) are the same.
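
On a mainline kernel, the MAC-level error counters could be collected for several ports at once along these lines. This is a sketch under assumptions: `show_mii_errors` is a hypothetical helper, and the counter names (n_runt, n_soferr, n_alignerr, n_miierr) are taken from the ethtool output posted later in this thread.

```shell
# Hypothetical helper: print the MAC-level error counters for each given port.
# Assumes ethtool -S reports n_runt, n_soferr, n_alignerr and n_miierr,
# as in the output shown elsewhere in this thread.
show_mii_errors() {
    for port in "$@"; do
        echo "$port:"
        ethtool -S "$port" | grep -E 'n_(runt|soferr|alignerr|miierr):'
    done
}
```

A typical call would be `show_mii_errors swp2 swp3 swp4 swp5`, run once before and once after a traffic test so the two snapshots can be compared.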

Hope this helps, -Vladimir

elsinkior commented 4 years ago

Thank you very much for your reply. I will continue the investigation taking into account your advice :)

elsinkior commented 4 years ago

Thank you very much for your advice.

Following your recommendations, I took these steps:
- rebuilt Open-IL (https://github.com/vladimiroltean/openil-community.git) in order to have a driver that manages the MDIO interface to the Broadcom PHY attached to the SJA1105;
- ran a new iPerf3 test between B1 and B3, measuring bandwidth and reading the PHY error counters directly via ethtool -S swp2.

Unfortunately, I observe the same issue, with disastrous bandwidth:

"ethtool -S swp2" result:

```
NIC statistics:
tx_packets: 253860
tx_bytes: 314867536
rx_packets: 109783
rx_bytes: 6632319
n_runt: 0
n_soferr: 17
n_alignerr: 0
n_miierr: 94
```
"ifconfig" result:

```
swp2 Link encap:Ethernet HWaddr 00:04:9F:EF:05:05
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:109469 errors:0 dropped:6626 overruns:0 frame:0
TX packets:252486 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6599238 (6.2 MiB) TX bytes:314766316 (300.1 MiB)
```
Some phytool reads on the PHYs connected to TSN switch ports 2 and 1:

phytool read swp2/3/0x12 0x0003

phytool read swp3/4/0x12 0x0020

phytool read swp3/4/0x12 0x004f

A hardware issue seems to be the cause of the errors encountered with our three boards.

vladimiroltean commented 4 years ago

Thanks for the work investigating this. I have observed some Ethernet PHY issues being correlated with booting the board with the microUSB cable not plugged in. I haven't found the reason for that. Just thought I'd make sure you have those plugged. It's rather strange to have 3 boards fail in the same way, when mostly everybody else hasn't seen that happen.

vladimiroltean commented 4 years ago

Does the PHY report receive errors from other link partners too? What happens if you change the link speed with "ethtool -s swp2 advertise 0x8" (for 100 Mbps, or 0x20 to go back to 1 Gbps)? Could the cables be an issue?
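
The two masks mentioned above are link mode bits from the ethtool man page (0x008 for 100baseT Full, 0x020 for 1000baseT Full). As a small sketch, a hypothetical `advertise_cmd` helper can build the command for a given speed. It only prints the command rather than running it, so it can be reviewed first.

```shell
# Hypothetical helper: build the ethtool command for a given full-duplex speed.
# Mask values are the link mode bits listed in the ethtool man page.
advertise_cmd() {
    # $1 = interface, $2 = speed in Mbps
    case "$2" in
        10)   mask=0x002 ;;
        100)  mask=0x008 ;;
        1000) mask=0x020 ;;
        *)    echo "unsupported speed: $2" >&2; return 1 ;;
    esac
    echo "ethtool -s $1 advertise $mask"
}

advertise_cmd swp2 100
advertise_cmd swp2 1000
```

Comparing the error counters at both speeds might help separate a cabling or signal-integrity problem from a configuration one, though the thread does not report the outcome of such a test.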

nxp-archive / openil_sja1105-tool

Bandwidth problem with sja1105 port #47