nxp-archive / openil_sja1105-tool

The sja1105-tool is a Linux userspace application for configuring the NXP SJA1105 Automotive Ethernet L2 switch.
BSD 3-Clause "New" or "Revised" License

Bandwidth problem with sja1105 port #47

Open Meng0527 opened 5 years ago

Meng0527 commented 5 years ago

Hi, I found a problem when I tried the Qbv demo. I connected two hosts through an SJA1105 and found that the bandwidth between them is unstable: it is sometimes close to 1000 Mbps, but sometimes only about 500 Mbps. The switch uses the default TSN configuration. When I connected the two hosts through two SJA1105 switches, the situation got worse and the bandwidth was only about 10 Mbps. Do you know the reason for this problem?

vladimiroltean commented 5 years ago

The default configuration for Qbv does not permit 1000 Mbps bandwidth for best-effort traffic anyway. However, per the IEEE spec, when the Qbv engine is not running, all gates should be open, so all bandwidth is available for regular traffic. And the ptp4l program only starts the Qbv engine when the PTP offset is small enough. So in this case, perhaps the issue is when the best-effort bandwidth is full, not when it isn't? May I suggest that your situation might be restated as "PTP time sync is not stable and the Qbv engine is stopping"? If you think this is not the case, can you please provide port counters (sja1105-tool status port) so we can investigate possible frame drops?

Meng0527 commented 5 years ago

My description may have been confusing. I did not enable Qbv when testing bandwidth; the test was meant as a baseline for comparison with the Qbv demo. The bandwidth is now normal again, so I will report back the next time the problem happens.

Meng0527 commented 5 years ago

I think I know the reason for the SJA1105 bandwidth change. When this happens, there are errors on the ingress port of the SJA1105 that cause frame drops, so the tester cannot get the correct bandwidth value. The dropped frames account for about one thousandth of the total. The errors are mainly CRCERR, plus some SOFERR and MIIERR. When testing only one SJA1105, this happens occasionally; when it is connected to another SJA1105 or to a normal switch, the problem always occurs. Here are the port counters of the ingress port.

MAC-Level Diagnostic Counters N_RUNT 0 N_SOFERR 123 N_ALIGNERR 0 N_MIIERR 255

MAC-Level Diagnostic Flags TYPEERR 0 SIZEERR 0 TCTIMEOUT 0 PRIORERR 0 NOMASTER 0 MEMOV 0 MEMERR 0 INVTYP 0 INTCYOV 0 DOMERR 0 PCFBAGDROP 0 SPCPRIOR 0 AGEPRIOR 0 PORTDROP 0 LENDROP 0 BAGDROP 0 POLICEERR 0 DRPNON664ERR 0 SPCERR 0 AGEDRP 0

High-Level Diagnostic Counters N_N664ERR 0 N_VLANERR 0 N_UNRELEASED 0 N_SIZERR 0 N_CRCERR 6735 N_VLNOTFOUND 0 N_BEPOLERR 0 N_POLERR 0 N_RXFRM 353977 N_RXBYTE 181234516 N_TXFRM 15 N_TXBYTE 1404 N_QFULL 0 N_PART_DROP 0 N_EGR_DISABLED 0 N_NOT_REACH 0

vladimiroltean commented 5 years ago

Thank you for the investigation done so far. Do you still have a system running with this packet loss issue? Could you please tell me the output of the following:

source /etc/init.d/S46sja1105-link-speed-fixup
etsec_mdio write 3 0x1c $((3 << 10)) && etsec_mdio read 3 0x1c
etsec_mdio write 3 0x18 $(((7 << 12) | 7)) && etsec_mdio read 3 0x18
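
For reference, the shell arithmetic in the two write commands above is expanded to a plain number before etsec_mdio receives it. The sketch below only prints those numbers in hex; `to_hex` is a hypothetical helper introduced for illustration (it is not part of the OpenIL scripts), and nothing is assumed here about what the individual register bits mean.

```shell
# to_hex is an illustrative helper (an assumption, not from the thread).
# It prints its numeric argument in hexadecimal.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex "$((3 << 10))"         # value written to register 0x1c
to_hex "$(((7 << 12) | 7))"   # value written to register 0x18
```

This prints 0xc00 and 0x7007, which are the values the two write commands send to registers 0x1c and 0x18.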

May I also know which port number these errors are seen on? I need this info for some further commands.

Meng0527 commented 5 years ago

This is the result of a recent test. These errors are seen on port 0 (eth5).

[root@OpenIL:~]# sja1105-tool status port 0
Port 0

MAC-Level Diagnostic Counters N_RUNT 0 N_SOFERR 6 N_ALIGNERR 0 N_MIIERR 21

MAC-Level Diagnostic Flags TYPEERR 0 SIZEERR 0 TCTIMEOUT 0 PRIORERR 0 NOMASTER 0 MEMOV 0 MEMERR 0 INVTYP 0 INTCYOV 0 DOMERR 0 PCFBAGDROP 0 SPCPRIOR 0 AGEPRIOR 0 PORTDROP 0 LENDROP 0 BAGDROP 0 POLICEERR 0 DRPNON664ERR 0 SPCERR 0 AGEDRP 0

High-Level Diagnostic Counters N_N664ERR 0 N_VLANERR 0 N_UNRELEASED 0 N_SIZERR 0 N_CRCERR 657 N_VLNOTFOUND 0 N_BEPOLERR 0 N_POLERR 0 N_RXFRM 667782 N_RXBYTE 341902336 N_TXFRM 55 N_TXBYTE 5498 N_QFULL 0 N_PART_DROP 0 N_EGR_DISABLED 0 N_NOT_REACH 0
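
The share of frames hit by CRC errors can be estimated directly from the counters in this dump. The following is a minimal sketch, assuming that N_RXFRM is an acceptable denominator for the ratio (whether it includes the errored frames is not stated in the thread); `crc_error_percent` is a hypothetical helper name.

```shell
# crc_error_percent CRCERR RXFRM
# Prints CRC errors as a percentage of received frames.
# Assumption: N_RXFRM is used as the total; the thread does not say
# whether errored frames are counted in it.
crc_error_percent() {
    awk -v err="$1" -v rx="$2" 'BEGIN {
        if (rx == 0) { print "n/a"; exit }
        printf "%.3f\n", 100 * err / rx
    }'
}

# N_CRCERR and N_RXFRM from the port 0 dump above.
crc_error_percent 657 667782
```

For this dump the result is 0.098, that is, roughly one frame in a thousand.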

Meng0527 commented 5 years ago

[root@OpenIL:init.d]# ./S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:init.d]# etsec_mdio write 3 0x1c $((3 << 10)) && etsec_mdio read 3 0x1c
0xe00
[root@OpenIL:init.d]# etsec_mdio write 3 0x18 $(((7 << 12) | 7)) && etsec_mdio read 3 0x18
0x71e7

vladimiroltean commented 5 years ago

Can you please also run the following commands? Run them once before the frame errors occur and once after you observe the RGMII errors (the counters are cleared on read):

source /etc/init.d/S46sja1105-link-speed-fixup
etsec_mdio read 6 0x11
etsec_mdio read 6 0x12
etsec_mdio read 6 0x13
etsec_mdio read 6 0x1A
etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15

Also, what would it take for me to try to reproduce this? How many cables do you have connected to the switch? Is the temperature higher than usual? It happens even when the link partner is another LS1021A-TSN switch port, right? Are both boards connected to the same ground reference? Are the PHY LEDs still on when this issue happens? Does it happen on a single board/single port?

Meng0527 commented 5 years ago

Before sending the test stream:

Port 0

MAC-Level Diagnostic Counters N_RUNT 0 N_SOFERR 0 N_ALIGNERR 0 N_MIIERR 0

High-Level Diagnostic Counters N_CRCERR 0 N_RXFRM 0 N_RXBYTE 0 N_TXFRM 91 N_TXBYTE 17874

[root@OpenIL:]# /etc/init.d/S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:]# etsec_mdio read 6 0x11
0x321
[root@OpenIL:]# etsec_mdio read 6 0x12
0x0
[root@OpenIL:]# etsec_mdio read 6 0x13
0xff
[root@OpenIL:]# etsec_mdio read 6 0x1A
0xc3e
[root@OpenIL:]# etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15
0x0

Meng0527 commented 5 years ago

After sending the test stream:

Port 0

MAC-Level Diagnostic Counters N_RUNT 0 N_SOFERR 255 N_ALIGNERR 0 N_MIIERR 255

High-Level Diagnostic Counters N_CRCERR 15789 N_RXFRM 3232540 N_RXBYTE 1655060480 N_TXFRM 3 N_TXBYTE 222

[root@OpenIL:etc]# /etc/init.d/S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:etc]# etsec_mdio read 6 0x11
0x2321
[root@OpenIL:etc]# etsec_mdio read 6 0x12
0x0
[root@OpenIL:etc]# etsec_mdio read 6 0x13
0xff
[root@OpenIL:etc]# etsec_mdio read 6 0x1A
0x2c3e
[root@OpenIL:etc]# etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15
0x0
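
Comparing the register readouts taken before and after the test stream shows which bits changed. The sketch below XORs the two readings of registers 0x11 and 0x1A as posted in this thread; `changed_bits` is a hypothetical helper, and no interpretation of the bit is assumed beyond what the readouts show.

```shell
# changed_bits BEFORE AFTER
# Prints, in hex, the bits that differ between two register readouts.
# Illustrative helper only; not part of sja1105-tool or the OpenIL scripts.
changed_bits() {
    printf '0x%x\n' "$(( $1 ^ $2 ))"
}

changed_bits 0x321 0x2321   # register 0x11, before vs. after
changed_bits 0xc3e 0x2c3e   # register 0x1A, before vs. after
```

Both calls print 0x2000, so in each register only bit 13 differs between the two readouts. Registers 0x12, 0x13 and 0x15 read the same before and after.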

Meng0527 commented 5 years ago

The following is a simplified topology:

[image] [image]

1. Usual temperature.
2. Yes.
3. Yes.
4. The PHY LEDs are still on.
5. It sometimes happens on a single board.

vladimiroltean commented 5 years ago

Are you performing any SPI transactions to any of the switches when this is happening? Or are the systems simply idling and passing traffic?

Meng0527 commented 5 years ago

No, I do nothing with it while sending the test stream.

vladimiroltean commented 5 years ago

The PHY counters I asked you to read are indicating that bad start-of-stream delimiters have been found in received frames since the last readout. So whatever the SJA1105 port is seeing, the PHY is seeing too. You have shown two diagrams above. In both of them, the tester is connected to ETH4 and ETH5. However, the ETH4/ETH5 pair is also used in the second diagram to interconnect two LS1021A-TSN boards. Then you are showing a list of counters for SJA1105 port 0, which is confusingly ETH5. What is the link partner of the port that's seeing bad SSD frames? Always the tester, always the LS1021A-TSN, or both?

May I know what the tester is testing for? Frame preemption, by any chance? Does the tester have the ability to decode raw Ethernet code words? Do you have a capture of the frames that trigger the bad SSD error? What is the structure of the test stream?

Meng0527 commented 5 years ago

The ETH5 port connected to the tester (LS1021ATSN in Figure 1 and LS1021ATSN-1 in Figure 2) sometimes sees packet loss; the ETH5 port of LS1021ATSN-2 in Figure 2, which is connected to an LS1021ATSN, always sees it. The counter values of the ports that lose packets are very close in the different diagrams, so I only show one. The tester only performs basic parameter testing (bandwidth, delay, etc.) and does not involve any TSN functions. The test stream consists of 512-byte broadcast Ethernet frames.

Test frames captured.zip
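
As a sanity check, the 512-byte frame length can be compared against the byte and frame counters posted earlier in the thread. This is a minimal sketch, assuming N_RXBYTE divided by N_RXFRM is a fair estimate of the average received frame size; `avg_frame_bytes` is a hypothetical helper name.

```shell
# avg_frame_bytes RXBYTE RXFRM
# Prints the average number of bytes per received frame.
# Assumption: both counters cover the same set of frames.
avg_frame_bytes() {
    awk -v bytes="$1" -v frames="$2" 'BEGIN {
        if (frames == 0) { print "n/a"; exit }
        printf "%.1f\n", bytes / frames
    }'
}

# N_RXBYTE and N_RXFRM from the "after sending the test stream" dump.
avg_frame_bytes 1655060480 3232540
```

This prints 512.0, which matches the stated frame length.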

vladimiroltean commented 5 years ago

Have you made any progress with this? I am not able to confirm the behavior with traffic based on your PCAP, or provide other debugging hints. Is your switch configuration XML different from the standard?

jihe123 commented 4 years ago

Hello, I am running a demo with one TSN board (LS1021ATSN), following the PDF (Open Industrial Linux User Guide Release v0.2), but when I did the schedule configuration (6.8.6), something went wrong:

[root@OpenIL:~]# sja1105-tool conf mod schedule-table entry-count 2
[root@OpenIL:sja1105]# for i in 0 1; do sja1105-tool conf mod schedule-table[$i] \destports 0b00100;done
Index out of bounds! Please adjust the entry count of the table: