siemens / meta-iot2050

SIMATIC IOT2050 Isar/Debian Board Support Package
MIT License

Update TI bits in linux-iot2050 #411

Closed by jan-kiszka 1 year ago

jan-kiszka commented 1 year ago

Align the BSP kernel's TI downstream or backport bits with the latest ti-linux. Also update the prueth firmware along with that. Should resolve #368 and supersedes #374.

This does not include a stable update of the underlying CIP kernel yet, as we are waiting for a recent cip-rt release.

attila-hannibal commented 1 year ago

Dear All!

We tried the Siemens reference image (Chao's branch). The eno1 interface was configured to use DHCP. After several power cycles, eno1 could not get an IP address. Based on the statistics, this seems to be the well-known issue where RX packets get stuck at driver level and never reach the higher layers. Long story short: the DHCP responses (RX packets) are lost in the driver and Linux cannot configure the IP.

The statistics in the ethtool dump show non-zero RX counters. However, ifconfig / netstat show zero RX bytes: ethtool.txt ifconfig.txt journalctl-b.log netstat.txt

Will try Jan's branch variant

attila-hannibal commented 1 year ago

I tried a clean build from jan/kernel-update branch

The network interfaces don't work at all. After 30 seconds a kernel dump appears on the debug console:

...
[ 136.998977] ------------[ cut here ]------------
[ 137.003638] NETDEV WATCHDOG: eno2 (icssg-prueth): transmit queue 0 timed out
[ 137.010788] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x314/0x320
[ 137.019038] Modules linked in: ti_am335x_adc kfifo_buf irq_pruss_intc rfkill icssg_prueth pru_rproc icss_iep ptp cp210x pps_core usbserial ti_k3_r5_remoteproc ti_cal videobuf2_dma_contig ti_am335x_tscadc v4l2_fwnode videobuf2_memops pci_endpoint_test videobuf2_v4l2 videobuf2_common pruss at24 optee_rng rng_core fuse ip_tables x_tables ipv6
[ 137.049140] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.145-cip17 #1
[ 137.055739] Hardware name: SIMATIC IOT2050 Advanced PG2 (DT)
[ 137.061388] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 137.067382] pc : dev_watchdog+0x314/0x320
...

and then the following repeats endlessly:

...
[ 180.006989] icssg-prueth icssg0-eth eno2: xmit timeout
[ 184.870978] icssg-prueth icssg0-eth eno2: xmit timeout
[ 189.990981] icssg-prueth icssg0-eth eno2: xmit timeout
...
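To compare runs, it may help to count these driver messages per interface instead of reading the console by eye. The following is only a sketch: the three patterns are copied from the messages quoted in this thread, the function name is made up for illustration, and other kernel versions may word the messages differently.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this thread (assumption:
# the wording is stable for the kernel under test).
PATTERNS = {
    "watchdog": re.compile(
        r"NETDEV WATCHDOG: (\S+) \(icssg-prueth\): transmit queue \d+ timed out"
    ),
    "xmit_timeout": re.compile(r"icssg-prueth \S+ (\S+): xmit timeout"),
    "cmd_timeout": re.compile(
        r"icssg-prueth \S+ (\S+): timeout waiting for command done"
    ),
}


def count_prueth_errors(log_text):
    """Return a Counter keyed by (interface, error kind)."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), kind)] += 1
    return counts
```

Feeding it the saved console.txt or the output of dmesg gives one number per interface and error kind, which makes "better or worse than the previous build" easier to state.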

This version is even worse; please double-check on your side.

logs attached: console.txt

jan-kiszka commented 1 year ago

Thanks for reporting. I haven't seen this on any boards here so far. Same issue when going two commits backward (70ebcfadac1b6f159640bbb4423d8b9e2d5c12a4)?

jan-kiszka commented 1 year ago

@attila-hannibal please also check if a specific element of your network infrastructure contributes to this (e.g. a specific switch, compared to cross-links).

attila-hannibal commented 1 year ago

Hello @jan-kiszka

I tried the mentioned commit 70ebcfa, but the behaviour is the same: the eno* interfaces don't work and a kernel dump appears after 1-2 minutes. I guessed that the TI pruss firmware might be too new, so I rolled the version back from "08.06.00.001" to "08.02.00.002", but that did not help.

jan-kiszka commented 1 year ago

And if you leave out the firmware update completely? In my tests, our current firmware still worked.

BTW, please also explore my other question if your network infrastructure influences this. I have no luck reproducing it, boxes run for hours with all versions.

attila-hannibal commented 1 year ago

I used the binary artifact from: https://github.com/siemens/meta-iot2050/suites/10438058123/artifacts/516591580

The only change, besides configuring the root password at first startup, was to the /etc/network/interfaces file:

# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source /etc/network/interfaces.d/*

auto eno1
iface eno1 inet dhcp

auto eno2
iface eno2 inet static
    address 192.168.214.230
    netmask 255.255.255.0

The interfaces do not work, and the kernel dumps after a while. Note: during the ping test I pulled out the UTP cable and then plugged it back in. Log attached: console.txt

jan-kiszka commented 1 year ago

Ok, we can rule out build issues on your side.

Some more things to rule out still: please have a look at my other suggestions.

attila-hannibal commented 1 year ago

Using Chao's branch as source, the iot2050 eno2 port was directly connected to a workstation PC using static IPs.

ifconfig shows no RX data:

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.214.230 netmask 255.255.255.0 broadcast 192.168.214.255
        inet6 fe80::8ef3:19ff:fe6c:ee42 prefixlen 64 scopeid 0x20
        ether 8c:f3:19:6c:ee:42 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2872 bytes 809966 (790.9 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

dmesg has the "icssg-prueth icssg0-eth eno2: timeout waiting for command done" line.

So the error is present with a simple point-to-point connection.
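The "TX counts up, RX stays at zero" symptom can be detected automatically from saved ifconfig output. This is a minimal sketch, assuming the net-tools output format shown above; the function names and the min_tx threshold are arbitrary choices for illustration.

```python
import re


def parse_ifconfig_counters(text):
    """Extract RX/TX packet counts from one net-tools `ifconfig` stanza."""
    counters = {}
    for direction, packets in re.findall(r"\b(RX|TX) packets (\d+)", text):
        counters[direction.lower()] = int(packets)
    return counters


def looks_rx_stalled(counters, min_tx=100):
    """True when the port has sent traffic but received nothing at all.

    min_tx is an arbitrary threshold (assumption) to skip freshly raised links.
    """
    return counters.get("rx", 0) == 0 and counters.get("tx", 0) >= min_tx


# Counter lines copied from the eno2 output quoted above.
sample = (
    "eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500\n"
    "        RX packets 0 bytes 0 (0.0 B)\n"
    "        TX packets 2872 bytes 809966 (790.9 KiB)\n"
)
counters = parse_ifconfig_counters(sample)
print(counters, looks_rx_stalled(counters))
```

A check like this only flags the symptom; it cannot tell whether packets are dropped in the firmware, in the driver, or never arrive on the wire.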

jan-kiszka commented 1 year ago

Please don't change two variables at the same time (here: sources AND network setup).

Would still like to see

In addition, please confirm

attila-hannibal commented 1 year ago

After the many clean recompilations I realized that we may have been using different software than "Chao's branch"; it is probably an older variant of your master branch. I guess that in the past the image to be written was iot2050-image-example-iot2050-debian-iot2050.wic.img (with .img suffix), while the current build creates a file name without the .img suffix. When I switched git branches and started the build, the old file remained in the build directory and I burned the old one to the USB stick. The reason I think so is that the kernel version in the logs is 5.10.104-cip3, which was already replaced in October; both of your dev branches and master use 5.10.145. Sorry, it was my mistake.

I have the same issue of non-working Ethernet ports with Chao's branch, too.

Let me explain our test setups: we have two iot2050 instances.

I will provide the logs you requested later.

attila-hannibal commented 1 year ago

We may try another approach. When the "stalled Ethernet port" problem happens, we have this log in dmesg: "icssg-prueth icssg0-eth eno2: timeout waiting for command done". This error is written in the "emac_set_port_state()" function: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/net/ethernet/ti/icssg_config.c?h=ti-linux-5.10.y#n580 The function sends 4 commands (each command is a uint32_t) to the R30 / R32 register addresses, then polls the registers to see whether the TI firmware on the other end has cleared them. When any of the 4 registers is not cleared, the driver prints this error; this is what we get randomly.

So based on the symptom, it seems the TI firmware running on the PRU got stuck somehow and cannot handle the commands.

Do you know of any debug facility to verify whether the firmware running on the PRU is "healthy" or has issues?
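The write-then-poll handshake described above can be modelled in a few lines. This is a simulation under stated assumptions, not the driver's code: FakeCommandRegisters, its stuck flag, the sample command word and the timeout values are all hypothetical; only the idle value 0xffff0000 (EMAC_NONE) and the "four words, poll until cleared" behaviour come from this thread.

```python
import time

EMAC_NONE = 0xFFFF0000  # "no command pending" value quoted in this thread


class FakeCommandRegisters:
    """Stand-in for the four command words (hypothetical model).

    With stuck=True the simulated firmware never acknowledges word 0.
    """

    def __init__(self, stuck=False):
        self.words = [EMAC_NONE] * 4
        self.stuck = stuck

    def write(self, commands):
        self.words = list(commands)

    def firmware_step(self):
        # Healthy firmware clears every word; stuck firmware skips word 0.
        start = 1 if self.stuck else 0
        for i in range(start, len(self.words)):
            self.words[i] = EMAC_NONE


def send_commands(regs, commands, timeout_s=0.05, poll_s=0.005):
    """Write the command words, then poll until all are cleared or time runs out."""
    regs.write(commands)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        regs.firmware_step()
        if all(word == EMAC_NONE for word in regs.words):
            return True
        time.sleep(poll_s)
    # This is the point where the driver, as described above, would report
    # "timeout waiting for command done".
    return False


# Illustrative command set only; the real command encodings are not shown here.
commands = [0xFFBB0000, EMAC_NONE, EMAC_NONE, EMAC_NONE]
print(send_commands(FakeCommandRegisters(), commands))
print(send_commands(FakeCommandRegisters(stuck=True), commands))
```

The model makes the failure mode explicit: the host side cannot distinguish a crashed firmware from a slow one, it only sees that a word was not cleared before the deadline.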

attila-hannibal commented 1 year ago

This is a command register dump taken when the problem occurred: the first value should be 0xffff0000 (EMAC_NONE), too, but it stays at 0xffbb0000

root@DM8CF3196AE6B5:~# devmem2 0xb0005ac
/dev/mem opened.
Memory mapped at address 0xffffbebe5000.
Read at address  0x0B0005AC (0xffffbebe55ac): 0xFFBB0000
root@DM8CF3196AE6B5:~# ^C
root@DM8CF3196AE6B5:~# devmem2 0xb0005b0
/dev/mem opened.
Memory mapped at address 0xffffb2770000.
Read at address  0x0B0005B0 (0xffffb27705b0): 0xFFFF0000
root@DM8CF3196AE6B5:~# devmem2 0xb0005b4
/dev/mem opened.
Memory mapped at address 0xffff809b4000.
Read at address  0x0B0005B4 (0xffff809b45b4): 0xFFFF0000
root@DM8CF3196AE6B5:~# devmem2 0xb0005b8
/dev/mem opened.
Memory mapped at address 0xffff90003000.
Read at address  0x0B0005B8 (0xffff900035b8): 0xFFFF0000
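For repeated checks, the devmem2 output can be parsed so that only the words that differ from the idle value are reported. A minimal sketch, assuming the devmem2 line format shown above; the function name is made up for illustration.

```python
import re

EMAC_NONE = 0xFFFF0000  # expected idle value, per the comment above

# Matches lines such as:
#   Read at address  0x0B0005AC (0xffffbebe55ac): 0xFFBB0000
READ_LINE = re.compile(
    r"Read at address\s+0x([0-9A-Fa-f]+)\s+\(0x[0-9A-Fa-f]+\):\s+0x([0-9A-Fa-f]+)"
)


def pending_commands(devmem_output):
    """Return {physical address: value} for every word that is not EMAC_NONE."""
    pending = {}
    for addr, value in READ_LINE.findall(devmem_output):
        if int(value, 16) != EMAC_NONE:
            pending[int(addr, 16)] = int(value, 16)
    return pending
```

Applied to the dump above, this would report only the word at 0x0B0005AC, which is the one that stayed at 0xFFBB0000.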

jan-kiszka commented 1 year ago

I'm trying my contacts at TI. Maybe we will get some further hints on how best to analyze this.

attila-hannibal commented 1 year ago

Some further test results: SW level: jan/kernel-update, this determines the TI pruss firmware: 08.06.00.001

Test scenarios:

  1. eno1: dhcp, eno2: static, UTP cables connected at startup -> none of the interfaces are working
  2. eno1: dhcp, eno2: static, UTP cables not connected at startup, after login connect eno1, then eno2 -> eno1 does not work, eno2 works
  3. eno1: dhcp, eno2: static, UTP cables not connected at startup, after login connect eno2, then eno1 -> eno2 works, eno1 does not work
  4. eno1: dhcp, eno2: static, only the eno2 UTP cable is connected at startup -> eno2 works
  5. eno1: static, eno2: dhcp, UTP cables connected at startup -> both interfaces are working
  6. eno1: unconfigured, eno2: static, UTP cables connected at startup, after login configure eno1 (command: "ifconfig eno1 0.0.0.0 0.0.0.0 && dhclient") -> both interfaces are working
  7. eno1: dhcp, eno2: static, UTP cables connected at startup + using TI pruss firmware: 08.02.00.002 -> none of the interfaces are working
  8. eno1: dhcp, eno2: static, UTP cables connected at startup + using TI pruss firmware: 08.00.00.004 -> none of the interfaces are working

I think we can narrow down the issue: when eno1 is already connected during boot and configured as dhcp, there seems to be some kind of deadlock that causes the issue. It seems only eno1 behaves this way (see scenario 5); also, when the interface is configured "late" (see scenario 6), it works. That may explain why our Linux image shows the non-persistent behaviour: something depends on the timing and/or the boot step sequence.

The 4 command registers for scenario 1 all seem fine:

root@iot2050-debian:~# memtool md 0xb0005ac
0b0005ac: ffff0000 ffff0000 ffff0000 ffff0000                ................
0b0005bc: 70020000 00000000 00000000 00000000                ...p............
0b0005cc: 00000000 00000000 00000000 00000000                ................
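The narrowing-down argument above can be checked mechanically by encoding the eight scenarios as data. This is a sketch only: the tuples are transcribed from the scenario list, None marks a result that was not reported, and the function name is made up for illustration.

```python
# Each tuple: (scenario number, eno1 configured as DHCP at boot, eno1 worked).
# Transcribed from the scenario list above; None means "not reported".
SCENARIOS = [
    (1, True, False),
    (2, True, False),
    (3, True, False),
    (4, True, None),   # only eno2 was cabled, eno1 result not stated
    (5, False, True),  # eno1 static
    (6, False, True),  # eno1 configured by hand after login
    (7, True, False),  # firmware 08.02.00.002
    (8, True, False),  # firmware 08.00.00.004
]


def eno1_failure_tracks_boot_dhcp(scenarios):
    """True if, in every reported scenario, eno1 fails exactly when it is DHCP at boot."""
    return all(
        eno1_ok == (not dhcp_at_boot)
        for _, dhcp_at_boot, eno1_ok in scenarios
        if eno1_ok is not None
    )


print(eno1_failure_tracks_boot_dhcp(SCENARIOS))
```

With these eight data points the correlation holds in every reported case, and the firmware version (scenarios 1, 7 and 8) makes no difference. Eight runs are a small sample, so this supports the hypothesis rather than proving it.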

applea9 commented 1 year ago

@attila-hannibal Have you tested the scenario where both eno1 and eno2 are static? I ran into an issue with this scenario before.

BaochengSu commented 1 year ago

@jan-kiszka I found that commit 4a80b17a317debfced45d2b44dbfb1008343e29b really eases the rebase pain a lot; however, it is irrelevant to this issue.

So I've just extracted it into a separate PR, #414.

BaochengSu commented 1 year ago

Hi @attila-hannibal,

We've tried these scenarios with the action build https://github.com/siemens/meta-iot2050/actions/runs/3948486240; however, none of them could be reproduced.

So there might be some subtle difference between our setup and yours that prevents us from reproducing the issue. It would be helpful to have the information below (some of it is just a double confirmation to make sure we are on the same page):

attila-hannibal commented 1 year ago

Hi @attila-hannibal,

We've tried these scenarios with the action build https://github.com/siemens/meta-iot2050/actions/runs/3948486240; however, none of them could be reproduced.

So there might be some subtle difference between our setup and yours that prevents us from reproducing the issue. It would be helpful to have the information below (some of it is just a double confirmation to make sure we are on the same page):

  • The reproduction rate for these scenarios. From your description I get the feeling that it is very easy to reproduce the issue within the scenarios, i.e. you don't have to perform lots of reboots to trigger it.
  • Which hardware version? We were trying on two PG2 advanced iot2050 devices, with 1YA2 and FS: 04.
  • The FW version in your setup. We've tried two different firmware versions: one is the released v01.03.01 firmware, the other is the firmware built by the above action.
  • Is there any additional peripheral such as a PCIe card, DP monitor, etc. in your setup? A picture of your device setup would be helpful. We only have the USB SD card reader and the FTDI UART cable connected in our setup.
  • What are the network segments for eno1 and eno2? We are using two different segments for static (192.168.200.0/24) and dhcp (192.168.1.0/24).
  • Are there any other images on the eMMC or in the SD card slot? This is just to avoid accidentally booting another image. You can check the current boot target via fw_printenv | grep boot_targets, or you can check the image build ID in the /etc/os-release file to make sure the currently booted image has build ID fd691b7.
  • For the static profile that is directly connected to the PC, what kind of Ethernet port are you using on the PC side: the PC's native port or some USB network adapter? We've tried both.
  • How do you define "not working" in this issue? Is ping failing, is the DHCP IP not showing up, or is the static IP lost? We were using the ping command to determine working/non-working.
  • What is your DHCP environment? We assume it is your company's DHCP?
  • Which network tool are you using? We are using nmtui, and we think you are using /etc/network?
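The build ID check mentioned in the list can be scripted so that every test report states which image was actually booted. A minimal sketch, with assumptions: the key name BUILD_ID is the standard os-release field and may not be the one this image uses, the PRETTY_NAME line in the sample is made up, and the function names are illustrative.

```python
import shlex


def parse_os_release(text):
    """Parse os-release style KEY=VALUE lines into a dict, dropping shell quoting."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)
        fields[key] = parts[0] if parts else ""
    return fields


def image_matches(text, expected_build_id, key="BUILD_ID"):
    """True if the given key starts with the expected build ID (key name is an assumption)."""
    return parse_os_release(text).get(key, "").startswith(expected_build_id)


# Hypothetical file content for illustration; only fd691b7 comes from this thread.
sample = 'PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"\nBUILD_ID="fd691b7"\n'
print(image_matches(sample, "fd691b7"))
```

On a device, the text would come from reading /etc/os-release; a mismatch would point to a stale image, which is the kind of mix-up described earlier in this thread.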

Hello @BaochengSu

I used iot2050-example-image.zip from the build you linked as the wic image and copied it to a 64 GB USB3 stick. I also replaced the SPI flash content with iot2050-pg2-image-boot.bin from the build, to have the same FW/SW as you. The content of the /etc/network/interfaces file is the same as before:

root@iot2050-debian:~# cat /etc/network/interfaces
source /etc/network/interfaces.d/*

auto eno1
iface eno1 inet dhcp

auto eno2
iface eno2 inet static
    address 192.168.214.230
    netmask 255.255.255.0

(No other change has been made to your reference image.) Answering your questions one by one:

  1. The networking issue is persistent now. I tried 10 restarts and all 10 were faulty. The failure is explained in point 8.

  2. I'm attaching a picture of the sticker on the machine. We use the same HW variant. (attached: sticker)

  3. I now use the one from the build artifacts.

  4. Nothing special, I'm attaching a picture. I use the pendrive as boot source / rootfs. The eMMC is erased. I have the 2 UTP cables connected to the networking ports, plus the debug UART and the 24V power supply. (attached: 20230202_142844)

  5. The eno1 (DHCP) connection should obtain an IP address from the network range 10.23.0.** , but it does not work, see 8. eno2 (static) uses 192.168.214., see the /etc/network/interfaces file above.

  6. There is no SD card at all, the eMMC is erased, and the system boots from the USB stick. I'm attaching the os-release file content: release.txt. The build is correct!

  7. The static connection uses a U-Green USB-Ethernet adapter that has an "AX88179 Gigabit Ethernet" chip on it.

  8. "The problem", which always happens when I use your latest image(s): the eno1 DHCP connection does not get any IP address, nothing at all. Using the eno2 port, the iot2050 cannot ping my laptop (laptop IP: 192.168.214.1). I executed ping on both machines; tcpdump on my machine shows no ARP response:

     14:43:48.621038 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28
     14:43:49.643279 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28
     14:43:50.668756 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28
     14:43:51.690955 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28
     14:43:52.715215 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28

     tcpdump on the iot2050 is attached in the console log: debug_console.txt. On the iot2050 it seems the ARP responses are sent, but they are not visible on the other side. After a while a kernel dump happens and I get "icssg-prueth icssg0-eth eno2: xmit timeout" periodically. NOTE: as written before, if eno1 and eno2 are swapped in the /etc/network/interfaces file (and of course I swap the UTP cables), the problem does not appear. So please reproduce exactly the same network setup as I have.

  9. The DHCP is provided by the company I work for. What information do you need? See the note in point 8: if I swap eno1 and eno2, DHCP works. Also, there are about 40 machines on this network that work flawlessly, so I think nothing is wrong with the DHCP environment.

  10. I use systemd's built-in. See the timeout in the debug logs (A start job is running for Raise network interfaces...)
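The unanswered ARP requests from point 8 can be detected automatically in a saved capture. This is a minimal sketch, assuming tcpdump's usual text format for ARP ("Request who-has X tell Y" and "Reply X is-at MAC"); the function name and the MAC address in the usage example are illustrative.

```python
import re

REQUEST = re.compile(r"ARP, Request who-has (\S+) tell (\S+),")
REPLY = re.compile(r"ARP, Reply (\S+) is-at")


def unanswered_arp_targets(capture_text):
    """Return {target IP: request count} for targets that never replied in the capture."""
    requests = {}
    answered = set()
    for line in capture_text.splitlines():
        request = REQUEST.search(line)
        if request:
            target = request.group(1)
            requests[target] = requests.get(target, 0) + 1
            continue
        reply = REPLY.search(line)
        if reply:
            answered.add(reply.group(1))
    return {ip: count for ip, count in requests.items() if ip not in answered}


# Two of the request lines quoted in point 8.
capture = (
    "14:43:48.621038 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28\n"
    "14:43:49.643279 ARP, Request who-has 192.168.214.230 tell 192.168.214.1, length 28\n"
)
print(unanswered_arp_targets(capture))
```

Running the same check on captures from both ends of the cable would show whether the replies are missing on the sender side or are lost on the way.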

Best regards Attila

jan-kiszka commented 1 year ago

Wait, I missed that so far:

content of the /etc/network/interfaces files is the same as before:

root@iot2050-debian:~# cat /etc/network/interfaces
source /etc/network/interfaces.d/*

auto eno1
iface eno1 inet dhcp

auto eno2
iface eno2 inet static
    address 192.168.214.230
    netmask 255.255.255.0

We are using NetworkManager in the default image. I'm not sure if /etc/network/interfaces.d is properly evaluated at all. And even if: there is also /etc/NetworkManager/system-connections/eno1-default.
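A quick way to spot this kind of double configuration is to compare the interfaces named in the ifupdown file with those named in NetworkManager connection profiles. The sketch below makes several assumptions: the profiles are in keyfile format with interface-name in the [connection] section, the sample profile content is invented (the real eno1-default file is not shown in this thread), and all function names are illustrative.

```python
import configparser
import re


def ifupdown_interfaces(interfaces_text):
    """Interface names with an `iface` stanza in an ifupdown file (loopback excluded)."""
    names = re.findall(r"^\s*iface\s+(\S+)", interfaces_text, flags=re.MULTILINE)
    return {name for name in names if name != "lo"}


def keyfile_interface(keyfile_text):
    """Return interface-name from the [connection] section of a keyfile, if set."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(keyfile_text)
    return parser.get("connection", "interface-name", fallback=None)


def doubly_configured(interfaces_text, keyfile_texts):
    """Interfaces declared both for ifupdown and in a NetworkManager profile."""
    nm_names = {keyfile_interface(text) for text in keyfile_texts}
    return sorted(ifupdown_interfaces(interfaces_text) & nm_names)


interfaces = (
    "source /etc/network/interfaces.d/*\n"
    "auto eno1\niface eno1 inet dhcp\n"
    "auto eno2\niface eno2 inet static\n"
    "    address 192.168.214.230\n    netmask 255.255.255.0\n"
)
# Hypothetical profile content, for illustration only.
eno1_default = "[connection]\nid=eno1-default\ninterface-name=eno1\n\n[ipv4]\nmethod=auto\n"
print(doubly_configured(interfaces, [eno1_default]))
```

An overlap does not by itself prove a conflict, but it is a cheap hint that two tools may be trying to bring up the same port.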

attila-hannibal commented 1 year ago

Hello @jan-kiszka

OK, I confirm that the "persistent network issue" is solved by using the correct network manager. I cleaned the /etc/network/interfaces file and configured the connection files in /etc/NetworkManager/system-connections/ via the nmtui tool, and the system works as expected.

Now we have to go back to our original problem, where the network loss happened randomly; when it happens, we have "icssg-prueth icssg0-eth eno2: timeout waiting for command done" in the dmesg log.

jan-kiszka commented 1 year ago

Great to hear. Hope the image can now help validate whether that issue is gone as well. Please let us know when there is news or there are further findings/questions.

lyxsiemens commented 1 year ago

After several power cycles the eno1 couldn't get IP address

@attila-hannibal Could you tell me how you did the power cycle? By the "reboot" command, the reset button, or a power cut?

jan-kiszka commented 1 year ago

@BaochengSu, @AsuraZeng, can we proceed with the MR? It addresses what #374 was fixing, avoids related regressions, and aligns with the latest ti-linux (now 08.06.004, as recommended by TI). It may just not resolve all issues of the prueth, but that should be shared with ti-linux at this point.

BaochengSu commented 1 year ago

Given that TI is about to release the 8.6 SDK soon, I am OK to proceed with the 8.6 catch-up.