siemens / meta-iot2050

SIMATIC IOT2050 Isar/Debian Board Support Package

Reboot problem #440

Open rFond opened 1 year ago

rFond commented 1 year ago

Hi,

When the IOT has been running for a very long time (60 days) and I trigger a reboot with the command sudo reboot, the IOT can't reboot.

The software reboot fails and I am forced to do a hard reset by cutting the power. `======================= Board: IOT2050-ADVANCED Serial: xx MLFB: 6ES7647-0BA00-1YA2 UUID: 8069DFE5E3D14C05B9B8D6C4CAC1CEF0 A5E: A5E452229880AB08 MAC[0]: xx MAC[1]: xx SKU: SE Loading PK... ok PK count: 00 PK version: 00 SV: 00-00 Security ID 0x5125e53f-0x842e215d Security policy: soft Loading image atf... Loading image tee... Loading image spl... Loading image k3-am65-iot2050-spl.dtb... NOTICE: BL31: v2.5(release): NOTICE: BL31: Built : 14:41:24, May 17 2021 I/TC: I/TC: OP-TEE version: 3.12.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Wed Jan 20 17:48:48 UTC 2021 aarch64 I/TC: Primary CPU initializing I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2021.04-V01.02.01.03-0-g7e29ca7 (Nov 04 2021 - 07:46:39 +0000) SYSFW ABI: 3.1 (firmware rev 0x0015 '21.5.0--v2021.05 (Terrific Llam') Trying to boot from SPI QSPI: QSPI is still busy after poll for 10000 times. SF: Calibration failed (read) cadence_spi spi@47040000: Cannot set speed (err=-5) SPI probe failed. SPL: failed to boot from all boot devices

ERROR ### Please RESET the board

`

Thanks for your help.

rFond commented 1 year ago

Full log: bugstart_siemens.txt

jan-kiszka commented 1 year ago

Is the error persistent, or does it go away after, e.g., a cold reset?

rFond commented 1 year ago

After a cold reset (with the button or by cutting the power), the error is gone.

Log after the reset:

Board: IOT2050-ADVANCED Serial: NNEM7528 MLFB: 6ES7647-0BA00-1YA2 UUID: 8069DFE5E3D14C05B9B8D6C4CAC1CEF0 A5E: A5E452229880AB08 MAC[0]: xx MAC[1]: xx SKU: SE Loading PK... ok PK count: 00 PK version: 00 SV: 00-00 Security ID 0x5125e53f-0x842e215d Security policy: soft Loading image atf... Loading image tee... Loading image spl... Loading image k3-am65-iot2050-spl.dtb... NOTICE: BL31: v2.5(release): NOTICE: BL31: Built : 14:41:24, May 17 2021 I/TC: I/TC: OP-TEE version: 3.12.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Wed Jan 20 17:48:48 UTC 2021 aarch64 I/TC: Primary CPU initializing I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2021.04-V01.02.01.03-0-g7e29ca7 (Nov 04 2021 - 07:46:39 +0000) SYSFW ABI: 3.1 (firmware rev 0x0015 '21.5.0--v2021.05 (Terrific Llam') Trying to boot from SPI

U-Boot 2021.04-V01.02.01.03-0-g7e29ca7 (Nov 04 2021 - 07:46:39 +0000)

Model: SIMATIC IOT2050 Advanced DRAM: 2 GiB WDT: Not starting MMC: sdhci@4f80000: 1, sdhci@4fa0000: 0 Loading Environment from SPIFlash... SF: Detected w25q128 with page size 256 Bytes, erase size 64 KiB, total 16 MiB OK In: serial Out: serial Err: serial Net: No ethernet found. Hit any key to stop autoboot: 0 IOT2050> run bootcmd_mmc1 switch to partitions #0, OK mmc1(part 0) is current device Scanning mmc 1:1... Found U-Boot script /boot/boot.scr 732 bytes read in 9 ms (79.1 KiB/s)

Executing script at 80000000

Loading /usr/lib/linux-image-5.10.162-cip24/ti/k3-am6548-iot2050-advanced.dtb... 49996 bytes read in 13 ms (3.7 MiB/s) Loading /boot/vmlinux-5.10.162-cip24... 18770432 bytes read in 205 ms (87.3 MiB/s)

Flattened Device Tree blob at 88000000

Booting using the fdt blob at 0x88000000 Loading Device Tree to 000000008fff0000, end 000000008ffff34b ... OK

Starting kernel ...

jan-kiszka commented 1 year ago

Ok, at least no bricked devices, still ugly. Seen this once so far only, or is this reproducible? And then really only after 60 days?

@BaochengSu are we aware of any such issue already? Looks to me like imperfect re-initialization of the QSPI controller in U-Boot.

BaochengSu commented 1 year ago

QA reported such an issue a long time ago, but it is hard to reproduce and therefore hard to debug... Very likely, as you suggested, something goes wrong in the U-Boot driver.

rFond commented 1 year ago

Yes, it's a little random, but I noticed that we get this problem once uptime exceeds 40-60 days. It bothers me because I sometimes trigger reboots on IOTs that I manage remotely (VPN or other). If the IOT does not restart, I have to call a technician, so I try to avoid running reboots.

I can send you my IOT image.

jan-kiszka commented 1 year ago

@rFond are you already on the latest firmware version we released?

BaochengSu commented 1 year ago

@rFond And what is the reproduction rate on your side? If it's still low, then I don't think having your image will help.

Anyway, thanks for the report and your support. We will definitely look into this issue on our side.

rFond commented 1 year ago

@jan-kiszka Not yet. I can test it, no problem. I have a development/test bench.

@BaochengSu It happens on reboot across my whole fleet (50 devices), on devices with more than 40-60 days of operating time. The problem is old: I already had it with the first version (IOT on Debian 10).

BaochengSu commented 1 year ago

@rFond This issue has nothing to do with the image, only with the firmware version. The firmware version you are currently running is rather old; the latest version is 1.3.1.x. You can download the firmware update package from the SIOS page.

Also, if possible, you can use our latest master to build the latest firmware and see whether the issue still exists. (Please just ignore the error output in OP-TEE regarding the RPMB; that is a known issue I am working on now, and it should not block you from using it.)

rFond commented 1 year ago

@BaochengSu Yes, no problem. For the moment I can't download the file:

You cannot download this file, since your registration inquiry is still being processed.

Please note: due to the manual review, the registration may take several days. You will get a confirmation e-mail as soon as your download registration is finalized. You can download the software packages with export restrictions only when you have received the positive confirmation.

Thanks for your help.

BaochengSu commented 1 year ago

@bergmanu, can you comment on this?

rFond commented 1 year ago

@BaochengSu It's OK, I have upgraded the firmware on my test bench. I will wait and get back to you later.

jan-kiszka commented 11 months ago

Any updates?

rFond commented 11 months ago

Hi, not yet. My test bench was disconnected from power, so I couldn't run my test :'(

rFond commented 9 months ago

Hi, I'm back with a result:

SIMATIC IOT2050 SE-Boot Version: D01.02.02.08-0-gf12580b7-0x0000 BuildDate: 20220728 SYSFW ABI: 3.1 [version: 21] [21.9.1--v2021.09a (Terrific Lla]

Board: IOT2050-ADVANCED Serial: NNEM7528 MLFB: 6ES7647-0BA00-1YA2 UUID: 8069DFE5E3D14C05B9B8D6C4CAC1CEF0 A5E: A5E452229880AB08 MAC[0]: 8c-f3-19-4c-cd-a6 MAC[1]: 8c-f3-19-4c-cd-a5 SKU: SE Loading PK... ok PK count: 00 PK version: 00 SV: 00-00 Security ID 0x5125e53f-0x842e215d Security policy: none No options found, skip. Validating FIT... Loading image atf... Loading image tee... Loading image spl... Loading image k3-am65-iot2050-spl.dtb... NOTICE: BL31: v2.6(release): NOTICE: BL31: Built : 08:03:39, Oct 20 2022 I/TC: I/TC: OP-TEE version: 3.16.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Thu Oct 20 08:03:39 UTC 2022 aarch64 I/TC: WARNING: This OP-TEE configuration might be insecure! I/TC: WARNING: Please check https://optee.readthedocs.io/en/latest/architecture/porting_guidelines.html I/TC: Primary CPU initializing I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2022.01-V01.03.01.01-0-gffc3caf (Oct 20 2022 - 08:08:40 +0000) SYSFW ABI: 3.1 (firmware rev 0x0015 '21.9.1--v2021.09a (Terrific Lla') Trying to boot from SPI QSPI: QSPI is still busy after poll for 10000 times. SPI probe failed. SPL: failed to boot from all boot devices

ERROR ### Please RESET the board

The problem persists with the new firmware V1.3.1. I left the IOT running for 60 days without rebooting.

rFond commented 9 months ago

After a hard reset:

SIMATIC IOT2050 SE-Boot Version: D01.02.02.08-0-gf12580b7-0x0000 BuildDate: 20220728 SYSFW ABI: 3.1 [version: 21] [21.9.1--v2021.09a (Terrific Lla]

Board: IOT2050-ADVANCED Serial: NNEM7528 MLFB: 6ES7647-0BA00-1YA2 UUID: 8069DFE5E3D14C05B9B8D6C4CAC1CEF0 A5E: A5E452229880AB08 MAC[0]: 8c-f3-19-4c-cd-a6 MAC[1]: 8c-f3-19-4c-cd-a5 SKU: SE Loading PK... ok PK count: 00 PK version: 00 SV: 00-00 Security ID 0x5125e53f-0x842e215d Security policy: none No options found, skip. Validating FIT... Loading image atf... Loading image tee... Loading image spl... Loading image k3-am65-iot2050-spl.dtb... NOTICE: BL31: v2.6(release): NOTICE: BL31: Built : 08:03:39, Oct 20 2022 I/TC: I/TC: OP-TEE version: 3.16.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Thu Oct 20 08:03:39 UTC 2022 aarch64 I/TC: WARNING: This OP-TEE configuration might be insecure! I/TC: WARNING: Please check https://optee.readthedocs.io/en/latest/architecture/porting_guidelines.html I/TC: Primary CPU initializing I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2022.01-V01.03.01.01-0-gffc3caf (Oct 20 2022 - 08:08:40 +0000) SYSFW ABI: 3.1 (firmware rev 0x0015 '21.9.1--v2021.09a (Terrific Lla') Trying to boot from SPI

U-Boot 2022.01-V01.03.01.01-0-gffc3caf (Oct 20 2022 - 08:08:40 +0000)

Model: SIMATIC IOT2050 Advanced DRAM: 2 GiB WDT: Not starting watchdog@40610000 MMC: mmc@4f80000: 1, mmc@4fa0000: 0 Loading Environment from SPIFlash... SF: Detected w25q128 with page size 256 Bytes, erase size 64 KiB, total 16 MiB OK In: serial Out: serial Err: serial Net: No ethernet found. Hit any key to stop autoboot: 0

jan-kiszka commented 9 months ago

Just to keep you informed: this issue was seen by a second user as well, unfortunately with the same extremely long reproduction times.

adannenb-ti commented 9 months ago

@jan-kiszka, all: for cross-reference, this is now also being discussed at https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1274143/am6548-u-boot-spl-could-not-load-u-boot-proper-qspi-qspi-is-still-busy-after-poll-for-10000-times

kogi84 commented 8 months ago

Hello, we have the same error on many IOTs at various customers, which is really bad for us and our customers! I didn't expect an "### ERROR ### Please RESET the board ###" issue from a Siemens Simatic device!

[86383.521081] reboot: Restarting system

SIMATIC IOT2050 SE-Boot Version: V01.02.01-0-g4524e967-0x0000 BuildDate: 20211221 SYSFW ABI: 3.1 [version: 21] [21.5.0--v2021.05 (Terrific Llam] AVS@[1100 1150 1150]

Board: IOT2050-ADVANCED-PG2 Serial: PODC7888 MLFB: 6ES7647-0BA00-1YA2 UUID: 1A61ACB0F3DE45EAB28B6D4B482EACB2 A5E: A5E520802600AA01 MAC[0]: 8c-f3-19-c5-30-7b MAC[1]: 8c-f3-19-c5-30-7a SKU: SE Loading PK... ok PK count: 00 PK version: 00 SV: 00-00 Security ID 0xc8beb00d-0x7f55c02e Security policy: soft Loading image atf... Loading image tee... Loading image spl... Loading image k3-am65-iot2050-spl.dtb... NOTICE: BL31: v2.5(release): NOTICE: BL31: Built : 14:41:24, May 17 2021 I/TC: I/TC: OP-TEE version: 3.12.0 (gcc version 10.2.1 20210110 (Debian 10.2.1-6)) #1 Wed Jan 20 17:48:48 UTC 2021 aarch64 I/TC: Primary CPU initializing I/TC: Primary CPU switching to normal world boot

U-Boot SPL 2021.04-V01.02.01-0-g40d3fc0 (Jan 05 2022 - 14:11:27 +0000) SYSFW ABI: 3.1 (firmware rev 0x0015 '21.5.0--v2021.05 (Terrific Llam') Trying to boot from SPI QSPI: QSPI is still busy after poll for 10000 times. SF: Calibration failed (read) cadence_spi spi@47040000: Cannot set speed (err=-5) SPI probe failed. SPL: failed to boot from all boot devices

ERROR ### Please RESET the board

jan-kiszka commented 8 months ago

@kogi84 We are hunting that issue, but it is a very shy beast. So far, we are only aware of cases that happened after months(!) of uptime. Therefore, we are very interested in learning more details about the circumstances under which it happened to you. Eventually, we will have to find a way to reproduce it much more quickly.

And if it turns out that uptime is a precondition, it could help to have sufficiently frequent scheduled reboots - provided your setup permits those.
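For collecting such field data, a saved serial console capture can be scanned for the failure signature seen in this thread. The following is only a minimal Python sketch, assuming the console output was stored as plain text; the helper name `classify_boot` and the returned labels are hypothetical, and the two marker strings are copied from the logs posted above.

```python
# Marker strings copied from the U-Boot SPL output posted in this issue.
QSPI_BUSY = "QSPI is still busy after poll"
SPL_FAILED = "SPL: failed to boot from all boot devices"


def classify_boot(console_text: str) -> str:
    """Return a coarse label for one captured boot attempt (hypothetical helper)."""
    if QSPI_BUSY in console_text and SPL_FAILED in console_text:
        # Both markers present: the failure pattern reported in this thread.
        return "qspi-busy-failure"
    if SPL_FAILED in console_text:
        # SPL gave up, but for some other reason.
        return "other-spl-failure"
    return "no-spl-failure"
```

Running this over each captured boot, together with the uptime at the time of the reboot, would give a simple per-device record of when the failure appears.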

kogi84 commented 8 months ago

I have also posted the issue on the Siemens IOT2050 support forum: https://support.industry.siemens.com/forum/de/en/posts/iot2050-stat-led-stops-blinking-os-hangs-up/304828/?page=0&pageSize=10#!#post1172396. The IOTs are powered 24/7 and we do periodic reboots, but it is the rebooting that causes the error. And yes, uptime seems to be relevant: in our cases the error usually only occurs after about 50 days of operation. We have many IOTs in use at various customers that have this error, which is really bad for us and our customers!

jan-kiszka commented 8 months ago

OK, that 50 days is - unfortunately - in line with what we know so far. For that reason, another affected user is now going for a nightly scheduled reboot.

jan-kiszka commented 8 months ago

That said: have you ever experienced a case where the uptime was definitely much shorter than 50 days?

kogi84 commented 8 months ago

The shortest time was 27 days, but we don't think that nightly scheduled reboots solve the problem, because we reboot the IOTs when the machines are not in use, and this should occur multiple times within that 27-50 day timespan. Are there reliable results confirming that nightly scheduled reboots help the other affected user get around the problem?

jan-kiszka commented 8 months ago

We do not yet have a reliable workaround recommendation, still collecting data, in the lab as well as in the field.

kogi84 commented 8 months ago

OK, thank you. I will try to collect some more data for you, and as a workaround we will do the nightly scheduled reboots.

kogi84 commented 8 months ago

Hello, I had a closer look at our logs and I am sure that the nightly scheduled reboots don't solve the problem in our case, because our monitored IOT reboots after 86383 seconds of uptime, which is around 24 h. You can see it in my first post:

[86383.521081] reboot: Restarting system

It then runs into the "ERROR ### Please RESET the board". We have around 40 IOTs at various customers in the field running into this error after around 50 days, really bad!
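The uptime figure can be checked directly from the kernel timestamp: 86383 seconds is 23 h 59 min 43 s, i.e. just under one day since the previous boot. A small Python sketch of that conversion, assuming the usual `[seconds.microseconds]` prefix on kernel log lines; the function name `kernel_uptime` is hypothetical.

```python
from datetime import timedelta


def kernel_uptime(log_line: str) -> timedelta:
    """Parse the leading '[seconds.micros]' kernel timestamp of a log line."""
    # Take everything before the first ']' and drop the opening bracket/padding.
    stamp = log_line.split("]", 1)[0].lstrip("[ ")
    return timedelta(seconds=float(stamp))


uptime = kernel_uptime("[86383.521081] reboot: Restarting system")
print(uptime < timedelta(days=1))  # True: the device had been up less than 24 h
```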

kogi84 commented 6 months ago

Hello, is there any progress on this issue or a resolution for the problem?

rFond commented 4 months ago

Hello, I'm getting back to you regarding this issue. From what I understand, the issue is hardware-related? So, do we need to occasionally reboot the IoT device via software to resolve this problem? Thank you

huaqianli commented 4 months ago

Hi,

Occasionally rebooting the IoT device may not resolve this issue, as it occurred once during our earlier daily-reboot testing.

Also, after 55 days of daily-reboot testing, I reproduced a similar issue on January 3, 2024, on a version shipped with the following patch:

[image: screenshot of the patch]

The captured debug log shows that the error occurred when calling cadence_qspi_apb_command_read with "CQSPI_REG_CONFIG: 80083881" for the first time, as follows:

abnormal register: 
  Source: cadence_qspi_apb_command_read
   CQSPI_REG_CONFIG:              80083881

One oddity: in the error case bit 31 should be 0, because it is CQSPI_REG_CONFIG_IDLE_LSB, and that is the condition CQSPI_REG_IS_IDLE tests for. Yet the captured value has bit 31 set. Beyond that, we can't get more information from this case.
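To make the observation concrete, the captured register value can be decoded by hand. This is only an illustrative Python sketch: the bit position 31 for the idle flag is taken from the comment above, the helper `is_idle` is a hypothetical stand-in for what the driver's CQSPI_REG_IS_IDLE check is described as doing, and no other fields of the register are interpreted.

```python
# Bit position of the idle flag, as named in the comment above.
CQSPI_REG_CONFIG_IDLE_LSB = 31


def is_idle(config_value: int) -> bool:
    """Hypothetical stand-in for the idle test: is bit 31 of CONFIG set?"""
    return bool((config_value >> CQSPI_REG_CONFIG_IDLE_LSB) & 0x1)


captured = 0x80083881  # CQSPI_REG_CONFIG value from the debug log
print(is_idle(captured))  # True: the controller reports idle in this snapshot
```

So the logged snapshot shows the idle bit set even though the poll had reported the controller as busy, which is the inconsistency noted above.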

However, this patch had redundant modifications; the latency was reduced a lot, from 5 seconds to 5 milliseconds. This makes it different from the original issue.

I fixed the patch, added more debugging information, and started reproducing on January 16th, but it hasn't been reproduced yet.

cpardotortosa commented 1 month ago

Any progress?

pscom038 commented 1 week ago

I use IOT2050 firmware V1.3.1 on eMMC and found the same problem after a reset. The device had been running for about 180 days and was reset via the command line over VPN.

Now, is there a way to solve this problem?