Closed: nearly-big-endian closed this issue 1 year ago
Hi @nearly-big-endian, thanks for reporting this.
We'll attempt to reproduce this and then diagnose what is going on. I suspect a small race condition in how the HSS restarts the individual harts when the watchdog trips.
Hi @nearly-big-endian, we have a patch under review to address this issue for you.
This is great news, thanks. I am looking forward to trying this fix.
Hi @nearly-big-endian
For now, the following workaround patch will restore the behavior you want. A clean fix (which also supports AMP correctly) will be added to the next official HSS release, but we didn't want to hold you up in the interim.
diff --git a/services/wdog/wdog_service.c b/services/wdog/wdog_service.c
index cafff21..98327bc 100644
--- a/services/wdog/wdog_service.c
+++ b/services/wdog/wdog_service.c
@@ -153,6 +153,9 @@ static void wdog_monitoring_handler(struct StateMachine * const pMyMachine)
     status &= hartBitmask.uint;
     if (status) {
+#if IS_ENABLED(CONFIG_ALLOW_COLDREBOOT_ALWAYS)
+        HSS_Wdog_Reboot(HSS_HART_ALL);
+#else
         // watchdog timer has triggered for a monitored hart..
         mHSS_DEBUG_PRINTF(LOG_ERROR, "Watchdog has triggered - %02x\n", status);
@@ -176,6 +179,7 @@ static void wdog_monitoring_handler(struct StateMachine * const pMyMachine)
                 HSS_Boot_RestartCore(HSS_HART_U54_4);
                 wdogInitTime[HSS_HART_U54_4] = HSS_GetTime();
             }
+#endif
 #endif
     }
 }
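For readers who want to see the control flow the workaround introduces, here is a minimal, self-contained sketch. It is not HSS code: `handle_watchdog`, `reboot_all_harts`, `restart_hart`, and the `HART_*` constants are hypothetical stand-ins, the "bit n means hart n" status encoding is an assumption, and a runtime flag replaces the compile-time `CONFIG_ALLOW_COLDREBOOT_ALWAYS` check so that both paths can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hart indices; illustrative only, not taken from HSS headers. */
enum { HART_E51 = 0, HART_U54_1, HART_U54_2, HART_U54_3, HART_U54_4, HART_COUNT };

static unsigned cold_reboots;           /* number of simulated full reboots */
static unsigned restarts[HART_COUNT];   /* number of simulated per-hart restarts */

/* Stand-in for a full cold reboot of every hart. */
static void reboot_all_harts(void)
{
    cold_reboots++;
}

/* Stand-in for restarting a single hart. */
static void restart_hart(unsigned hart)
{
    assert(hart < HART_COUNT);
    restarts[hart]++;
}

/*
 * status:            assumed encoding, bit n set means hart n's watchdog fired
 * monitored_mask:    harts whose watchdogs are being monitored
 * coldreboot_always: runtime stand-in for the compile-time config option
 */
static void handle_watchdog(uint32_t status, uint32_t monitored_mask,
                            bool coldreboot_always)
{
    status &= monitored_mask;
    if (!status) {
        return;                 /* nothing fired for a monitored hart */
    }

    if (coldreboot_always) {
        reboot_all_harts();     /* workaround path: reboot everything */
        return;
    }

    /* Default path: restart only the harts whose watchdog fired. */
    for (unsigned hart = HART_U54_1; hart < HART_COUNT; hart++) {
        if (status & (1u << hart)) {
            restart_hart(hart);
        }
    }
}
```

With the flag set, any watchdog event on a monitored hart leads to a single full reboot; with it clear, only the affected harts are restarted, which is the path the patch leaves untouched.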
Hi,
I am very pleased to report that we have tested this patch and that it fixes the issue we were having perfectly.
Thanks very much for the quick response (and correction).
Hi,
We rely on the watchdog mechanism to reboot the board if the system (Linux in our case) freezes.
This worked as expected with HSS 0.99.26 (together with Reference Design 2021.11 and PolarFire Yocto BSP 2021.11) on an Icicle Kit: the board restarted successfully on watchdog timeout when the watchdog was not refreshed.
However, since upgrading to HSS 0.99.33 (Reference Design 2022.09, Yocto BSP 2022.09), we have noticed that the HSS fails to restart the board on the first watchdog timeout signal.
It eventually restarts the board, but only after receiving 9 more watchdog timeout signals (as the log below shows). With the watchdog runtime duration set to 30 seconds and the watchdog not refreshed, the board is therefore rebooted only after about 300 seconds instead of 30 seconds (that is, after 10 watchdog timeout signals).
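The delay figures above follow from simple multiplication. As a small sketch (the function name `effective_reboot_delay_s` is hypothetical and only illustrates the arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds until the board actually reboots when the watchdog has to fire
 * `timeouts_needed` times, each after `period_s` seconds without a refresh.
 */
static uint32_t effective_reboot_delay_s(uint32_t period_s, uint32_t timeouts_needed)
{
    /* Guard against 32-bit overflow in the product. */
    assert(timeouts_needed == 0 || period_s <= UINT32_MAX / timeouts_needed);
    return period_s * timeouts_needed;
}
```

With a 30-second period, one required timeout gives the expected 30 seconds, while ten give the roughly 300 seconds observed.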
Could this be a regression in the HSS?
Here is a set of Linux commands that can be used to reproduce the issue with the devmem2 command-line tool. They set up U54 watchdog 1 with the maximum runtime duration (about 30 seconds) and let it expire. The expected outcome is that the board reboots immediately on watchdog timeout.
Below is the HSS log, starting from the first watchdog timeout trigger event. Notice the line 'Watchdog has triggered - 10' appearing 10 times, roughly every 30 seconds. The system finally restarts at the [419.13129] time mark, after 9 failed attempts.
Thanks for your input.