double check that watchdog is still watching during reboot

victronenergy / venus

Victron Energy Unix/Linux OS

https://github.com/victronenergy/venus/wiki


double check that watchdog is still watching during reboot #57

Closed mpvader closed 7 years ago

mpvader commented 8 years ago

On the currently released firmware on the CCGX (= Danny), special care has been taken to keep the watchdog armed during a reboot, so that the device cannot get stuck while rebooting.

See watchdog.bbappend in meta-victronenergy for details.

While trying out the beaglebone enhanced (bbe), I found an issue that made the bbe get stuck while unloading the wifi driver during a reboot. That issue has since been solved.

So the task of this issue is to verify that the watchdog is still active during the reboot, and if it is not (which I think is the case), to fix that.

mpvader commented 7 years ago

Recently, the watchdog recipe was moved from meta-ccgx (gitlab) to meta-victronenergy. I've updated the link in the text above.

mpvader commented 7 years ago

Note two related issues: arm the watchdog already from u-boot, to catch sporadic boot-up issues. See #73 and #74.

mansr commented 7 years ago

The beaglebone uses the same startup/shutdown scripts as the ccgx, with a runlevel 6 (reboot) entry that stops the watchdog daemon. A possible issue with this is that several shutdown scripts run before that entry, and if one of those gets stuck, the watchdog is of no help, because the daemon is still running and keeps feeding it. A better solution might be to set the CONFIG_WATCHDOG_NOWAYOUT kernel option. With this enabled, the hardware watchdog keeps counting even after the daemon process exits. Stopping the daemon normally in runlevel 6 should then ensure that the hw watchdog times out if anything later gets stuck.
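To make the reasoning above concrete, here is a minimal toy model in Python. It is not a driver interface: `ToyWatchdog`, `reboot_gets_unstuck` and the 60 s / 300 s figures are hypothetical names and values chosen for illustration. It assumes that without nowayout a clean daemon exit disarms the timer, and that with nowayout the timer keeps running.

```python
class ToyWatchdog:
    """Toy model of a hardware watchdog timer (illustration only)."""

    def __init__(self, timeout_s, nowayout):
        self.timeout_s = timeout_s
        self.nowayout = nowayout
        self.armed = True
        self.remaining_s = timeout_s

    def pet(self):
        # The watchdog daemon calls this periodically while it is alive.
        self.remaining_s = self.timeout_s

    def daemon_exit(self):
        # Assumption: without nowayout, a clean daemon exit disarms the timer.
        if not self.nowayout:
            self.armed = False

    def tick(self, seconds):
        # Advance time; return True once an armed timer has expired (reset).
        if self.armed:
            self.remaining_s -= seconds
        return self.armed and self.remaining_s <= 0


def reboot_gets_unstuck(nowayout, hang_s=300, timeout_s=60):
    """Return True if the watchdog resets the board during a hung shutdown."""
    wd = ToyWatchdog(timeout_s, nowayout)
    wd.pet()
    wd.daemon_exit()        # runlevel 6 stops the watchdog daemon
    return wd.tick(hang_s)  # a later shutdown script hangs for hang_s seconds


print(reboot_gets_unstuck(nowayout=False))  # False: the board stays stuck
print(reboot_gets_unstuck(nowayout=True))   # True: the timer expires and resets
```

In this model, the only thing that changes between the two calls is whether the daemon's exit is allowed to disarm the timer.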

jhofstee commented 7 years ago

@mansr not sure I follow you completely. The ccgx does this: https://git.victronenergy.com/ccgx/meta-ccgx/blob/master/meta-venus/recipes-core/initscripts/sendsigs#L16

So on a ccgx the watchdog daemon is stopped rather than killed, and the hardware watchdog keeps counting. On a beaglebone the killall kills the watchdog daemon, and hence the hardware watchdog is disabled.

mpvader commented 7 years ago

To add a bit:

On Danny (ccgx) we use initscripts 1.1: https://git.victronenergy.com/ccgx/meta-ccgx/blob/master/meta-venus/recipes-core/initscripts_1.1.bb which is a version of the default initscripts that we modified ourselves.

On Jethro (beaglebone) we use initscripts 1.0, which is the OE default, and we have an append on that: https://github.com/victronenergy/meta-victronenergy/blob/master/meta-venus/recipes-core/initscripts/initscripts_%25.bbappend

(and sorry @mansr, when we spoke on the phone I didn't think of this difference)

mansr commented 7 years ago

Sorry, I missed the sendsigs changes. Anyhow, using the WATCHDOG_NOWAYOUT option would remove the need for all of this.
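As a rough way to check whether a given kernel build has the option set, the sketch below scans kernel-config-style text for `CONFIG_WATCHDOG_NOWAYOUT=y`. The helper names `option_enabled` and `read_running_config` are hypothetical, and reading `/proc/config.gz` assumes the kernel was built with `CONFIG_IKCONFIG_PROC`, which may not be the case on these images.

```python
import gzip
import re


def option_enabled(config_text, option="CONFIG_WATCHDOG_NOWAYOUT"):
    """Return True if `option` is set to y in kernel .config-style text."""
    pattern = r"^%s=y\s*$" % re.escape(option)
    return re.search(pattern, config_text, flags=re.MULTILINE) is not None


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (needs CONFIG_IKCONFIG_PROC)."""
    with gzip.open(path, "rt") as f:
        return f.read()
```

On a target that exposes its config, `option_enabled(read_running_config())` would report the setting; otherwise the text of the `.config` used for the build can be passed in directly.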

jhofstee commented 7 years ago

Agreed. The main reason we have the STOP and don't kill the watchdog is that the watchdog can still be gracefully stopped. Back when updates were done with opkg, the preinstall would shut down the watchdog service, and I wanted to be absolutely sure that this couldn't accidentally lead to watchdog resets during the opkg upgrade.

Since we have image updates now, this is no longer an issue. I tend to prefer doing it completely in userland though, since that doesn't depend on how linux is configured. Besides that, it feels a bit more comfortable if there is a way out if you really want one ;). Anyway, this issue is easily fixed, either by not killing the watchdog process or with WATCHDOG_NOWAYOUT.

jhofstee commented 7 years ago

See https://github.com/victronenergy/meta-victronenergy/commit/b2f91bdd33c49df2f39ca3a00b5e2be279a8ab58 for details: a third option, which needs neither linux nor initscript changes. It needs testing on a ccgx though.

jhofstee commented 7 years ago

For completeness: the watchdog on linux 3.7 behaves differently and gets disabled even if the file is closed without the magic token. So I enabled CONFIG_WATCHDOG_NOWAYOUT.
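For readers unfamiliar with the magic token: in the standard Linux watchdog device API, any write to the device counts as a keepalive, and writing the character `V` just before closing asks the driver to disarm the timer on close (which nowayout overrides). A minimal Python sketch of that sequence follows; `feed_then_close` is a hypothetical helper, and pointing it at a real `/dev/watchdog` arms the hardware, so it is best tried against an ordinary file first.

```python
def feed_then_close(device_path, feeds, magic_close):
    """Pet a watchdog device `feeds` times, then close it."""
    # buffering=0 so that every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        for _ in range(feeds):
            dev.write(b"\0")  # any write counts as a keepalive
        if magic_close:
            dev.write(b"V")   # magic character: request disarm on close
```

With `magic_close=False` the close resembles a daemon that was killed, which is the case where a driver is expected to leave the timer running.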

jhofstee commented 7 years ago

FYI, I just stumbled upon this: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/watchdog/omap_wdt.c?id=fb1cbeaeed0f41965ead2714bfc9c579188c6146 which likely explains the different behavior between linux 3.7 and 4.1.