Closed danjujan closed 1 month ago
Doesn't this make it more trigger-happy for all other arches? Can we maybe detect Pis and only lower it for them?
Yes, that might be a drawback. However, I believe srvos should have a default that works across all systems.
I'm not aware of any reliable way to detect Pis specifically. The only viable solution I see is to limit this change to the corresponding platforms, i.e. armv6-linux, armv7-linux, aarch64-linux. Would that be an option?
Otherwise we could use mkDefault to make it easy to change the 15s value.
maybe as a hardware module specific for Raspi in https://github.com/nix-community/srvos/tree/main/nixos/hardware ?
It should be documented, as it's a limitation of the bcm2835_wdt driver. I believe the limit is actually 16 seconds.
#define PM_WDOG_TIME_SET 0x000fffff /* 1048575 */
#define WDOG_TICKS_TO_MSECS(x) ((x) * 1000 >> 16)
.max_hw_heartbeat_ms = WDOG_TICKS_TO_MSECS(PM_WDOG_TIME_SET),
(1048575 * 1000) >> 16 = 15999 ms, i.e. just under 16 s
Also, I have had to disable the watchdog on the Pi for NixOS activations that involve many systemd restarts: with slow SD card I/O, the system resets before the activation completes.
Rethinking this. If you set the timer longer than 16 seconds, it will be set to 16 seconds due to the limit imposed above.
This patch does nothing for pi. Which is actually kind of nice. 💖
> Rethinking this. If you set the timer longer than 16 seconds, it will be set to 16 seconds due to the limit imposed above.
> This patch does nothing for pi. Which is actually kind of nice. 💖
You are right, setting it to 16s works. However, anything higher wraps around and may cause sudden reboots. 20s is effectively the same as setting 4s.
> maybe as a hardware module specific for Raspi in https://github.com/nix-community/srvos/tree/main/nixos/hardware ?
Something like https://github.com/danjujan/srvos/commit/8e4491e85d9ae38e8d3b1f15c02b3bb5803e6e68 ?
Can you make a raspberry hardware profile for this? https://github.com/nix-community/srvos/tree/main/nixos/hardware
Alternatively, could we set this value in nixos-hardware?
Otherwise we could also only set the watchdog timer on x86_64 for now and leave aarch64 to the user.
Ok. Let's just set it to 15s instead...
I now see an increased failure rate on x86-based servers. IPMI shows this is caused by watchdogs.
I will try to isolate this a bit.
Lower the watchdog interval to be Raspberry Pi compatible. See https://pimylifeup.com/raspberry-pi-watchdog/