nix-community / srvos

NixOS profiles for servers [maintainer=@numtide]
https://nix-community.github.io/srvos
MIT License

decrease watchdog runtimeTime for Raspberry Pi compatibility #523

Closed danjujan closed 1 month ago

danjujan commented 1 month ago

Lower the watchdog interval to be Raspberry Pi compatible. See https://pimylifeup.com/raspberry-pi-watchdog/

SuperSandro2000 commented 1 month ago

Doesn't this make it more trigger-happy on all other architectures? Can we maybe detect Pis and only lower it for them?

danjujan commented 1 month ago

Yes, that might be a drawback. However, I believe srvos should have a default that works across all systems. I'm not aware of any reliable way to detect Pis specifically. The only viable solution I see is to limit this change to the corresponding platforms, i.e. armv6l-linux, armv7l-linux, aarch64-linux. Would that be an option? Otherwise we could use mkDefault to make the 15s value easy to override.

sedlund commented 1 month ago

maybe as a hardware module specific for Raspi in https://github.com/nix-community/srvos/tree/main/nixos/hardware ?

It should be documented, as it's a limitation of the bcm2835_wdt driver. I believe the limit is actually 16 seconds.

https://github.com/torvalds/linux/blob/075dbe9f6e3c21596c5245826a4ee1f1c1676eb8/drivers/watchdog/bcm2835_wdt.c#L29

#define PM_WDOG_TIME_SET        0x000fffff   (1048575)
#define WDOG_TICKS_TO_MSECS(x) ((x) * 1000 >> 16)

.max_hw_heartbeat_ms =  WDOG_TICKS_TO_MSECS(PM_WDOG_TIME_SET),

(1048575 * 1000) >> 16 = 15999 ms, i.e. just under 16 s
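
The arithmetic above can be checked with a small C sketch. It reuses the two macros quoted from bcm2835_wdt.c; the helper name max_hw_heartbeat_ms is hypothetical and not part of the driver. Note that 0x000fffff is 1048575 (one less than 2^20), so the result lands just under 16 s rather than exactly on it.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as quoted from drivers/watchdog/bcm2835_wdt.c.
 * The 'u' suffix is added here so the arithmetic is unsigned. */
#define PM_WDOG_TIME_SET        0x000fffffu
#define WDOG_TICKS_TO_MSECS(x)  ((x) * 1000 >> 16)

/* Compile-time check: 1048575 * 1000 / 65536, rounded down, is 15999 ms. */
static_assert(WDOG_TICKS_TO_MSECS(PM_WDOG_TIME_SET) == 15999,
              "maximum hardware heartbeat should be just under 16 s");

/* Hypothetical helper (not in the kernel): the largest timeout, in
 * milliseconds, that fits into the 20-bit tick field. */
uint32_t max_hw_heartbeat_ms(void)
{
    return WDOG_TICKS_TO_MSECS(PM_WDOG_TIME_SET);
}
```

One hardware tick is 1/65536 s, so a 20-bit counter tops out at about 16 s. This is why a 15s default stays inside the limit.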

Also, I have had to disable the watchdog on the Pi when running NixOS activations that involve many systemd restarts: with slow SD card I/O, the system resets before the activation completes.

sedlund commented 1 month ago

Rethinking this. If you set the timer longer than 16 seconds, it will be set to 16 seconds due to the limit imposed above.

This patch does nothing for pi. Which is actually kind of nice. 💖

danjujan commented 1 month ago

> Rethinking this. If you set the timer longer than 16 seconds, it will be set to 16 seconds due to the limit imposed above.
>
> This patch does nothing for pi. Which is actually kind of nice. 💖

You are right, setting it to 16s works. However, anything higher wraps around and may cause sudden reboots. 20s is effectively the same as setting 4s.
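
The 20s-to-4s wrap-around can be illustrated with a short C sketch. It rests on an assumption that is not confirmed in this thread: that the requested timeout is converted to ticks (1 s = 2^16 ticks) and then masked to the 20-bit PM_WDOG_TIME_SET field. SECS_TO_WDOG_TICKS is written here as the assumed inverse of WDOG_TICKS_TO_MSECS, and effective_timeout_secs is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define PM_WDOG_TIME_SET       0x000fffffu
/* Assumption: 1 second corresponds to 2^16 watchdog ticks. */
#define SECS_TO_WDOG_TICKS(x)  ((uint32_t)(x) << 16)

/* Hypothetical helper: the timeout (in whole seconds) that remains after
 * the tick count is masked to the 20-bit field. Models plain masking only;
 * any clamping done elsewhere in the kernel is deliberately ignored. */
uint32_t effective_timeout_secs(uint32_t requested_secs)
{
    /* Keep the left shift within 32 bits. */
    assert(requested_secs <= (UINT32_MAX >> 16));

    uint32_t ticks = SECS_TO_WDOG_TICKS(requested_secs) & PM_WDOG_TIME_SET;
    return ticks >> 16;
}
```

Under this model effective_timeout_secs(15) stays at 15, while effective_timeout_secs(20) drops to 4, matching the behaviour described above. The same model would also reduce exactly 16 to 0, so it is only a rough illustration of the wrap-around and not a description of everything the driver does.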

danjujan commented 1 month ago

> maybe as a hardware module specific for Raspi in https://github.com/nix-community/srvos/tree/main/nixos/hardware ?

Something like https://github.com/danjujan/srvos/commit/8e4491e85d9ae38e8d3b1f15c02b3bb5803e6e68 ?

Mic92 commented 1 month ago

Can you make a Raspberry Pi hardware profile for this? https://github.com/nix-community/srvos/tree/main/nixos/hardware

Alternatively, could we set this value in nixos-hardware?

Otherwise, we could set the watchdog timer only on x86_64 for now and leave aarch64 to the user.

Mic92 commented 1 month ago

Ok. Let's just set it to 15s instead...

Mic92 commented 2 weeks ago

I now see an increased failure rate on x86-based servers. IPMI shows this is caused by watchdogs.

Mic92 commented 2 weeks ago

I will try to isolate this a bit