The timeout on the liveness probe that keepalived uses to determine whether it can get a list of cluster nodes is too short. This causes spurious drops of the VIP when there is resource contention on the host system, which has been a problem for data-intensive operations on small systems. A commonly observed failure on systems with 2 CPUs looks like the following:
$ ocne image create
...
INFO[2024-10-07T01:29:23Z] Preparing pod used to create image
INFO[2024-10-07T01:29:29Z] Waiting for pod ocne-system/ocne-image-builder to be ready: ok
INFO[2024-10-07T01:29:29Z] Getting local boot image for architecture: arm64
E1007 01:29:45.332268 59450 v2.go:104] write tcp 127.0.0.1:48594->127.0.0.1:6443: write: connection reset by peer
E1007 01:29:45.332295 59450 v2.go:129] next reader: websocket: close 1006 (abnormal closure): unexpected EOF
E1007 01:29:45.332313 59450 v2.go:150] next reader: websocket: close 1006 (abnormal closure): unexpected EOF
E1007 01:29:45.332269 59450 v2.go:167] next reader: websocket: close 1006 (abnormal closure): unexpected EOF
error: error reading from error stream: next reader: websocket: close 1006 (abnormal closure): unexpected EOF
One minute is a better value. It allows for a significant amount of contention while still accounting for the case where the service is locked up or so bogged down as to be effectively inoperable.
The probe timeout is currently 1 second: https://github.com/oracle-cne/ocne/blob/main/pkg/cluster/ignition/virtual_ip.go#L69
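For illustration only, a timeout-bounded check of this kind might look like the sketch below. It is not the script in virtual_ip.go: the function name `check_nodes`, the `PROBE_TIMEOUT` variable, and the `kubectl get nodes` command in the usage comment are all assumptions. It relies on the coreutils `timeout` command, which exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# Hypothetical liveness check: run a command, but give up after a deadline.
# PROBE_TIMEOUT (in seconds) is an assumed knob, not a real ocne setting;
# it defaults to 60 here to match the proposed one-minute value.
check_nodes() {
    # Discard output; only the exit status matters to the caller.
    timeout "${PROBE_TIMEOUT:-60}" "$@" >/dev/null 2>&1
}

# Example use (assumed command, not taken from the source):
#   check_nodes kubectl get nodes || exit 1
```

With a 1 second limit, a slow but healthy API server would fail this check and the VIP would be dropped. A 60 second limit tolerates contention but still fails if the command never returns.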