oracle-cne / ocne

The Oracle Cloud Native Environment CLI
Universal Permissive License v1.0

Keepalived kube-apiserver liveness probe has too short a timeout #72

Closed dkrasins closed 1 week ago

dkrasins commented 1 week ago

The timeout on the liveness probe that keepalived uses to determine whether it can get a list of cluster nodes is too short. This results in spurious drops of the VIP when there is resource contention on the hosting system, which has been a problem for data-intensive operations on small systems. A commonly observed failure on systems with 2 CPUs looks like the following:

$ ocne image create
...
INFO[2024-10-07T01:29:23Z] Preparing pod used to create image           
INFO[2024-10-07T01:29:29Z] Waiting for pod ocne-system/ocne-image-builder to be ready: ok 
INFO[2024-10-07T01:29:29Z] Getting local boot image for architecture: arm64 
E1007 01:29:45.332268   59450 v2.go:104] write tcp 127.0.0.1:48594->127.0.0.1:6443: write: connection reset by peer
E1007 01:29:45.332295   59450 v2.go:129] next reader: websocket: close 1006 (abnormal closure): unexpected EOF
E1007 01:29:45.332313   59450 v2.go:150] next reader: websocket: close 1006 (abnormal closure): unexpected EOF
E1007 01:29:45.332269   59450 v2.go:167] next reader: websocket: close 1006 (abnormal closure): unexpected EOF
error: error reading from error stream: next reader: websocket: close 1006 (abnormal closure): unexpected EOF

The probe timeout is currently 1 second: https://github.com/oracle-cne/ocne/blob/main/pkg/cluster/ignition/virtual_ip.go#L69

1 minute is a better number. It allows for a significant amount of contention while still accounting for the fact that the service may be locked or so bogged down as to be effectively inoperable.