siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[bug] machine reconnect after omni downtime #638

Open githubcdr opened 2 months ago

githubcdr commented 2 months ago

Current Behavior

When upgrading Omni, I sometimes notice that the SideroLink connections go down for a long time. Nodes are not available during this downtime.

Expected Behavior

Nodes should restore their connections to Omni after unexpected disconnects. I would expect nodes to at least attempt a reconnect on failure.

Steps To Reproduce

Bring Omni 0.42.3 down for a few minutes and restart it. Afterwards, some nodes are not available/connected, and the Omni log shows:

{"level":"warn","ts":1726598097.5886366,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.365197,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.4972,"caller":"device/send.go:138","msg":"peer(OyBh…3QWc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.510401,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
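To see how many distinct peers are affected and over what time span, the warnings can be grouped by peer. The following is a minimal sketch, assuming the Omni log is available as one JSON object per line, as in the excerpt above; the function names `summarize` and `to_utc` are made up for this example and are not part of Omni.

```python
import json
import re
from collections import Counter
from datetime import datetime, timezone

# Matches the peer label inside messages such as "peer(jJtQ…3034) - Failed to ..."
PEER_RE = re.compile(r"peer\(([^)]+)\)")
NEEDLE = "no known endpoint for peer"


def summarize(lines):
    """Count "no known endpoint" warnings per peer.

    Returns (Counter keyed by peer label, earliest ts, latest ts).
    Lines that are not valid JSON objects (for example "..." or a
    truncated entry) are skipped.
    """
    counts = Counter()
    first = last = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if NEEDLE not in msg:
            continue
        match = PEER_RE.search(msg)
        if not match:
            continue
        counts[match.group(1)] += 1
        ts = entry.get("ts")
        if isinstance(ts, (int, float)):
            first = ts if first is None else min(first, ts)
            last = ts if last is None else max(last, ts)
    return counts, first, last


def to_utc(ts):
    """Render the log's epoch-seconds "ts" field as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
```

Calling `summarize(open(path, encoding="utf-8"))` on a saved log (the path is up to you) gives a per-peer count that can be compared against the machines shown as disconnected in the UI.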

Anything else?

A reboot command to the node sometimes works.

Unix4ever commented 2 months ago

I think nodes handle reconnects in that case. If a node goes down and gets rebooted, it should check into Omni immediately. When Omni is down, I guess the node also tries to reconnect, but retries with backoff. Omni should keep the last known endpoints for the nodes, but maybe there's something wrong.
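For readers unfamiliar with the term, "retries with backoff" usually means waiting longer after each failed attempt, up to some cap. The sketch below only illustrates that general pattern; it is not the Talos or SideroLink implementation, and the delay values, the `connect` callable, and both function names are assumptions chosen for the example.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=30.0, attempts=6):
    """Yield capped exponential delays: base, base*factor, ..., never above cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def reconnect(connect, delays, sleep=time.sleep, jitter=0.0):
    """Call connect() until it returns True or the delays run out.

    connect is a hypothetical callable returning True on success.
    After each failed attempt the function sleeps for the next delay,
    plus up to `jitter` seconds of random extra wait, then tries again.
    One final attempt is made after the last delay.
    """
    for delay in delays:
        if connect():
            return True
        sleep(delay + random.uniform(0, jitter))
    return connect()
```

With the defaults, the waits are 1, 2, 4, 8, 16 and 30 seconds. A cap like this bounds how long a node stays unreachable after the server returns, which is why a multi-hour outage would not be explained by backoff alone under these assumptions.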

Worth checking if there is a bug.

Unix4ever commented 2 months ago

Do the machines reconnect back after some time?

githubcdr commented 2 months ago

I had to reboot them manually, but only some of them had disappeared. I gave it 4 hours before doing so.

githubcdr commented 2 months ago

It just happened again. In my case it is reproducible by stopping Omni for 5 minutes and starting it again: the WireGuard connection error pops up in the logs and some hosts become isolated.

smira commented 1 month ago

Please make sure you're running a recent enough version of Talos, and attach the logs of a Talos node here after it gets disconnected.

githubcdr commented 1 month ago

I'm running the latest version. Maybe stale SideroLink connections are causing this issue? I run Omni in a testing environment and create and destroy a lot of clusters. Even when all nodes are green, I still see the logs below.

Sometimes a node is greyed out with no logs available, but I can still reboot the node via Omni.

{"level":"warn","ts":1726910417.2707446,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910418.1342225,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910419.3680203,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910420.5396538,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.4085839,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.5408869,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910422.4702663,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910422.5636215,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"k8s_proxy","request_url":"/apis/metrics.k8s.io/v1beta1/pods","method":"GET","remote_addr":"xxx","duration":0.030410022,"status":503,"response_length":20,"cluster":"silly-skynet","cluster_uuid":"","impersonate.user":"info@codar.nl","impersonate.groups":["system:masters"]}
{"level":"warn","ts":1726910423.1459103,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910424.6307366,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910425.7384636,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.5885136,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.6452115,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910427.6772408,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
...
{"level":"warn","ts":1726910660.4494154,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910660.7291093,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/admin/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.00004044,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.7074013,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910662.7517579,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/backup/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.0000428,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.9705136,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910663.171093,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910663.3321455,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/blog/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxxx","duration":0.00004516,"status":200,"response_length":17

githubcdr commented 1 month ago

Things seem to be a lot more stable for me after updating to 1.8.0 and removing disconnected old hosts, so far so good!

githubcdr commented 1 month ago

Maybe these node logs help:

01/01/1970 11:17:24
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&resourceVersion=7429241\": EOF", "error_count": 4}
01/01/1970 11:18:08
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&limit=500&resourceVersion=0\": EOF", "error_count": 0}
01/01/1970 17:05:27
[talos] error watching discovery service state {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}

smira commented 1 month ago

This is totally unrelated. If Talos disconnects from Omni, the logs will mention SideroLink.

Probably also unrelated, but the time is off by a lot somewhere in your infra ;)

githubcdr commented 1 month ago

The date on the node is correct; the wrong date is a display issue in the Omni UI. I can reproduce this issue.

The nodes are connected by cable on a stable network. The disconnects seem to happen at random, with this as the last node message:

[talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://[fdae:41e4303::1]:10000/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=645709\": dial tcp [fdae:41303::1]:10000: i/o timeout"}

I tried both internal and public discovery; the results are the same. The k8s cluster itself is operational, the node is reachable, and port 50000 is open, indicating that the api-server is operational. Sometimes a reboot works (which is strange, since it depends on the WireGuard connection).
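The "port 50000 is open" check can be scripted so it is easy to repeat across nodes. This is a minimal sketch using only the Python standard library; the function name `tcp_port_open` and the address in the usage comment are placeholders, not values from this issue. Note that a successful TCP connect only shows that something is accepting connections on that port; it says nothing about the state of the WireGuard tunnel to Omni.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (hypothetical node address):
# tcp_port_open("192.0.2.10", 50000)
```

Running it against each node's address right after an Omni restart would show whether the isolated hosts are still reachable on the network while missing from Omni.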

I'll keep debugging, thanks for all your work.