githubcdr opened 2 months ago
I think nodes handle reconnects in that case. If a node goes down and gets rebooted, it should check in to Omni immediately. When Omni itself is down, the node also tries to reconnect, but retries with backoff. Omni should keep the last known endpoints for the nodes, but maybe something is wrong there.
Worth checking if there is a bug.
Do the machines reconnect after some time?
I had to reboot them manually, but only some disappeared. I gave it 4 hours before doing so.
It just happened again; it's reproducible in my case by stopping Omni for 5 minutes and starting it again. The WireGuard connection error pops up in the logs and some hosts become isolated.
Please make sure you're running a recent enough Talos, and attach logs of the Talos node here after it gets disconnected.
I'm running the latest version; maybe stale SideroLink peers are causing this issue? I run Omni in a testing environment and create and destroy a lot of clusters. Even when all nodes are green I still see the logs below.
Sometimes a node is greyed out with no logs available, but I can still reboot it via Omni.
{"level":"warn","ts":1726910417.2707446,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910418.1342225,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910419.3680203,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910420.5396538,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.4085839,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.5408869,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910422.4702663,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910422.5636215,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"k8s_proxy","request_url":"/apis/metrics.k8s.io/v1beta1/pods","method":"GET","remote_addr":"xxx","duration":0.030410022,"status":503,"response_length":20,"cluster":"silly-skynet","cluster_uuid":"","impersonate.user":"info@codar.nl","impersonate.groups":["system:masters"]}
{"level":"warn","ts":1726910423.1459103,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910424.6307366,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910425.7384636,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.5885136,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.6452115,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910427.6772408,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
...
{"level":"warn","ts":1726910660.4494154,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910660.7291093,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/admin/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.00004044,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.7074013,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910662.7517579,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/backup/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.0000428,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.9705136,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910663.171093,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910663.3321455,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/blog/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxxx","duration":0.00004516,"status":200,"response_length":17
Things seem to be a lot more stable for me after updating to 1.8.0 and removing disconnected old hosts, so far so good!
Maybe these node logs help:
01/01/1970 11:17:24
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&resourceVersion=7429241\": EOF", "error_count": 4}
01/01/1970 11:18:08
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&limit=500&resourceVersion=0\": EOF", "error_count": 0}
01/01/1970 17:05:27
[talos] error watching discovery service state {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}
This is totally unrelated. If Talos disconnects from Omni, the logs will mention SideroLink.
Probably unrelated, but the time is off by a lot somewhere in your infra ;)
The date on the node is correct; it's a cosmetic UI issue in Omni, which I can reproduce.
The nodes are connected by cable on a stable network. The failures seem to happen at random, with this as the last node message:
[talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://[fdae:41e4303::1]:10000/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=645709\": dial tcp [fdae:41303::1]:10000: i/o timeout"}
I tried using both the internal and the public discovery service; the results are the same. The k8s cluster itself is operational, the node is reachable, and port 50000 is open, indicating that the Talos API is operational. Sometimes a reboot works, which is strange, since rebooting via Omni depends on the WireGuard connection.
I'll keep debugging, thanks for all your work.
Is there an existing issue for this?
Current Behavior
When upgrading Omni, I sometimes notice that SideroLink connections go down for a long time. Nodes are not available during this downtime.
Expected Behavior
Nodes repair (restore) their connections to Omni after unexpected disconnects. I would expect at least a reconnect-on-failure from the nodes.
Steps To Reproduce
Bring Omni 0.42.3 down for a few minutes and restart it; some nodes are then not available/connected.
What browsers are you seeing the problem on?
No response
Anything else?
A reboot command to the node sometimes works.