Closed gerhard closed 1 year ago
Omni doesn't use IPv6 connectivity to the SaaS, only IPv4. What you see is a tunneled ULA address, which has nothing to do with your public outbound path. Talos-to-Omni connectivity is based on WireGuard, and it should reconnect on its own, but there might be some issues related to your setup: when your router changes the WAN path, the endpoint observed by Talos might not work anymore, and in the same way the endpoint seen on the Omni side changes with the WAN path. So basically Talos needs to reset the WireGuard endpoint to re-establish connectivity. We will be looking more into SideroLink reliability under changing conditions in future releases.
Just to double-check my understanding, this is a Talos to Omni WireGuard issue. WireGuard runs IPv6 on top of IPv4. When IPv4 routing changes, WireGuard IPv6 connections do not get re-established.
Is there something else that I can do apart from rebooting the host?
IPv6 is used inside the WireGuard tunnel, so it can be ignored here completely.
Everything else is more about the way WireGuard works. In your case, when the WAN connection changes, both sides (Omni and Talos) lose connectivity, because from the point of view of each of them the peer endpoint has changed. The solution is for Talos to reset the endpoint back to the Omni public IP endpoint, but it's not smart enough to do that at the moment. We will look into improving the resiliency of the WireGuard connection, taking your case into account as well.
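To illustrate the endpoint reset described above, here is a minimal sketch of the decision logic. It is not Talos's actual implementation, and since Talos has no shell it is not a workaround you can run on the node. It assumes a watchdog that reads the peer's latest handshake time in the format printed by `wg show <interface> latest-handshakes` and, once the handshake is too old, re-applies the known Omni endpoint with `wg set`. The function names, the 180-second threshold, the interface name and the example endpoint are all hypothetical.

```python
from typing import Dict, List

# Hypothetical threshold: treat the tunnel as stale if no handshake has
# completed within this many seconds. Not a value taken from Talos.
STALE_AFTER_SECONDS = 180


def parse_latest_handshakes(output: str) -> Dict[str, int]:
    """Parse `wg show <iface> latest-handshakes` output.

    Each line is `<public-key><TAB><unix-timestamp>`; 0 means the peer
    has never completed a handshake.
    """
    handshakes: Dict[str, int] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        public_key, timestamp = line.split()
        handshakes[public_key] = int(timestamp)
    return handshakes


def needs_endpoint_reset(last_handshake: int, now: int,
                         stale_after: int = STALE_AFTER_SECONDS) -> bool:
    """Return True if the peer never handshook or the handshake is stale."""
    return last_handshake == 0 or now - last_handshake > stale_after


def endpoint_reset_command(interface: str, public_key: str,
                           endpoint: str) -> List[str]:
    """Build the `wg set` invocation that pins the peer back to a known endpoint."""
    return ["wg", "set", interface, "peer", public_key, "endpoint", endpoint]
```

The point of the sketch is that the side which knows a stable public endpoint (Talos knows Omni's) is the one that has to re-pin it; Omni cannot do the same for a node whose outbound path has just changed.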
OK! Sounds great 👍
Any update on this?
FWIW, I am starting to ramp up more production workloads on Talos OS: https://github.com/metal-stack/csi-driver-lvm/pull/87#issuecomment-1527048244. It would be good to have a rough idea of a timeline for this. Knowing whether this will take weeks or months will help me calibrate my check-ins.
We don't have an exact timeline at the moment. The fix should land in Talos, and it will be backported to Talos 1.4.x.
I would say around 2 weeks.
OK. I will keep this one open and check again after a few weeks.
FWIW, https://github.com/siderolabs/omni-feedback/issues/41 is more impactful than this one. I can live with this, but I am limited as to how many production workloads with no memory limits I can add to my clusters.
This now seems to work as expected.
My primary WAN disconnected for a few minutes, and the machine remained available in Omni.
I am going to manually introduce a longer disconnect to see how it behaves.
This now works as expected on the Talos OS v1.4.1 cluster managed by Omni v0.8.1.
Closing - thank you! 💪
Is there an existing issue for this?
Current Behavior
I have the following Talos v1.4.0 machine managed by Omni v0.7.0:
The machine appears as disconnected. If I go to the Node page on Omni, this is what I see:
Notice the Error: connection error: desc = "transport: Error while dialing: dial tcp [fdae:41e4:649b:9303:efd8:e016:40fe:63ed]:50000: i/o timeout".
This is the console view on that machine:
Workloads that run on this machine however are publicly accessible:
If traffic is routable to (and from) the machine, and Omni shows this machine as disconnected on the IPv6 interface, is it possible that the link is not being re-established after WAN connectivity is restored?
FWIW, I have dual WAN running in active/active mode, each with a different route metric. As soon as the primary WAN becomes unavailable, all traffic automatically shifts to the secondary WAN route. When primary WAN is restored, this becomes the preferred 0.0.0.0/0 route.
My suspicion is that Talos/Omni does not properly handle network unavailability. When my primary WAN disconnected and routing shifted to the secondary, Talos/Omni didn't recycle the existing IPv6 connection. I am assuming that this IPv6 connection is still running in the machine, but no packets are able to go through.
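As a rough model of the failover described above, the sketch below picks the default route with the lowest metric among the links that are currently up. It is a simplification for illustration only, not how any particular router implements it; the interface names and metric values are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DefaultRoute:
    interface: str   # hypothetical interface name, e.g. "wan1"
    metric: int      # lower metric wins
    link_up: bool    # whether the WAN link is currently usable


def preferred_route(routes: Iterable[DefaultRoute]) -> Optional[DefaultRoute]:
    """Return the usable 0.0.0.0/0 route with the lowest metric, or None."""
    usable = [route for route in routes if route.link_up]
    return min(usable, key=lambda route: route.metric, default=None)
```

Every time the preferred route flips between the two WANs, outbound traffic leaves from a different public address, which is why a long-lived tunnel sees its peer endpoint change twice per outage.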
Expected Behavior
I would expect the Talos machine to restore connectivity to Omni within a minute (preferably less) when there are any networking issues.
Steps To Reproduce
It would be more sensible to show you. It would be difficult to reproduce my setup. In a nutshell:
What browsers are you seeing the problem on?
No response
Anything else?
My current workaround is to power cycle the Talos node. It's not great, but thank goodness for Supermicro's web-based IPMI which still kind of works (except the Java bit).