[bug] Machine remains disconnected after WAN connectivity is restored

gerhard commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

I have the following Talos v1.4.0 machine managed by Omni v0.7.0:

The machine appears as disconnected. If I go to the Node page on Omni, this is what I see:

Notice the Error: connection error: desc = "transport: Error while dialing: dial tcp [fdae:41e4:649b:9303:efd8:e016:40fe:63ed]:50000: i/o timeout".

This is the console view on that machine: Snapshot

Workloads that run on this machine however are publicly accessible:

httpstat <FQDN>
Connected to <WAN_IP>:80 from <LOCAL_PUBLIC_IP>:59808

HTTP/1.1 200 OK
Date: Thu, 20 Apr 2023 06:25:01 GMT
Content-Type: text/html
Content-Length: 615
Connection: keep-alive
Last-Modified: Tue, 11 Apr 2023 01:45:34 GMT
ETag: "6434bbbe-267"
Accept-Ranges: bytes

Body stored in: /tmp/tmpAPLNPB

  DNS Lookup   TCP Connection   Server Processing   Content Transfer
[     4ms    |      12ms      |       13ms        |        0ms       ]
             |                |                   |                  |
    namelookup:4ms            |                   |                  |
                        connect:16ms              |                  |
                                      starttransfer:29ms             |
                                                                 total:29ms

If traffic is routable to (and from) the machine, and Omni shows this machine as disconnected on the IPv6 interface, is it possible that the link is not being re-established after WAN connectivity is restored?

FWIW, I have dual WAN running in active/active mode, each with a different route metric. As soon as the primary WAN becomes unavailable, all traffic automatically shifts to the secondary WAN route. When primary WAN is restored, this becomes the preferred 0.0.0.0/0 route.
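
For readers unfamiliar with metric-based failover, the behaviour described above can be sketched roughly as follows. This is only an illustration of "lowest metric among usable default routes wins"; the `Route` class and `preferred_route` function are hypothetical names, and the metrics 3 and 4 are taken from the diagram under Steps To Reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str    # uplink label, e.g. "wan1" (illustrative)
    metric: int  # lower metric is preferred
    up: bool     # whether the uplink currently has connectivity


def preferred_route(routes):
    """Return the usable default route with the lowest metric, or None."""
    usable = [r for r in routes if r.up]
    return min(usable, key=lambda r: r.metric, default=None)


# Both uplinks healthy: the primary (metric 3) carries the default route.
both_up = preferred_route([Route("wan1", 3, True), Route("wan2", 4, True)])

# Primary down: traffic shifts to the secondary (metric 4).
failover = preferred_route([Route("wan1", 3, False), Route("wan2", 4, True)])

print(both_up.name, failover.name)
```

The point for this issue is that every such switch changes the source address the outside world sees for the machine.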

My suspicion is that Talos/Omni does not properly handle network unavailability. When my primary WAN disconnected and routing shifted to the secondary, Talos/Omni didn't recycle the existing IPv6 connection. I am assuming that this IPv6 connection is still running in the machine, but no packets are able to go through.

Expected Behavior

I would expect the Talos machine to restore connectivity to Omni within a minute (preferably less) when there are any networking issues.

Steps To Reproduce

It would be more sensible to show you. It would be difficult to reproduce my setup. In a nutshell:

flowchart LR
    wan1
    wan2
    router
    switch
    talos

    wan1 === |0.0.0.0/0 metric 3| router
    wan1 === |active| router

    wan2 --- |0.0.0.0/0 metric 4| router
    wan2 --- |active| router

    router === switch === talos

What browsers are you seeing the problem on?

No response

Anything else?

My current workaround is to power cycle the Talos node. It's not great, but thank goodness for Supermicro's web-based IPMI which still kind of works (except the Java bit).

smira commented 1 year ago

Omni doesn't use IPv6 connectivity, only IPv4 to the SaaS. What you see is a tunneled ULA address, which has nothing to do with your public outbound path. Talos-to-Omni connectivity is based on WireGuard, and it should reconnect on its own, but there might be some issues related to your setup: the endpoint observed by Talos might stop working when your router changes the WAN path, and on the Omni side the observed endpoint changes with the WAN path in the same way. So basically Talos needs to reset the WireGuard endpoint to re-establish connectivity. We will be looking more into SideroLink reliability under changing network conditions in future releases.

gerhard commented 1 year ago

Just to double-check my understanding, this is a Talos to Omni WireGuard issue. WireGuard runs IPv6 on top of IPv4. When IPv4 routing changes, WireGuard IPv6 connections do not get re-established.

Is there something else that I can do apart from rebooting the host?

smira commented 1 year ago

IPv6 is only used inside the WireGuard tunnel, so it can be ignored here completely.

Everything else is more about the way WireGuard works. In your case, when the WAN connection changes, both sides (Omni and Talos) lose connectivity because, from the point of view of each, the peer's endpoint has changed. The solution is for Talos to reset the endpoint back to the Omni public IP endpoint, but it's not smart enough to do that at the moment. We will look into improving the resiliency of the WireGuard connection, taking your case into account as well.
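
A minimal sketch of the "reset the endpoint" idea, under stated assumptions: this is not the Talos implementation and it does not call any WireGuard API. The `Peer` class, `maybe_reset_endpoint` function, and the 180-second threshold are all hypothetical, and the addresses are documentation-range placeholders. It only models the decision: if no handshake has succeeded for too long, fall back to the configured endpoint.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    configured_endpoint: str  # well-known endpoint from configuration
    current_endpoint: str     # endpoint currently in use (may have roamed)
    last_handshake: float     # timestamp of the last handshake, in seconds


HANDSHAKE_TIMEOUT = 180.0  # hypothetical staleness threshold, in seconds


def maybe_reset_endpoint(peer, now, timeout=HANDSHAKE_TIMEOUT):
    """Point a stale peer back at its configured endpoint.

    Returns True if the peer was considered stale and was reset.
    """
    if now - peer.last_handshake > timeout:
        peer.current_endpoint = peer.configured_endpoint
        return True
    return False


# Placeholder addresses from the RFC 5737 documentation ranges.
peer = Peer(
    configured_endpoint="203.0.113.10:51820",
    current_endpoint="198.51.100.7:40000",
    last_handshake=0.0,
)

recent = maybe_reset_endpoint(peer, now=60.0)   # handshake still fresh
stale = maybe_reset_endpoint(peer, now=200.0)   # handshake too old
print(recent, stale, peer.current_endpoint)
```

A real fix would also need to decide how often to run this check and what to do if the configured endpoint itself is unreachable; the sketch leaves both out.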

gerhard commented 1 year ago

OK! Sounds great 👍

gerhard commented 1 year ago

Any update on this?

FWIW, I am starting to ramp up more production workloads on Talos OS: https://github.com/metal-stack/csi-driver-lvm/pull/87#issuecomment-1527048244. It would be good to have a rough idea of a timeline for this. Knowing whether this will take weeks or months will help me calibrate my check-ins.

smira commented 1 year ago

We don't have an exact timeline at the moment. The fix should land in Talos, and it will be backported to Talos 1.4.x.

I would say around 2 weeks.

gerhard commented 1 year ago

OK. I will keep this one open and check again after a few weeks.

FWIW, https://github.com/siderolabs/omni-feedback/issues/41 is more impactful than this one. I can live with this, but I am limited as to how many production workloads with no memory limits I can add to my clusters.

gerhard commented 1 year ago

This now seems to work as expected.

My primary WAN disconnected for a few minutes, and the machine remained available in Omni.

I am going to manually introduce a longer disconnect to see how it behaves.

gerhard commented 1 year ago

This now works as expected on the Talos OS v1.4.1 cluster managed by Omni v0.8.1.

Closing - thank you! 💪
