siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[bug] talosctl IO timeout talking to node through omni endpoint #480

Closed keatsfonam closed 4 months ago

keatsfonam commented 4 months ago

Current Behavior

I am testing a single node cluster created with omni. When using talosctl through the omni endpoint it is unable to connect to the node (192.168.100.150) in question:

❯ talosctl --talosconfig talosconfig -n 192.168.100.150 disks
error getting disks: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 192.168.100.150:50000: i/o timeout"

The node is reachable within the private network it lives in, but omni cannot route to it. omnictl and kubectl are functioning properly through an omni endpoint. All omni functionality seems to be intact.

When testing using the node itself as an endpoint traffic is flowing properly and stops at the point of asking for a client cert:

❯ talosctl --talosconfig talosconfig -e 192.168.100.150 -n 192.168.100.150 disks 
error getting disks: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority"

# insecure:
❯ talosctl --talosconfig talosconfig -e 192.168.100.150 -n 192.168.100.150 disks -insecure
error getting disks: rpc error: code = Unavailable desc = last connection error: connection error: desc = "error reading server preface: remote error: tls: certificate required"

Test that the node can reach its own IP:

❯ openssl s_client -connect 192.168.100.150:50000
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = talos-xxx-yy
...
---
Certificate chain
 0 s:CN = talos-xxx-yyy
...
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIBdz....
-----END CERTIFICATE-----
subject=CN = talos-xxx-yyy
issuer=O = talos

When testing using the ipv6 management address from the tunnel talosctl can talk to the node:

❯ talosctl --talosconfig talosconfig -n <IPv6 address from omnictl get machine> disks
error getting disks: rpc error: code = Unavailable desc = last connection error: connection error: desc = "error reading server preface: remote error: tls: certificate required"

Expected Behavior

talosctl can talk to the node through the omni endpoint

Steps To Reproduce

Single node cluster. Omni is on second network and cannot route to the private network directly. Node

talosctl version: v1.7.5
omni version: v1.30.1

talosconfig:

context: omni
contexts:
    omni:
        endpoints:
            - https://omni.domain.com
        auth:
            siderov1:
                identity: user@email.com

omni flags

            --account-id=${uuid}
            --name=${name}
            --private-key-source='file:///omni.asc'
            --event-sink-port=${event_sink_port}
            --bind-addr=0.0.0.0:${ui_port}
            --machine-api-bind-addr=0.0.0.0:${api_port}
            --k8s-proxy-bind-addr=${proxy_bind_addr}
            --advertised-api-url=https://${domain_name}
            --advertised-kubernetes-proxy-url=https://${domain_name}:${frontend_proxy_port}/
            --siderolink-api-advertised-url=https://${var.domain_name}:${frontend_api_port}/
            --siderolink-wireguard-advertised-addr=$(public_ipv4):${wireguard_port}
            --initial-users="${initial_user}"
            --auth-auth0-enabled=true
            --auth-auth0-domain=${auth0_domain}
            --auth-auth0-client-id=${auth0_client_id}

What browsers are you seeing the problem on?

No response

Anything else?

I will bring up more nodes to continue testing but I would expect it to work with a single node

smira commented 4 months ago

Can you please attach omnictl support bundle?

keatsfonam commented 4 months ago

Attached here: support.zip

smira commented 4 months ago

The support bundle is empty, so not sure what is going on.

If Omni can reach the Talos API, talosctl using the talosconfig downloaded from Omni should be able to reach it as well.

keatsfonam commented 4 months ago

@smira Sorry about that. I fixed the link, it wasn't zipped properly

I agree that it should be able to reach, I am just not sure how to troubleshoot further. Everything works perfectly except talosctl. To me it looks like it is unaware that it should use the wireguard link to talk to the node

bauerjs1 commented 4 months ago

Having the same issue here. I was confused to see a

dial tcp <target node IP>:50000: i/o timeout

error because I thought that talosctl commands would be proxied over Omni and SideroLink and would not use the "direct" connection to the machine on port 50000.

Now thanks @keatsfonam for digging into this and finding out that SideroLink is only used if one passes the IPv6 address from the Wireguard VPN to talosctl. I think the expected behaviour should be that Omni always uses the SideroLink tunnel.

Unix4ever commented 4 months ago

For the machines running in the cluster any IP should work. This kind of command:

talosctl get rd -n 10.5.0.2 --cluster=talos-default

But if you are using the global Omni talosconfig and trying to reach a machine that isn't allocated, you should use either the node UUID or the IPv6 SideroLink endpoint:

talosctl get rd -n <uuid>

I'd say it's easier to just always use node UUID, as it works both in unallocated mode and when the machine is part of a cluster.
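
The addressing advice above can be wrapped in a small shell helper. This is a minimal sketch, not part of talosctl or Omni: the function name talosctl_args is hypothetical, and it only emits the -n and --cluster flags that already appear in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of talosctl or Omni): print the talosctl
# arguments used in this thread for a given node identifier.
#   $1 = node identifier (machine UUID, or the IP of a node in a cluster)
#   $2 = optional cluster name, passed through as --cluster=<name>
talosctl_args() {
  local node="$1" cluster="${2:-}"
  if [ -z "$node" ]; then
    echo "usage: talosctl_args <node> [cluster]" >&2
    return 1
  fi
  if [ -n "$cluster" ]; then
    # Node addressed by IP inside a named cluster.
    printf -- '-n %s --cluster=%s\n' "$node" "$cluster"
  else
    # Node addressed by UUID (works allocated or unallocated, per the comment above).
    printf -- '-n %s\n' "$node"
  fi
}
```

For example, `talosctl get rd $(talosctl_args 10.5.0.2 talos-default)` expands to the first command shown above, and `talosctl get rd $(talosctl_args <uuid>)` to the second.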

keatsfonam commented 4 months ago

@Unix4ever thanks for clarifying

When I use the UUID it points to the IPv4 address and has the same problem:

❯ talosctl get rd -n 4c4c4544-XXX-XXXX-...
rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 192.168.100.150:50000: i/o timeout"

It looks like there is a built-in dns resolver. I enabled debug logs and I see:

omni  | {"level":"debug","ts":1721928492.5561075,"caller":"dns/service.go:209","msg":"set node DNS entry","id":"4c4c4544-XXXX-XXXX-..","cluster":"test-cluster","node_name":"talos-627-jvm","address":"192.168.100.150"}

Are you saying this should be the IPv6 management address?

Update on the above: I was playing with multiple talosconfigs to debug this and removed that. This seems to work now.

--

If I use the management address directly it asks for a client cert and doesn't use the pgp key just generated:

❯ talosctl get rd -n fdae:XXXX:XXXX-...
rpc error: code = Unavailable desc = last connection error: connection error: desc = "error reading server preface: remote error: tls: certificate required"

I got the cert/key using the break glass and put it into the config and it works with the IPv6 address. It also works fine for the IPv4 address if I remove the omni endpoint as expected.

smira commented 4 months ago

@keatsfonam I looked over the updated support.zip and I can't find anything there.

So two options left:

  1. Check your talosconfig - talosctl config info
  2. Look at the Omni logs when you execute talosctl command.

Note: if you use the break-glass talosconfig, Omni won't provide connectivity to your cluster, so you need to be able to reach it yourself.
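
Since several talosconfigs were in play during debugging, a quick check of which kind a file is may help. The following is a rough sketch only: the function name talosconfig_kind is hypothetical, and it assumes an Omni-issued file carries a siderov1: auth block (as in the talosconfig posted in this issue) while a directly connecting file carries client crt: and key: entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess whether a talosconfig goes through Omni or
# connects to the nodes directly.
# Assumptions: an Omni-issued file has a "siderov1:" auth block, and a
# direct (break-glass) file has client "crt:" and "key:" entries.
talosconfig_kind() {
  local file="$1"
  if [ ! -r "$file" ]; then
    echo "unreadable"
    return 1
  fi
  if grep -qE '^[[:space:]]*siderov1:' "$file"; then
    echo "omni"      # requests are authenticated and proxied by Omni
  elif grep -qE '^[[:space:]]*crt:' "$file" && grep -qE '^[[:space:]]*key:' "$file"; then
    echo "direct"    # you must be able to reach the nodes yourself
  else
    echo "unknown"
  fi
}
```

Running `talosconfig_kind ./talosconfig` before a talosctl command is a cheap way to confirm which path the request will take; `talosctl config info` remains the authoritative view.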

keatsfonam commented 4 months ago

The issue with mapping the UUID to management address seems to be a mistake on my part with the multiple talosconfigs I was testing to debug

The core issue seems to be confusion about how the nodes are addressed. The docs are not clear that we are supposed to use the UUID to address the nodes. My expectation was that the SideroLink tunnel is used for talking to the nodes when using the address listed in the UI.

smira commented 4 months ago

You should never use SideroLink addresses, they are not user-facing. You should use a normal node address or the UUID which Omni would resolve to a node address.

smira commented 4 months ago

And also a hostname should work

keatsfonam commented 4 months ago

@smira thanks. I'll close out this issue then

I expect hostname to work and it's actually the first thing I tried but there doesn't seem to be a DNS entry for node name in the resolver:

❯ talosctl get rd -n talos-ykx-moz
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup talos-ykx-moz on 127.0.0.53:53: no such host"

smira commented 4 months ago

Hostname resolution should work only via Omni.

Talos can do this as well, but it's not enabled by default.

keatsfonam commented 4 months ago

@smira can you clarify what exactly is not enabled by default with talos? This is via the omni endpoint so not quite sure what you mean

smira commented 4 months ago

I think this issue is already too confusing at this point, so let's clarify.

Omni offers access to the Talos API through Omni itself. You don't need connectivity to the Talos machines when using Omni, but you need to be able to access Omni's endpoint. Omni offers a special form of talosconfig which allows access to the Talos API via Omni; it provides a unified authn/authz experience tied to Omni users. On top of that, Omni offers a "magic" name resolver for the --nodes flag, which resolves hostnames and machine UUIDs to node addresses.

The second question is Talos API access without Omni. In this case you need a direct connection to the controlplane nodes of the cluster (endpoints in talosctl terms). The Talos API accepts as the value of the --nodes flag either IP addresses or any other name that can be resolved via DNS. Talos 1.7+ supports a feature to inject machines' hostnames into the DNS resolver.
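
To make the kinds of --nodes values discussed here concrete, the sketch below sorts an identifier into UUID, IPv4, IPv6, or hostname. It is illustrative only: node_id_kind is a hypothetical name, the patterns are simple heuristics, and this is not how Omni's resolver is implemented.

```shell
#!/usr/bin/env bash
# Hypothetical classifier for a --nodes value. It only illustrates the
# distinction drawn in this thread; it is not Omni's resolver.
node_id_kind() {
  local id="$1"
  local uuid_re='^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
  local ipv4_re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
  if [[ $id =~ $uuid_re ]]; then
    echo "uuid"       # resolved by Omni to a node address
  elif [[ $id =~ $ipv4_re ]]; then
    echo "ipv4"       # a normal node address
  elif [[ $id == *:* ]]; then
    echo "ipv6"       # may be a SideroLink address, which is not user-facing
  else
    echo "hostname"   # resolved by Omni, or by DNS when talking to Talos directly
  fi
}
```

For instance, `node_id_kind talos-ykx-moz` reports a hostname, which per the explanation above resolves through Omni's resolver but not through a local DNS stub such as 127.0.0.53.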

bauerjs1 commented 3 months ago

@smira @Unix4ever I have the same issue but the tips from above don't seem to help. Should I open a new issue?

With node ID:

talosctl -n 75a22d42-86a6-c872-c6f8-2d8c722a900e dashboard

failed to get node "75a22d42-86a6-c872-c6f8-2d8c722a900e" version: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.22.188:50000: i/o timeout"

With hostname:

talosctl -n some-talos-node dashboard

failed to get node "some-talos-node" version: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.22.188:50000: i/o timeout"

The IP address in the error messages is the public node IP. So it looks to me like Omni refuses to use the Wireguard tunnel (it'd be an IPv6 address then, right?). I did not open port 50000 on the public interface on purpose, because I guess the Talos API should be called via the tunnel.
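
Because the diagnosis here hinges on which address was actually dialed, it can help to pull that address out of the error text mechanically. A minimal sketch, assuming error strings shaped like the ones quoted in this thread; dialed_addr is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the "host:port" that was dialed from a
# talosctl/gRPC error message of the form
#   ... dial tcp <host>:<port>: i/o timeout
# Prints nothing when the message contains no such address (for example,
# a DNS lookup failure that starts with "dial tcp: lookup ...").
dialed_addr() {
  sed -nE 's/.*dial tcp ([^ ]+:[0-9]+): .*/\1/p' <<<"$1"
}
```

Calling `dialed_addr "$(talosctl -n some-talos-node dashboard 2>&1)"` on the failure above would print 10.0.22.188:50000, which can then be compared with the addresses Omni lists for the machine.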

Talosconfig:

context: omni-poc
contexts:
    omni-poc:
        endpoints:
            - https://omni.mycompany.com/
        auth:
            siderov1:
                identity: me@mycompany.com

Omni logs similar errors:

{"level":"warn","ts":1723622368.7142055,"caller":"zap/options.go:212","msg":"finished streaming call with code Unavailable","component":"server","grpc.start_time":"2024-08-14T07:59:28Z","system":"grpc","span.kind":"server","grpc.service":"machine.MachineService","grpc.method":"Version","peer.address":"10.67.0.34","user_agent":"grpc-go/1.62.1","request_log_initiator":"talos-backend","authenticator.identity":"me@mycompany.com","authenticator.user_id":"8a481fea-ecb7-497b-9032-e1d2c4d20503","authenticator.role":"Admin","nodes":"10.0.22.188","talos-role":"os:operator","error":"rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.22.188:50000: i/o timeout\"","grpc.code":"Unavailable","grpc.time_ms":1.467}

smira commented 3 months ago

Talos Linux (in general, with or without Omni) requires communication between the cluster nodes:

If you block this communication, some things will be broken.

Documentation.

bauerjs1 commented 3 months ago

Sorry I was unclear about the firewall settings – they only apply for layer 3 connections (e.g. from my PC or from Omni). Talos nodes themselves can reach each other just fine since I have them on the same network.

Omni's WireGuard endpoint is whitelisted and the connection works, but I'm still hitting the connection timeouts with talosctl via Omni. That is why I assume that Omni does not use the tunnel in this case.

smira commented 3 months ago

The error says that they can't talk to each other, but I can't inspect that further from this point.

bauerjs1 commented 3 months ago

Thanks for the feedback! ATM I cannot investigate further, since our test nodes are being used for something else, but I will try to get back to this. I also had KubeSpan enabled in the cluster with default settings and I wonder if this made any difference.

danktec commented 1 month ago

As soon as I specified --cluster=talos-default, the commands started working for me.

(I was also previously seeing this error when targeting nodes on their public NAT address, while it would succeed locally.)