Closed keatsfonam closed 4 months ago
Can you please attach an omnictl support bundle?
Attached here: support.zip
The support bundle is empty, so not sure what is going on.
If Omni can reach the Talos API, talosctl using the talosconfig downloaded from Omni should be able to reach it as well.
@smira Sorry about that. I fixed the link, it wasn't zipped properly
I agree that it should be able to reach, I am just not sure how to troubleshoot further. Everything works perfectly except talosctl. To me it looks like it is unaware that it should use the wireguard link to talk to the node
Having the same issue here. I was confused to see a dial tcp <target node IP>:50000: i/o timeout error, because I thought that talosctl commands should be proxied over Omni and SideroLink and not use the "direct" connection to the machine's port 50000.
Thanks @keatsfonam for digging into this and finding out that SideroLink is only used if one passes the IPv6 address from the Wireguard VPN to talosctl. I think the expected behaviour should be that Omni always uses the SideroLink tunnel.
For the machines running in the cluster any IP should work. This kind of command:
talosctl get rd -n 10.5.0.2 --cluster=talos-default
But if you are using the global Omni talosconfig and trying to reach a machine which isn't allocated to a cluster, you should use either the node UUID or the IPv6 SideroLink endpoint:
talosctl get rd -n <uuid>
I'd say it's easier to just always use node UUID, as it works both in unallocated mode and when the machine is part of a cluster.
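To keep the two addressing forms above side by side, here is a small dry-run sketch. It only prints the talosctl command and does not run it. The helper name node_cmd is made up for illustration and is not part of talosctl; the commands it prints are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: print (do not run) the talosctl command for a node.
# node_cmd is a hypothetical helper name, not a talosctl feature.
node_cmd() {
    node="$1"      # node UUID, hostname, or IP
    cluster="$2"   # optional cluster name

    if [ -n "$cluster" ]; then
        # Node is part of a cluster: any node address should work.
        printf 'talosctl get rd -n %s --cluster=%s\n' "$node" "$cluster"
    else
        # Global Omni talosconfig: address the machine by its UUID.
        printf 'talosctl get rd -n %s\n' "$node"
    fi
}

node_cmd 10.5.0.2 talos-default
node_cmd '<uuid>'
```

Passing the UUID in both cases is probably the simplest habit, since it works whether or not the machine is allocated.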
@Unix4ever thanks for clarifying
When I use the UUID it points to the IPv4 address and has the same problem:
❯ talosctl get rd -n 4c4c4544-XXX-XXXX-...
rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 192.168.100.150:50000: i/o timeout"
It looks like there is a built-in dns resolver. I enabled debug logs and I see:
omni | {"level":"debug","ts":1721928492.5561075,"caller":"dns/service.go:209","msg":"set node DNS entry","id":"4c4c4544-XXXX-XXXX-..","cluster":"test-cluster","node_name":"talos-627-jvm","address":"192.168.100.150"}
Are you saying this should be the ipv6 management address?
Update on above: I was playing with multiple talos configs to debug this and removed that. This seems to work now.
--
If I use the management address directly it asks for a client cert and doesn't use the pgp key just generated:
❯ talosctl get rd -n fdae:XXXX:XXXX-...
rpc error: code = Unavailable desc = last connection error: connection error: desc = "error reading server preface: remote error: tls: certificate required"
I got the cert/key using the break glass and put it into the config and it works with the IPv6 address. It also works fine for the IPv4 address if I remove the omni endpoint as expected.
@keatsfonam I looked over the updated support.zip and I can't find anything there.
So two options left:
talosconfig
- talosctl config info
talosctl command.
Note: if you use the break-glass talosconfig, then Omni won't provide connectivity to your cluster, so you need to be able to reach it yourself.
The issue with mapping the UUID to management address seems to be a mistake on my part with the multiple talosconfigs I was testing to debug
The core issue seems to be confusion about how the nodes are addressed. The docs are not clear that we are supposed to use the UUID to address the nodes. My expectation is that SideroLink is used for talking to the nodes when using the address listed in the UI.
You should never use SideroLink addresses, they are not user-facing. You should use a normal node address or the UUID which Omni would resolve to a node address.
And also a hostname should work
@smira thanks. I'll close out this issue then
I expect hostname to work and it's actually the first thing I tried but there doesn't seem to be a DNS entry for node name in the resolver:
❯ talosctl get rd -n talos-ykx-moz
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp: lookup talos-ykx-moz on 127.0.0.53:53: no such host"
It should work only via Omni.
Talos can do as well, but it's not enabled by default.
@smira can you clarify what exactly is not enabled by default with talos? This is via the omni endpoint so not quite sure what you mean
I think this issue has become far too confusing at this point, so let's clarify.
Omni offers access to the Talos API through Omni itself. You don't need to have connectivity to the Talos machines when using Omni, but you need to be able to access Omni's endpoint. Omni offers a special form of talosconfig which allows access to the Talos API via Omni; it provides a unified authn/authz experience which is tied to Omni users. On top of that, Omni offers a "magic" name resolver for the --nodes flag, which resolves hostnames and machine UUIDs to node addresses.
The second question is Talos API access without Omni. In this case you need to have a direct connection to the controlplane nodes of the cluster (endpoints in talosctl terms). The Talos API accepts either IP addresses as the value of the --nodes flag or any other name that can be resolved via DNS. Talos 1.7+ supports a feature to inject machines' hostnames into the DNS resolver.
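Since mixing up the two access modes came up earlier in this thread, here is a rough sketch for checking which kind of talosconfig a file is. It assumes an Omni-issued talosconfig carries a siderov1 auth block (as in the config posted in this thread) and that a direct or break-glass config embeds a client key under a key: field. The helper name talosconfig_kind is hypothetical and not part of talosctl.

```shell
#!/bin/sh
# Sketch only: guess whether a talosconfig goes through Omni or talks
# to the nodes directly. talosconfig_kind is a hypothetical helper.
talosconfig_kind() {
    file="$1"

    if grep -q '^[[:space:]]*siderov1:' "$file"; then
        # Omni-style auth block found: requests go via the Omni endpoint.
        echo omni
    elif grep -q '^[[:space:]]*key:' "$file"; then
        # Embedded client key: direct access, you must reach the nodes yourself.
        echo direct
    else
        echo unknown
    fi
}

# Example (path is an assumption; adjust to wherever your config lives):
#   talosconfig_kind "${TALOSCONFIG:-$HOME/.talos/config}"
```

Note that this looks at the whole file rather than the active context, so a file that merges several contexts will report omni as soon as any one of them uses siderov1. For the context actually in use, talosctl config info is the better check.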
@smira @Unix4ever I have the same issue but the tips from above don't seem to help. Should I open a new issue?
With node ID:
talosctl -n 75a22d42-86a6-c872-c6f8-2d8c722a900e dashboard
failed to get node "75a22d42-86a6-c872-c6f8-2d8c722a900e" version: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.22.188:50000: i/o timeout"
With hostname:
talosctl -n some-talos-node dashboard
failed to get node "some-talos-node" version: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.22.188:50000: i/o timeout"
The IP address in the error messages is the public node IP. So it looks to me like Omni refuses to use the Wireguard tunnel (it'd be an IPv6 address then, right?). I did not open port 50000 on the public interface on purpose, because I guess the Talos API should be called via the tunnel.
Talosconfig:
context: omni-poc
contexts:
omni-poc:
endpoints:
- https://omni.mycompany.com/
auth:
siderov1:
identity: me@mycompany.com
Omni logs similar errors:
{"level":"warn","ts":1723622368.7142055,"caller":"zap/options.go:212","msg":"finished streaming call with code Unavailable","component":"server","grpc.start_time":"2024-08-14T07:59:28Z","system":"grpc","span.kind":"server","grpc.service":"machine.MachineService","grpc.method":"Version","peer.address":"10.67.0.34","user_agent":"grpc-go/1.62.1","request_log_initiator":"talos-backend","authenticator.identity":"me@mycompany.com","authenticator.user_id":"8a481fea-ecb7-497b-9032-e1d2c4d20503","authenticator.role":"Admin","nodes":"10.0.22.188","talos-role":"os:operator","error":"rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.22.188:50000: i/o timeout\"","grpc.code":"Unavailable","grpc.time_ms":1.467}
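To see which address Omni actually tried to dial, the nodes field can be pulled out of log lines like the one above. This is a minimal sketch using plain sed; it assumes one JSON object per line with a "nodes":"..." string field, as in that log entry. The function name dialed_node is made up for illustration.

```shell
#!/bin/sh
# Sketch only: read Omni JSON log lines on stdin and print the value of
# the "nodes" field for each line that has one.
# dialed_node is a hypothetical helper name.
dialed_node() {
    sed -n 's/.*"nodes":"\([^"]*\)".*/\1/p'
}

# Example with a shortened log line:
printf '%s\n' '{"level":"warn","nodes":"10.0.22.188","grpc.code":"Unavailable"}' | dialed_node
```

If the printed address is the node's regular IP rather than something Omni can reach, that matches the dial timeouts reported here.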
Talos Linux (in general, with or without Omni) requires communication between the cluster nodes:
If you block this communication, some things will be broken.
Sorry I was unclear about the firewall settings – they only apply for layer 3 connections (e.g. from my PC or from Omni). Talos nodes themselves can reach each other just fine since I have them on the same network.
Omni's wireguard endpoint is whitelisted and the connection works, but I'm still hitting the connection timeouts with talosctl via Omni. Which is why I assume that Omni does not use the tunnel in this case.
The error says that they can't talk to each other, but I can't inspect that further from this point.
Thanks for the feedback! ATM I cannot investigate further, since our test nodes are being used for something else, but I will try to get back to this. I also had KubeSpan enabled in the cluster with default settings and I wonder if this made any difference.
As soon as I specified --cluster=talos-default the commands started working for me.
(I was also previously seeing this error when targeting nodes on their public NAT address, while it would succeed locally.)
Current Behavior
I am testing a single node cluster created with omni. When using talosctl through the omni endpoint it is unable to connect to the node (192.168.100.150) in question:
The node is reachable within the private network it lives in, but omni cannot route to it. omnictl and kubectl are functioning properly through an omni endpoint. All omni functionality seems to be intact.
When testing using the node itself as an endpoint traffic is flowing properly and stops at the point of asking for a client cert:
Test the node can reach its own IP:
When testing using the ipv6 management address from the tunnel talosctl can talk to the node:
Expected Behavior
talosctl can talk to the node through the omni endpoint
Steps To Reproduce
Single node cluster. Omni is on a second network and cannot route to the private network directly. Node
talosctl version: v1.7.5
omni version: v1.30.1
talosconfig:
omni flags
Anything else?
I will bring up more nodes to continue testing but I would expect it to work with a single node