This is not a problem in Talos itself, but something is wrong with the way you run Omni. We can't guess what is wrong.
I run Omni in Docker as stated in the documentation. I need to try going back to a version before 0.36, as I haven't seen this happening before. The only thing I can think of is the upgrade to Ubuntu 24.04 LTS, which Docker runs on.
Unfortunately I couldn't revert to an older version, as the database gets updated and Omni doesn't allow me to use anything lower. I've tried the new beta 0.37, but the problem is still there. Everything is functional, though. I'm not sure what it's trying to do that results in this error. Could it be Cilium that is causing the issue?
Let me know if you need more information about the environment.
You need to find it yourself: something is blocking the connections (the Cilium host firewall, for example?), or Omni is not running correctly with its ports exposed properly.
As it's your own on-prem installation, you might need to dig into that yourself. If you have a support contract, you can reach out via support channels.
I see the same frequent error in all my nodes' logs. It occurs roughly every 1m15s, but varies by +/- 15s. The Cilium host firewall is not in use, and Omni is running in a Docker container in host networking mode (as per the docs). It doesn't appear to have any negative effects.
[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: EOF\""}
Every once in a while, I'll see a different, but related error (the source IPv6 address is different for each host):
[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: read tcp [fdae:41e4:649b:9303:62e8:5423:7a06:4b0c]:51402->[fdae:41e4:649b:9303::1]:8090: read: connection reset by peer\""}
Unsure if it's related, but looking at the Omni container logs, I do see frequent warnings like this:
{"level":"warn","ts":1718110585.813124,"caller":"device/send.go:138","msg":"peer(ca6s…hswI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
This is an error about communication over the WireGuard/SideroLink tunnel.
Talos can retry on errors and work around networking issues, but as this is an on-prem installation, we can't troubleshoot your networking/on-prem setup.
If you have a support contract, please reach out via support channels.
I don't have a support contract, as this is a home-based setup. My network is the very definition of "simple". I thought that since it's not limited to a single user, it might be worth investigating, as it might be a bug.
Yes, please investigate this, if you can find a root cause, that would be great!
I've already tried enabling IPv6 inside the cluster, but it didn't make any difference.
The cluster was upgraded from 1.6.x to 1.7.4, so it might be interesting to test a cluster built from 1.7.4 and see if the error still persists. Omni makes this easy anyway :)
This is not related to IPv6/IPv4 directly, as IPv6 is used inside the tunnel. The problem might be packet drops, MTU, or something similar.
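A quick way to check for path MTU issues between a node's network and the Omni host is to ping with the don't-fragment bit set from a Linux machine on that network (a sketch only: <omni-host> is a placeholder and the payload sizes are just starting points to bisect from):
ping -M do -s 1472 <omni-host>   # 1472 + 28 header bytes = a full 1500-byte frame
ping -M do -s 1392 <omni-host>   # roughly what fits inside a default WireGuard MTU of 1420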
I'm digging a bit deeper. For each node, this error:
[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: read tcp [fdae:41e4:649b:9303:62e8:5423:7a06:4b0c]:51402->[fdae:41e4:649b:9303::1]:8090: read: connection reset by peer\""}
...contains what I assume is the node's internal WireGuard IPv6 address. When I run talosctl netstat -n NODENAME | grep 4b0c to focus only on the node's WireGuard NIC, I can see numerous connections to fdae:41e4:649b:9303::1, but none of them are to port 8090. That's the port the EventSinkController appears to listen on. I can connect to that port on my Omni instance from my workstation.
Running a PCAP to see if anything interesting shows up.
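Something like this should capture just the SideroLink side (assuming the node's WireGuard interface is named siderolink; adjust if yours is different):
talosctl -n NODENAME pcap --interface siderolink > siderolink.pcap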
Interesting, I thought we were using 8091 for the events sink, but it looks like we changed it to 8090 at some point.
My Docker config says --event-sink-port=8091, but the config in Omni says 8090. Let me give it a try.
Basically, you can try to drop --event-sink-port=8091 from the omni params.
I've swapped the ports and the message is gone.
'--event-sink-port=8090' '--bind-addr=0.0.0.0:443' '--siderolink-api-bind-addr=0.0.0.0:8091'
Should the siderolink port be moved to 8091? Everything in Omni seems to be expecting siderolink to be on 8090. I can't imagine that both the event sink and siderolink would use the same port. But maybe it can work, because the event sink is on the WireGuard interface and siderolink is on the eth0 interface?
Essentially, I'm asking if siderolink should remain on the existing port 8090.
Nope, trying to leave siderolink on 8090 broke Omni. Omni seems to work with siderolink on 8091, but I suspect that node joins will no longer work, since it's being advertised as being on 8090.
Yes, after a reboot it broke...
Try changing siderolink to use 8091:
- --siderolink-api-bind-addr=0.0.0.0:8091
There are 2 places where it's configured:
- --event-sink-port
- talos.events.sink=[fdae:41e4:649b:9303::1]:8090
Even if you change the settings in Omni, the Talos node will keep joining on 8090.
It works on the same port because they listen on different interfaces: siderolink-api runs on whatever you set in machine-api-bind-addr, while the events listener starts only on the [fdae:41e4:649b:9303::1] address. The same goes for siderolink-api-bind-addr. The nodes will still be pointed to whatever they got from the installation media.
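One way to sanity-check that on the Omni host, as a sketch that assumes the container uses host networking and a kernel WireGuard device so the sockets appear in the host's network namespace:
sudo ss -ltnp | grep -E ':(8090|8091)'   # events listener on the fdae:... address, machine API listener on the bind address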
@Unix4ever, what do you think is the best approach to fix this port swap issue for existing clusters? My cluster is error-free at the moment, but from what you're saying, it will fall apart the next time a node reboots?
Or an easier solution would be to change the Omni code to use 8091 again, since it doesn't seem to respect the settings defined in Docker?
We have used 8090 for siderolink events as the default value since forever, I think. When you first joined the nodes, did you use installation media from the Omni instance?
Overall, I think it will keep working, as events are optional since Omni 0.36: Omni now also uses a pull model, in addition to push, to get the current machine state.
I need to dig more into it; I'll verify that we properly generate the kernel params for Talos in Omni when a custom events port is used.
If your machines are pointed to siderolink at 8090, it should work if you switch Omni to use 8090 for the events.
It does work like that in the development setup and in SaaS.
It didn't work when I tried leaving siderolink at 8090. Omni threw a fit about the port being in use and wouldn't start. The generated kernel params did update to use my new ports (eventsink=8090, siderolink=8091), but what about existing clusters/nodes that were built with siderolink at 8090?
Omni will run if you drop --machine-api-bind-addr=0.0.0.0:8090, or --siderolink-api-bind-addr=0.0.0.0:8090 if you're using the deprecated flag.
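For example, a sketch of the resulting flags (the exact combination depends on your setup):
- --event-sink-port=8090
- --machine-api-bind-addr=0.0.0.0:8090
# no --siderolink-api-bind-addr, since per the above it's the deprecated alias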
Anyway, I suspect that there might be some bug. I remember one thing that got changed recently: now all joined machines get a partial Omni join config patch. I will verify that a custom events port generates a valid join config patch.
Can you please check the talos.events.sink= kernel param value on the machine with the issue? Kernel params are printed in the logs right after the machine is started.
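If the boot logs have already scrolled past, a rough alternative is to read the cmdline directly (assuming the node is reachable with talosctl):
talosctl -n <node-ip> read /proc/cmdline | tr ' ' '\n' | grep talos.events.sink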
And then run:
omnictl get configpatch -o yaml 950-maintenance-config-<machine-uuid>
And send the Talos event sink config from the response.
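To pull out just that part, something like this should work (the grep filter is only illustrative):
omnictl get configpatch -o yaml 950-maintenance-config-<machine-uuid> | grep -A 2 'kind: EventSinkConfig'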
I think I'm starting to understand the problem.
Need to compare both places.
If I change --event-sink-port=8091 to 8090 and --siderolink-api-bind-addr=0.0.0.0:8090 to 8091 and reboot the Talos node, it doesn't connect anymore. Once I change it back to 8091, although the config clearly states it points to 8090, it reports back in.
If I look at the machine config from Download Machine Join Config, it stays at 8090 whether I change the port in the Docker config or not.
talos.events.sink=[fdae:41e4:649b:9303::1]:8091
> If I look at the machine config from Download Machine Join Config, it stays at 8090 whether I change the port in the Docker config or not.
Yeah, that's definitely a bug.
I reverted back to the original config because the node wouldn't report in with the switched ports.
Current Omni docker-config command section:
- --account-id=<redacted>
- --name=onprem-omni
- --cert=/tls.crt
- --key=/tls.key
- --siderolink-api-cert=/tls.crt
- --siderolink-api-key=/tls.key
- --private-key-source=file:///omni.asc
- --event-sink-port=8091
- --bind-addr=0.0.0.0:4443
- --siderolink-api-bind-addr=0.0.0.0:8090
- --k8s-proxy-bind-addr=0.0.0.0:8100
- --advertised-api-url=https://omni.ucdialplans.com/
- --siderolink-api-advertised-url=https://omni.ucdialplans.com:8090/
- --siderolink-wireguard-advertised-addr=192.168.1.17:50180
- --advertised-kubernetes-proxy-url=https://omni.ucdialplans.com:8100/
- --auth-auth0-enabled=true
- --auth-auth0-domain=ucdialplans.us.auth0.com
- --auth-auth0-client-id=<redacted>
- --initial-users=ken.lasko@gmail.com
950-maintenance-config
metadata:
    namespace: default
    type: ConfigPatches.omni.sidero.dev
    id: 950-maintenance-config-11b61e33-615e-b264-459a-94c691a81113
    version: 3
    owner: MaintenanceConfigPatchController
    phase: running
    created: 2024-06-06T16:53:43Z
    updated: 2024-06-11T15:48:57Z
    labels:
        omni.sidero.dev/machine: 11b61e33-615e-b264-459a-94c691a81113
        omni.sidero.dev/system-patch:
spec:
    data: |
        apiVersion: v1alpha1
        kind: SideroLinkConfig
        apiUrl: https://omni.ucdialplans.com:8090/?grpc_tunnel=false&jointoken=<redacted>
        ---
        apiVersion: v1alpha1
        kind: EventSinkConfig
        endpoint: '[fdae:41e4:649b:9303::1]:8090'
        ---
        apiVersion: v1alpha1
        kind: KmsgLogConfig
        name: omni-kmsg
        url: tcp://[fdae:41e4:649b:9303::1]:8092
Thanks a lot for the data. I will look into it when I get done with whatever is on my plate right now.
You're welcome. I can't seem to find the kernel params in the logs. I'm looking through the logs in talosctl dashboard. I can give you what Omni says it should be.
Thank you!
After my PR gets merged, you should be able to keep using Omni with the --event-sink-port=8091 flag.
Excellent! Thanks for the quick turnaround.
Cheers! That was quick!
@xerxist, Omni v0.37.4 is out with the fix! Works like a charm.
Will try now 👍🏼
It updated the EventSinkConfig 🥳
Is there an existing issue for this?
Current Behavior
Talos dmesg is giving these errors every minute.
[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: EOF\""}
When restarting a node, Omni keeps reporting that it's booting. After a restart of the Omni container itself the problem is gone, but the above error keeps coming back every minute.
Omni version: v0.36.0-beta.0-19-g22e3acf. Tested Talos versions: 1.7.2, 1.7.3, 1.7.4.
The Kubernetes cluster itself is fine; something is going wrong with the connection between Omni and the Talos node.
Expected Behavior
Reboot and reconnect without having to restart Omni.
Steps To Reproduce
As mentioned above.
What browsers are you seeing the problem on?
Microsoft Edge
Anything else?
No response