siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[bug] controller-runtime error #310

Closed xerxist closed 3 weeks ago

xerxist commented 1 month ago

Is there an existing issue for this?

Current Behavior

Talos dmesg is giving these errors every minute.

[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: EOF\""}

When a node is restarted, Omni keeps reporting that it's booting. After restarting the Omni container itself, that problem goes away, but the error above keeps coming back every minute.

Omni version: v0.36.0-beta.0-19-g22e3acf

Tested Talos versions: 1.7.2, 1.7.3, 1.7.4

The Kubernetes cluster itself is fine; something is going wrong with the connection between Omni and the Talos node.

Expected Behavior

The node reboots and reconnects without Omni having to be restarted.

Steps To Reproduce

As mentioned above.

What browsers are you seeing the problem on?

Microsoft Edge

Anything else?

No response

smira commented 1 month ago

This is not a problem in Talos itself, but something is wrong with the way you run Omni. We can't guess what is wrong.

xerxist commented 1 month ago

I run Omni in Docker as described in the documentation. I need to try going back to a version before 0.36, as I hadn't seen this happen before. The only other change I can think of is the upgrade to Ubuntu 24.04 LTS on the host that runs Docker.

xerxist commented 4 weeks ago

Unfortunately I couldn't revert to an older version: the database gets updated, and Omni doesn't allow me to use anything lower. I've tried the new beta 0.37, but the problem is still there. Everything is functional, though. I'm not sure what it's trying to do that results in this error. Could Cilium be causing the issue?

Let me know if you need more information about the environment.

smira commented 4 weeks ago

You need to find it yourself: something is blocking the connections (Cilium host firewall, for example?), or Omni is not running correctly with its ports exposed properly.

As it's your own on-prem version, you might need to dig into that yourself. If you have a support contract, you can reach out via support channels.

kenlasko commented 3 weeks ago

I see the same frequent error in all my nodes' logs. It occurs roughly every 1m15s, but does vary +/- 15s. Cilium host firewall is not in use, and Omni is running in a Docker container in host networking mode (as per the docs). It doesn't appear to have any negative effects.

[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: EOF\""}

Every once in a while, I'll see a different but related error (the source IPv6 address is different for each host):

[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: read tcp [fdae:41e4:649b:9303:62e8:5423:7a06:4b0c]:51402->[fdae:41e4:649b:9303::1]:8090: read: connection reset by peer\""}

Unsure if it's related, but looking at the Omni container logs, I do see frequent warnings like this:

{"level":"warn","ts":1718110585.813124,"caller":"device/send.go:138","msg":"peer(ca6s…hswI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}

smira commented 3 weeks ago

This is an error about communication over the Wireguard/SideroLink tunnel.

Talos can retry on errors and work around networking issues, but as this is an on-prem installation, we can't troubleshoot your networking/on-prem setup.

If you have a support contract, please reach out via support channels.

kenlasko commented 3 weeks ago

I don't have a support contract, as this is a home-based setup. My network is the very definition of "simple". I thought that since it's not limited to a single user, it might be worth investigating, as it might be a bug.

smira commented 3 weeks ago

Yes, please investigate this. If you can find a root cause, that would be great!

xerxist commented 3 weeks ago

I've already tried enabling IPv6 inside the cluster, but it didn't make any difference.

The cluster was upgraded from 1.6.x to 1.7.4, so it might be interesting to test a cluster built from 1.7.4 and see if the error still persists. Omni makes this easy anyway :)

smira commented 3 weeks ago

This is not related to IPv6/IPv4 directly, as IPv6 is used inside the tunnel. The problem might be packet drops, MTU, or something similar.

kenlasko commented 3 weeks ago

I'm digging a bit deeper. For each node, this error:

[talos] controller failed {"component": "controller-runtime", "controller": "v1alpha1.EventsSinkController", "error": "error publishing event: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: read tcp [fdae:41e4:649b:9303:62e8:5423:7a06:4b0c]:51402->[fdae:41e4:649b:9303::1]:8090: read: connection reset by peer\""}

...contains what I assume is the node's internal Wireguard IPv6 address. When I run talosctl netstat -n NODENAME | grep 4b0c to focus only on the node's Wireguard NIC, I can see numerous connections to fdae:41e4:649b:9303::1, but none of them are to port 8090. That's the port the events sink appears to listen on. I can connect to that port on my Omni instance from my workstation.

Running a PCAP to see if anything interesting shows up.
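
The destination of the failing connection can be read straight out of the log line. Below is a minimal Python sketch, assuming the "[talos] message {json}" format shown above. The helper names parse_talos_line and tcp_endpoints are hypothetical and not part of talosctl.

```python
import json
import re

# Sample line copied from the log output quoted above.
LINE = (
    '[talos] controller failed {"component": "controller-runtime", '
    '"controller": "v1alpha1.EventsSinkController", '
    '"error": "error publishing event: rpc error: code = Unavailable '
    'desc = connection error: desc = \\"error reading server preface: '
    'read tcp [fdae:41e4:649b:9303:62e8:5423:7a06:4b0c]:51402->'
    '[fdae:41e4:649b:9303::1]:8090: read: connection reset by peer\\""}'
)

# Matches "[src]:sport->[dst]:dport" for bracketed IPv6 endpoints.
ENDPOINTS = re.compile(r"\[([0-9a-fA-F:]+)\]:(\d+)->\[([0-9a-fA-F:]+)\]:(\d+)")


def parse_talos_line(line):
    """Split a '[talos] message {json}' line into (message, fields)."""
    prefix, brace, rest = line.partition("{")
    if not brace:
        raise ValueError("no JSON payload found")
    message = prefix.replace("[talos]", "", 1).strip()
    return message, json.loads(brace + rest)


def tcp_endpoints(error_text):
    """Return ((src_ip, src_port), (dst_ip, dst_port)), or None if absent."""
    m = ENDPOINTS.search(error_text)
    if m is None:
        return None
    return (m.group(1), int(m.group(2))), (m.group(3), int(m.group(4)))


message, fields = parse_talos_line(LINE)
src, dst = tcp_endpoints(fields["error"])
print(fields["controller"], "is publishing to", dst)
```

The shorter "error reading server preface: EOF" variant carries no address pair, so tcp_endpoints returns None for it.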

Unix4ever commented 3 weeks ago

Interesting. I thought we were using 8091 for the events sink, but it looks like we changed it to 8090 at some point.

xerxist commented 3 weeks ago

My Docker config says --event-sink-port=8091, but the config in Omni says 8090. Let me give it a try.

Unix4ever commented 3 weeks ago

Basically, you can try dropping --event-sink-port=8091 from the Omni params.

xerxist commented 3 weeks ago

I've swapped the ports and the message is gone.

'--event-sink-port=8090' '--bind-addr=0.0.0.0:443' '--siderolink-api-bind-addr=0.0.0.0:8091'

kenlasko commented 3 weeks ago

Should siderolink port be moved to 8091? Everything in Omni seems to be expecting siderolink to be on 8090. I can't imagine that both eventsink and siderolink would use the same port. But maybe it can work, because eventsink is on the Wireguard interface and siderolink is on the eth0 interface?

Essentially, I'm asking if siderolink should remain on the existing port 8090.

kenlasko commented 3 weeks ago

Nope, trying to leave siderolink on 8090 broke Omni. Omni seems to work with siderolink on 8091, but I suspect that node joins will no longer work, since it's being advertised as being on 8090.

xerxist commented 3 weeks ago

Yes, after a reboot it broke...

kenlasko commented 3 weeks ago

Try changing siderolink to use 8091:

- --siderolink-api-bind-addr=0.0.0.0:8091

Unix4ever commented 3 weeks ago

There are 2 places where it's configured:

Even if you change the settings in Omni, the Talos node will keep joining on 8090.

It works on the same port because they listen on different interfaces: siderolink-api runs on whatever you set in machine-api-bind-addr, while the events sink starts only on the [fdae:41e4:649b:9303::1] address.

Unix4ever commented 3 weeks ago

The same goes for siderolink-api-bind-addr. The nodes will still be pointed to what they got from the installation media.

kenlasko commented 3 weeks ago

@Unix4ever, what do you think is the best approach to fix this port swap issue for existing clusters? My cluster is error-free at the moment, but from what you're saying, it will fall apart the next time a node reboots?

kenlasko commented 3 weeks ago

Or an easier solution would be to change the Omni code to use 8091 again, since it doesn't seem to respect the settings defined in Docker?

Unix4ever commented 3 weeks ago

We use 8090 for siderolink events as the default value since forever I think. When you first joined the nodes did you use installation media from the Omni instance?

Overall, I think it will keep working, as events have been optional since Omni 0.36: Omni now also uses a pull model in addition to push to get the current machine state.

Unix4ever commented 3 weeks ago

I need to dig more into it. I will verify that we properly generate the kernel params for Talos in Omni if a custom events port is used.

If your machines are pointed to siderolink at 8090 it should work if you switch Omni to use 8090 for the events. It does work in the development setup and SaaS like that.

kenlasko commented 3 weeks ago

It didn't work when I tried leaving siderolink at 8090. Omni threw a fit about port in use and wouldn't start. The generated kernel params did update to use my new ports (eventsink=8090, siderolink=8091), but what about existing clusters/nodes that were built using siderolink at 8090?

Unix4ever commented 3 weeks ago

Omni will run if you drop --machine-api-bind-addr=0.0.0.0:8090 (or --siderolink-api-bind-addr=0.0.0.0:8090, if you're using the deprecated flag).
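
Both the port-in-use failure and the advice to drop the wildcard bind flag fit a simple rule: two listeners can share a port only when they bind different specific addresses. The Python sketch below is a simplified model of that rule, not Omni's actual logic. It assumes a wildcard bind (0.0.0.0 or ::) overlaps every local address, including IPv6 ones, which matches the failure reported above but in practice depends on the OS and on how the server opens its sockets. The names split_host_port and listeners_conflict are hypothetical, and 192.168.1.17 appears as a bind address purely for illustration.

```python
import ipaddress


def split_host_port(addr):
    """Split 'host:port' or '[v6]:port' into (ip_address, port)."""
    host, _, port = addr.rpartition(":")
    return ipaddress.ip_address(host.strip("[]")), int(port)


def listeners_conflict(a, b):
    """Simplified model: same port AND overlapping bind address.

    Assumption: a wildcard address overlaps everything, across
    address families. Socket options such as SO_REUSEPORT are ignored.
    """
    ip_a, port_a = split_host_port(a)
    ip_b, port_b = split_host_port(b)
    if port_a != port_b:
        return False
    if ip_a.is_unspecified or ip_b.is_unspecified:
        return True
    return ip_a == ip_b


EVENT_SINK = "[fdae:41e4:649b:9303::1]:8090"

# Wildcard bind on the same port as the event sink: modelled as a clash.
print(listeners_conflict("0.0.0.0:8090", EVENT_SINK))
# Wildcard bind on a different port: no clash.
print(listeners_conflict("0.0.0.0:8091", EVENT_SINK))
# Specific (illustrative) address on the same port: no clash.
print(listeners_conflict("192.168.1.17:8090", EVENT_SINK))
```

Under this model, moving either listener to another port or binding the API to a specific address removes the clash, which is consistent with what was observed in the thread.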

Anyways. I suspect that there might be some bug.

I remember one thing that changed recently: all joined machines now get a partial Omni join config patch. I will verify that a custom events port generates a valid join config patch.

Unix4ever commented 3 weeks ago

Can you please check the kernel param talos.events.sink= value of the machine with the issue? Kernel params are printed in the logs right after the machine is started. And then run:

omnictl get configpatch -o yaml 950-maintenance-config-<machine-uuid>

And send the Talos event sink config from the response.

I think I'm starting to understand the problem.

Need to compare both places.

xerxist commented 3 weeks ago

If I change --event-sink-port=8091 to 8090 and --siderolink-api-bind-addr=0.0.0.0:8090 to 8091, then reboot the Talos node, it doesn't connect anymore. Once I change it back to 8091, it reports back in, even though the config clearly states it points to 8090.

If I look at the machine-config from Download Machine Join Config it stays at 8090 whether I change the port in the docker config or not.

xerxist commented 3 weeks ago

talos.events.sink=[fdae:41e4:649b:9303::1]:8091

Unix4ever commented 3 weeks ago

> If I look at the machine-config from Download Machine Join Config it stays at 8090 whether I change the port in the docker config or not.

Yeah, that's definitely a bug.

kenlasko commented 3 weeks ago

I reverted to the original configuration because the node wouldn't report in with the switched ports.

Current Omni docker-config command section:

      - --account-id=<redacted>
      - --name=onprem-omni
      - --cert=/tls.crt
      - --key=/tls.key
      - --siderolink-api-cert=/tls.crt
      - --siderolink-api-key=/tls.key
      - --private-key-source=file:///omni.asc
      - --event-sink-port=8091
      - --bind-addr=0.0.0.0:4443
      - --siderolink-api-bind-addr=0.0.0.0:8090
      - --k8s-proxy-bind-addr=0.0.0.0:8100
      - --advertised-api-url=https://omni.ucdialplans.com/
      - --siderolink-api-advertised-url=https://omni.ucdialplans.com:8090/
      - --siderolink-wireguard-advertised-addr=192.168.1.17:50180
      - --advertised-kubernetes-proxy-url=https://omni.ucdialplans.com:8100/
      - --auth-auth0-enabled=true
      - --auth-auth0-domain=ucdialplans.us.auth0.com
      - --auth-auth0-client-id=<redacted>
      - --initial-users=ken.lasko@gmail.com

950-maintenance-config

metadata:
    namespace: default
    type: ConfigPatches.omni.sidero.dev
    id: 950-maintenance-config-11b61e33-615e-b264-459a-94c691a81113
    version: 3
    owner: MaintenanceConfigPatchController
    phase: running
    created: 2024-06-06T16:53:43Z
    updated: 2024-06-11T15:48:57Z
    labels:
        omni.sidero.dev/machine: 11b61e33-615e-b264-459a-94c691a81113
        omni.sidero.dev/system-patch:
spec:
    data: |
        apiVersion: v1alpha1
        kind: SideroLinkConfig
        apiUrl: https://omni.ucdialplans.com:8090/?grpc_tunnel=false&jointoken=<redacted>
        ---
        apiVersion: v1alpha1
        kind: EventSinkConfig
        endpoint: '[fdae:41e4:649b:9303::1]:8090'
        ---
        apiVersion: v1alpha1
        kind: KmsgLogConfig
        name: omni-kmsg
        url: tcp://[fdae:41e4:649b:9303::1]:8092
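
The mismatch becomes visible when the talos.events.sink kernel parameter is compared with the EventSinkConfig endpoint in the patch above. Below is a minimal Python sketch of that comparison. The kernel parameter is the value xerxist reported and the patch is the one shown here, so the two come from different installations that both ran with --event-sink-port=8091. The helper names are hypothetical, and the patch is read with plain string handling rather than a YAML parser, which only works for flat "key: value" documents like these.

```python
# Kernel parameter as reported earlier in the thread.
KERNEL_CMDLINE = "talos.events.sink=[fdae:41e4:649b:9303::1]:8091"

# The spec.data section of the 950-maintenance-config patch shown above.
CONFIG_PATCH_DATA = """\
apiVersion: v1alpha1
kind: SideroLinkConfig
apiUrl: https://omni.ucdialplans.com:8090/?grpc_tunnel=false&jointoken=<redacted>
---
apiVersion: v1alpha1
kind: EventSinkConfig
endpoint: '[fdae:41e4:649b:9303::1]:8090'
---
apiVersion: v1alpha1
kind: KmsgLogConfig
name: omni-kmsg
url: tcp://[fdae:41e4:649b:9303::1]:8092
"""


def kernel_param(cmdline, key):
    """Return the value of key=value from a space-separated command line."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == key:
            return value
    return None


def event_sink_endpoint(patch_data):
    """Return the endpoint of the EventSinkConfig document, if present."""
    for doc in patch_data.split("\n---\n"):
        fields = {}
        for line in doc.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip().strip("'\"")
        if fields.get("kind") == "EventSinkConfig":
            return fields.get("endpoint")
    return None


def port_of(endpoint):
    """Extract the port from '[v6]:port' or 'host:port'."""
    return int(endpoint.rsplit(":", 1)[1])


from_kernel = kernel_param(KERNEL_CMDLINE, "talos.events.sink")
from_patch = event_sink_endpoint(CONFIG_PATCH_DATA)
if port_of(from_kernel) != port_of(from_patch):
    print(f"mismatch: kernel param {from_kernel} vs config patch {from_patch}")
```

A check like this only flags the disagreement between the two sources; it does not say which of the two ports the running Omni instance is actually serving.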

Unix4ever commented 3 weeks ago

Thanks a lot for the data. I will look into it when I get done with whatever is on my plate right now.

kenlasko commented 3 weeks ago

You're welcome. Can't seem to find the kernel params in the logs. I'm looking through the logs in talosctl dashboard. I can give you what Omni says it should be.

xerxist commented 3 weeks ago

Thank you!

Unix4ever commented 3 weeks ago

After my PR gets merged, you should be able to keep using Omni with the --event-sink-port=8091 flag.

kenlasko commented 3 weeks ago

Excellent! Thanks for the quick turnaround.

xerxist commented 3 weeks ago

Cheers! That was quick!

kenlasko commented 3 weeks ago

@xerxist, Omni v0.37.4 is out with the fix! Works like a charm.

xerxist commented 3 weeks ago

Will try now 👍🏼

It updated the EventSinkConfig 🥳