Talos install error - couldn't get current server API group list: - tls: internal error

onedr0p / cluster-template

A template for deploying a Talos Kubernetes cluster including Flux for GitOps

MIT License


Talos install error - couldn't get current server API group list: - tls: internal error #1401

Closed DavidIlie closed 7 months ago

DavidIlie commented 8 months ago

Following the pathway to install Talos, I have an issue where the cluster does seem to set up but the workers do not join, and the masters show these errors:

There are errors refreshing pod status, which are related to TLS (tls: internal error), and also "controller failed" errors.

This is what I see in my terminal while setting it up:

onedr0p commented 8 months ago

I think this is expected due to etcd not being bootstrapped?

https://github.com/onedr0p/cluster-template/blob/b6234fcbad3f9b3be13f94a563b6b73cbe3741f5/.taskfiles/Talos/Taskfile.yaml#L55-L63

It doesn't seem like the above was run. Which point in the task commands did you get to, and were there any errors on the client side?

bojanraic commented 8 months ago

In addition to @onedr0p's comments, when I tried to install Talos manually, it took some time for the master to be ready. In both of the screenshots, uptime is only a few minutes so maybe it hasn't finished bootstrapping yet. I went back to k3s in the meantime, but please update us on your Talos install progress via this template and I may take another stab at it when time permits. Good luck!

DavidIlie commented 8 months ago

I left it running the whole night yesterday and the same thing happened. I am also fairly sure all the scripts are running, and before the node first reboots/loads there are errors about something like an "admin" certificate.

Any ideas?

onedr0p commented 8 months ago

Maybe give it another shot when you have a moment? Not sure what happened here, to be honest. It could be a ton of different issues, from misconfig to network issues to anything else really :/

The important bits of the config that can really go wrong if not set right are the network and disk selectors.

DavidIlie commented 8 months ago

Disk selectors work, I believe: data is being written to the disk, and the network is working, I believe, on all the nodes.

The error is just the "tls: internal error" every time the masters try to fetch something from their own localhost IP

DavidIlie commented 8 months ago

The bootstrap first begins with these errors in the console

But I believe that's when the nodes get rebooted, as it then boots and continues until kubelet is healthy, but the error is back:

And then my terminal tries to connect to the VIP and nothing happens

onedr0p commented 8 months ago

I saw this in your previous config (sorry this is all I have to go on from https://github.com/onedr0p/cluster-template/issues/1398#issuecomment-2029711711)

    networkInterfaces:
      - deviceSelector:
          hardwareAddr: ""

That should be the node's MAC address. Are you sure this is populated? It should be in xx:xx:xx:xx:xx:xx format and be unique per node.

https://github.com/onedr0p/cluster-template/blob/28ae26d3a8d1b3ea79f8552ae420faf93366dda5/config.sample.yaml#L57
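For illustration, a populated selector might look like the sketch below. The MAC addresses and hostnames are made-up placeholders, and the surrounding `nodes` list is an assumed layout used only to show that each node carries its own distinct value; it is not taken from the reporter's config.

```yaml
# Hypothetical fragment: hostnames and addresses are placeholders, not real values.
nodes:
  - hostname: "node-1"
    networkInterfaces:
      - deviceSelector:
          # Quoted so the value is always read as a string.
          hardwareAddr: "aa:bb:cc:dd:ee:01"
  - hostname: "node-2"
    networkInterfaces:
      - deviceSelector:
          # Must differ from every other node's address.
          hardwareAddr: "aa:bb:cc:dd:ee:02"
```

An empty string, or the same address repeated on two nodes, would leave the selector unable to pick out the intended interface.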

onedr0p commented 8 months ago

I added validation on talos_nic here to hopefully catch this for other people in the future.

DavidIlie commented 8 months ago

I already populated those; I just redacted them when I posted it here. Every single value is present.

DavidIlie commented 8 months ago

https://github.com/onedr0p/cluster-template/assets/47594764/302633b5-723c-431d-af21-d0d041ca1c81

This is a recording of what happens

onedr0p commented 8 months ago

I wonder if you need to use a different type of network selector in the Talos/talhelper config or change something in the NIC settings on the VM in Proxmox?

I just hand-held someone through the whole repo who is using bare-metal nodes, and we had success after figuring out they were not setting the correct value for talos_nic, which led me to commit validation on that.

DavidIlie commented 8 months ago

Do you have an example of what I would need to do?

onedr0p commented 8 months ago

I am probably not the best person to ask about that as I do not use any hypervisors in my life right now 😄

Maybe a good start is to review the Talos Proxmox docs and see if everything lines up there and with the rendered config here.

onedr0p commented 8 months ago

Keep in mind there are a bunch of different network selectors you can use, so maybe the MAC address is not the best one with PVE? I dunno.
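As a rough sketch of what a different selector could look like, the fragment below matches on the kernel driver rather than the MAC address. The `driver` and `busPath` values are assumptions for a VirtIO NIC and have not been tested against this setup; the actual values for a node should be read from the node itself (for example with `talosctl get links -o yaml`) before relying on them.

```yaml
# Hypothetical fragment: selector values are placeholders to adapt, not tested settings.
networkInterfaces:
  - deviceSelector:
      # Match by kernel driver (wildcards are allowed) instead of by MAC address.
      driver: "virtio*"
  # Alternative, shown commented out: match by bus path prefix.
  # - deviceSelector:
  #     busPath: "00:*"
```

If a VM has more than one NIC using the same driver, a driver-only selector can match several interfaces, so a narrower selector may be needed in that case.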