Closed: DavidIlie closed this 7 months ago
I think this is expected due to etcd not being bootstrapped?
It doesn't seem like the above was run. What point in the task commands did you get to, and were there any errors on the client side?
In addition to @onedr0p's comments, when I tried to install Talos manually, it took some time for the master to be ready. In both of the screenshots, uptime is only a few minutes so maybe it hasn't finished bootstrapping yet. I went back to k3s in the meantime, but please update us on your Talos install progress via this template and I may take another stab at it when time permits. Good luck!
I left it running the whole night yesterday and the same thing happened. I am also fairly sure all the scripts are running. Before the node first reboots/loads, there are errors about something like an "admin" certificate.
Any ideas?
Maybe give it another shot when you have a moment? Not sure what happened here, to be honest. It could be a ton of different issues, from misconfiguration to network problems to anything else really :/
The important bits of the config that can really go wrong if not set right are the network and disk selectors.
The disk selectors work, I believe: data is being written to the disk, and as far as I can tell the network is working on all the nodes.
The error is just the "tls: internal error" every time the masters try to fetch something from their own localhost IP
The bootstrap first begins with these errors in the console
But I believe that's when the nodes get rebooted, because the node then boots and continues until kubelet is healthy, but the error comes back:
And then my terminal tries to connect to the VIP and nothing happens
I saw this in your previous config (sorry this is all I have to go on from https://github.com/onedr0p/cluster-template/issues/1398#issuecomment-2029711711)
networkInterfaces:
  - deviceSelector:
      hardwareAddr: ""
That should be the node's MAC address. Are you sure this is populated? It should be in xx:xx:xx:xx:xx:xx format and be unique per node.
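If it helps, here is a minimal sketch of that check: every address has to match the xx:xx:xx:xx:xx:xx format and no two nodes may share one. This is not the template's actual validation; the function name `check_nic_addresses` and the node-name-to-MAC mapping are hypothetical and only here for illustration.

```python
import re
from collections import Counter

# Six colon-separated hex octets, e.g. aa:bb:cc:dd:ee:ff (compared in lowercase).
MAC_PATTERN = re.compile(r"[0-9a-f]{2}(:[0-9a-f]{2}){5}")


def check_nic_addresses(nodes):
    """Return a list of problems found; an empty list means every value looks fine.

    `nodes` is a hypothetical mapping of node name -> MAC address string.
    """
    problems = []
    valid = {}
    for name, mac in nodes.items():
        value = (mac or "").strip().lower()
        if MAC_PATTERN.fullmatch(value):
            valid[name] = value
        else:
            problems.append(f"{name}: {mac!r} is not in xx:xx:xx:xx:xx:xx format")

    # A MAC address must be unique per node, so flag every node sharing one.
    counts = Counter(valid.values())
    for name, value in valid.items():
        if counts[value] > 1:
            problems.append(f"{name}: {value} is used by more than one node")
    return problems
```

Calling it with an empty string for one node, or the same address for two nodes, returns one message per offending node, so a redacted or copy-pasted value shows up right away.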
I added validation on talos_nic here to hopefully catch this for other people in the future.
I already populated those; I just redacted them when I posted the config here. Every single value is present.
https://github.com/onedr0p/cluster-template/assets/47594764/302633b5-723c-431d-af21-d0d041ca1c81
This is a recording of what happens
I wonder if you need to use a different type of network selector in the Talos/talhelper config or change something in the NIC settings on the VM in Proxmox?
I just hand-held someone through the whole repo who is using bare-metal nodes, and we had success after figuring out they were not setting the correct value for talos_nic, which led me to commit validation on that.
Do you have an example of what I would need to do?
I am probably not the best person to ask about that as I do not use any hypervisors in my life right now 😄
Maybe a good start is to review the Talos Proxmox docs and see if everything lines up there and with the rendered config here.
Keep in mind there are a bunch of different network selectors you can use, so maybe the MAC address is not the best choice with PVE? I dunno.
Following the pathway to install Talos, I have an issue where the cluster does seem to set up, but the workers do not join and the masters show these errors:
"error refreshing pod status", where the error is related to TLS (tls: internal error), plus "controller failed" errors too.
This is what I see in my terminal while setting it up.