You might need to check the etcd logs on all nodes to see what is happening: inspect talosctl etcd members for sanity and talosctl logs etcd. To retry joining, you can do talosctl reset -n IP --system-labels-to-wipe=EPHEMERAL --reboot=true --graceful=false.
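For reference, put together as a sequence it would look roughly like this; IP stands for the failing node and <healthy-cp-ip> for any healthy control plane node, and wiping only EPHEMERAL keeps the machine config on the STATE partition:
talosctl -n <healthy-cp-ip> etcd members
talosctl -n IP logs etcd
talosctl -n IP reset --system-labels-to-wipe=EPHEMERAL --reboot=true --graceful=false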
are you upgrading all controlplane nodes at once?
<ipv4-address>: user: warning: [2022-06-22T16:18:53.745298313Z]: [talos] task startAllServices (1/1): service "etcd" to be "up"
<ipv4-address>: user: warning: [2022-06-22T16:19:08.745172313Z]: [talos] task startAllServices (1/1): service "etcd" to be "up"
<ipv4-address>: user: warning: [2022-06-22T16:19:18.781830313Z]: [talos] failed promoting member: 3 error(s) occurred:
<ipv4-address>: user: warning: [2022-06-22T16:19:18.789519313Z]: etcdserver: can only promote a learner member which is in sync with leader
<ipv4-address>: user: warning: [2022-06-22T16:19:18.794144313Z]: etcdserver: rpc not supported for learner
<ipv4-address>: user: warning: [2022-06-22T16:19:18.796576313Z]: timeout
Also, I see this in the logs; it seems it cannot connect to the other etcd members, possibly some network issues.
I was upgrading a single control-plane node with talosctl -n <ip> upgrade --image ghcr.io/siderolabs/installer:v1.1.0.
But yeah, a network problem certainly seems possible, though I wouldn't really know where it's coming from. I closely followed the official guide for hcloud/Hetzner Cloud and deployed the nodes with a load balancer. The only difference is probably that I disabled the CNI, because we use Cilium in strict mode.
I was upgrading a single control-plane node with talosctl -n <ip> upgrade --image ghcr.io/siderolabs/installer:v1.1.0.
The CLI shouldn't allow you to upgrade a single control plane node unless --preserve is set; did you by any chance use --force?
So what happened here is that the upgrade process wiped the ephemeral data, but since there were no other control plane nodes to re-create the etcd data from, it failed.
hmmm... it looks like etcd joined something according to the logs
so I assumed there's something running
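For context, if a cluster has only a single control plane node, the upgrade has to keep the etcd data on disk, so the command would need the --preserve flag, roughly like this (with <ip> as a placeholder):
talosctl -n <ip> upgrade --image ghcr.io/siderolabs/installer:v1.1.0 --preserve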
The CLI shouldn't allow you to upgrade a single control plane node unless --preserve is set; did you by any chance use --force?
No, I did not use --force. In fact, I copied the exact command that I executed, but that's good to know. My talosconfig has one "endpoint" and four "nodes", like:
endpoint:
- 'controlplane-ip1'
nodes:
- 'worker-ip1'
- 'controlplane-ip1'
- 'controlplane-ip2'
- 'controlplane-ip3'
Maybe that is wrong? I think the docs are a little bit thin regarding adding more nodes than the initial control plane node to the talosconfig.
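For comparison, my understanding is that a talosconfig generated by talosctl gen config is structured roughly like this (certificates trimmed, addresses are placeholders), with the endpoints and nodes lists living under a named context:
context: my-cluster
contexts:
  my-cluster:
    endpoints:
      - controlplane-ip1
    nodes:
      - worker-ip1
      - controlplane-ip1
      - controlplane-ip2
      - controlplane-ip3
    ca: <base64-encoded CA>
    crt: <base64-encoded client certificate>
    key: <base64-encoded client key>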
Also, the docs regarding the upgrade are a little bit misleading in my opinion. If I remember correctly, the control plane nodes are also upgraded individually in the explanation video, and the command example also shows it with the --nodes <ip> flag. And the thing that ultimately made me upgrade the one control plane node individually was the last answer of the FAQ:
Q. Can I break my cluster by upgrading everything at once?
A. Maybe - it’s not recommended.
Nothing prevents the user from sending near-simultaneous upgrades to each node of the cluster - and while Talos Linux and Kubernetes can generally deal with this situation, other components of the cluster may not be able to recover from more than one node rebooting at a time. (e.g. any software that maintains a quorum or state across nodes, such as Rook/Ceph)
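The per-node flow I understood from the docs is to upgrade one control plane node, wait until it reports healthy again, and only then move on to the next one, e.g. something like this (addresses are placeholders):
talosctl -n controlplane-ip1 upgrade --image ghcr.io/siderolabs/installer:v1.1.0
talosctl -n controlplane-ip1 health
# repeat for controlplane-ip2 and controlplane-ip3 once the previous node is healthy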
I was upgrading a single control-plane node with talosctl -n <ip> upgrade --image ghcr.io/siderolabs/installer:v1.1.0.
apologies, I mistakenly assumed from the above message you had a single control plane node.
What does talosctl logs etcd on the node that failed to upgrade show?
Sadly, I've redeployed the server already, but I tried to reproduce the problem with three new control plane nodes and it seems like there is a broader problem. I am now unable to get the third server to join the etcd cluster (even without the upgrade). The servers reused the same IP addresses as the previous cluster – maybe that is a problem with the Talos Discovery Service and a long TTL?
The server seems to have a problem connecting to the Talos Discovery Service: talos-no-connection-log.txt (it is a v1.0.6 server and I've deployed it exactly as described in the official docs for Hetzner Cloud).
The only real problem I see in the log is that your control plane endpoint doesn't work.
maybe that is a problem with the Talos Discovery Service and a long ttl
The default TTL is 30 minutes; it's recommended to create a new set of machine configs for a new cluster and not re-use the existing ones.
Even with the Discovery Service TTL, it doesn't make much sense.
During the upgrade, a Talos control plane node leaves the etcd cluster, performs the upgrade, reboots, and rejoins etcd.
In order to rejoin etcd, Talos 1.0 requires a connection to the Kubernetes control plane endpoint; Talos 1.1+ can also use discovery service data.
From your logs it seems that the control plane endpoint is not reachable; I wonder if that might be related.
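A quick way to sanity-check the control plane endpoint from the outside (assuming the default port 6443 and <endpoint> being the load balancer address) is to hit it directly; even a 401/403 response proves the endpoint itself is reachable:
curl -k https://<endpoint>:6443/version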
I have the exact same problem upgrading the last node in my cluster from 1.0.6 to 1.1.0.
talos-cp-1: user: warning: [2022-06-30T08:28:12.975244856Z]: [talos] service[etcd](Running): Health check failed: etcdserver: rpc not supported for learner
talos-cp-1: user: warning: [2022-06-30T08:28:18.529949856Z]: [talos] task startAllServices (1/1): service "etcd" to be "up"
talos-cp-1: user: warning: [2022-06-30T08:28:33.501711856Z]: [talos] task startAllServices (1/1): service "etcd" to be "up"
talos-cp-1: user: warning: [2022-06-30T08:28:48.493170856Z]: [talos] task startAllServices (1/1): service "etcd" to be "up"
Here's the output of talosctl logs etcd for the corresponding node:
From the log above it looks like etcd is copying the data from the other nodes, and it was still doing that after 30 seconds, which is completely normal. etcd should become healthy and promote itself to a full member.
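To watch that happen, you can keep polling the member list from a healthy control plane node (the address is a placeholder); depending on the talosctl version the output also indicates whether a member is still a learner:
talosctl -n <healthy-cp-ip> etcd members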
The health check is still reporting as failed on this node:
NODE SERVICE STATE HEALTH LAST CHANGE LAST EVENT
talos-cp-1 apid Running OK 13m28s ago Health check successful
talos-cp-1 containerd Running OK 13m34s ago Health check successful
talos-cp-1 cri Running OK 13m32s ago Health check successful
talos-cp-1 etcd Running Fail 13m24s ago Health check failed: etcdserver: rpc not supported for learner
talos-cp-1 kubelet Running OK 13m30s ago Health check successful
talos-cp-1 machined Running ? 13m40s ago Service started as goroutine
talos-cp-1 trustd Running OK 13m32s ago Health check successful
talos-cp-1 udevd Running OK 13m34s ago Health check successful
so the root cause I guess is whatever makes etcd so slow to join the cluster.... (probably?)
I don't have any specific idea about it. What you could do is attempt to re-join the cluster with:
talosctl reset --system-labels-to-wipe=STATE,EPHEMERAL --reboot --graceful=false
If you could capture the full dmesg after the reboot, that would be perfect.
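Something along these lines, with <ip> standing for the broken node; note that wiping STATE also removes the machine config, so it has to be re-applied afterwards:
talosctl -n <ip> reset --system-labels-to-wipe=STATE,EPHEMERAL --reboot --graceful=false
talosctl -n <ip> dmesg -f    # re-run once the node is back up if the stream drops during the reboot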
Okay, I did that, here is the dmesg output:
And here's the log output of etcd (which is now panicking in a restart loop):
Not sure what exactly is wrong, but there's some membership issue with etcd.
You might need to check talosctl etcd members on the healthy control plane node.
According to the etcd log, it was started with two other peers, which seems correct, but after that something went wrong on the etcd side.
I manually removed the node from etcd and reset it again (with --graceful=true this time). After applying the config again, it rejoined correctly.
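For anyone hitting the same thing, the rough shape of that is (addresses and hostnames are placeholders; remove-member is pointed at a healthy member and takes the broken member's hostname):
talosctl -n <healthy-cp-ip> etcd remove-member <broken-node-hostname>
talosctl -n <broken-node-ip> reset --graceful=true --reboot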
This issue hasn't occurred for quite some time for me, and I think it was originally caused by a missing port in cluster.controlPlane.endpoint. After appending the correct port (6443 by default), everything worked fine. So I'll close this.
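For reference, the relevant bit of the machine config with the port included looks roughly like this (the address is a placeholder for the load balancer):
cluster:
  controlPlane:
    endpoint: https://<load-balancer-address>:6443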
Bug Report
Description
When I tried to upgrade a control plane node from version 1.0.6 to 1.1.0 (the same thing also happened during the upgrade from 1.0.5 to 1.0.6), the control plane node could not re-enter the cluster. I was able to observe the state of the cluster with talosctl dmesg -f during the upgrade, and I've attached it below. The only components that are installed on the cluster are Cilium (in strict mode), rook-ceph, and rook-ceph-cluster. All of them are installed through Helm.
Configuration
This is the configuration for the control plane nodes (without the secrets, of course):
I also tried deploying a new server without the explicit version pins (except for the installer version), but that also got stuck at the etcd stage. During the upgrade from 1.0.5 to 1.0.6 I was only able to "fix" it by deploying a completely new cluster (and then freshly bootstrapping etcd).
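(For completeness, bootstrapping etcd on a fresh cluster is a single call against the first control plane node, with <first-cp-ip> as a placeholder:)
talosctl -n <first-cp-ip> bootstrap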
Logs
The log from the upgrade: talos_log.txt
Environment