Restarted node cannot be added to cluster again

Sapp00 commented 1 year ago

Describe the bug: If I remove one node, it can no longer rejoin the etcd cluster, so the data is no longer accessible. The only fix I found was deleting the cluster and reapplying it.

To Reproduce: Restart one node or reapply the OS.

Expected behavior: The node should be available again. I did not find any information in the docs on how to fix this issue.

Screenshots

etcd log:

Defaulted container "etcd" out of: etcd, volume-permissions (init)
etcd 10:31:38.91
etcd 10:31:38.91 Welcome to the Bitnami etcd container
etcd 10:31:38.92 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-etcd
etcd 10:31:38.92 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-etcd/issues
etcd 10:31:38.93
etcd 10:31:38.94 INFO  ==> ** Starting etcd setup **
etcd 10:31:38.99 INFO  ==> Validating settings in ETCD_* env vars..
etcd 10:31:39.00 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 10:31:39.01 INFO  ==> Initializing etcd
etcd 10:31:39.04 INFO  ==> Detected data from previous deployments
etcd 10:31:49.59 INFO  ==> Member ID wasn't properly stored, the member will try to join the cluster by it's own
etcd 10:31:49.61 INFO  ==> ** etcd setup finished! **

etcd 10:31:49.67 INFO  ==> ** Starting etcd **
2023-02-24 10:31:49.732314 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2379
2023-02-24 10:31:49.732558 I | pkg/flags: recognized and used environment variable ETCD_AUTO_TLS=false
2023-02-24 10:31:49.732606 I | pkg/flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=false
2023-02-24 10:31:49.732673 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/bitnami/etcd/data
2023-02-24 10:31:49.733275 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
2023-02-24 10:31:49.733393 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=mayastor-etcd-0=http://mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-1=http://mayastor-etcd-1.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-2=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
2023-02-24 10:31:49.733428 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2023-02-24 10:31:49.733785 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-k8s
2023-02-24 10:31:49.733921 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
2023-02-24 10:31:49.733968 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2023-02-24 10:31:49.733999 I | pkg/flags: recognized and used environment variable ETCD_LOG_LEVEL=info
2023-02-24 10:31:49.734381 I | pkg/flags: recognized and used environment variable ETCD_NAME=mayastor-etcd-2
2023-02-24 10:31:49.734711 I | pkg/flags: recognized and used environment variable ETCD_PEER_AUTO_TLS=false
2023-02-24 10:31:49.735091 W | pkg/flags: unrecognized environment variable ETCD_TRUSTED_CA_FILE=
2023-02-24 10:31:49.735212 W | pkg/flags: unrecognized environment variable ETCD_ON_K8S=yes
2023-02-24 10:31:49.735240 W | pkg/flags: unrecognized environment variable ETCD_SNAPSHOTS_DIR=/snapshots
2023-02-24 10:31:49.735546 W | pkg/flags: unrecognized environment variable ETCD_BIN_DIR=/opt/bitnami/etcd/sbin
2023-02-24 10:31:49.735659 W | pkg/flags: unrecognized environment variable ETCD_VOLUME_DIR=/bitnami/etcd
2023-02-24 10:31:49.735712 W | pkg/flags: unrecognized environment variable ETCD_ROOT_PASSWORD=
2023-02-24 10:31:49.735991 W | pkg/flags: unrecognized environment variable ETCD_CLUSTER_DOMAIN=mayastor-etcd-headless.mayastor.svc.cluster.local
2023-02-24 10:31:49.736101 W | pkg/flags: unrecognized environment variable ETCD_DISASTER_RECOVERY=no
2023-02-24 10:31:49.736131 W | pkg/flags: unrecognized environment variable ETCD_KEY_FILE=
2023-02-24 10:31:49.736440 W | pkg/flags: unrecognized environment variable ETCD_DAEMON_GROUP=etcd
2023-02-24 10:31:49.736550 W | pkg/flags: unrecognized environment variable ETCD_START_FROM_SNAPSHOT=no
2023-02-24 10:31:49.736584 W | pkg/flags: unrecognized environment variable ETCD_INIT_SNAPSHOT_FILENAME=
2023-02-24 10:31:49.736896 W | pkg/flags: unrecognized environment variable ETCD_INIT_SNAPSHOTS_DIR=/init-snapshot
2023-02-24 10:31:49.737011 W | pkg/flags: unrecognized environment variable ETCD_BASE_DIR=/opt/bitnami/etcd
2023-02-24 10:31:49.737328 W | pkg/flags: unrecognized environment variable ETCD_CERT_FILE=
2023-02-24 10:31:49.737440 W | pkg/flags: unrecognized environment variable ETCD_NEW_MEMBERS_ENV_FILE=/bitnami/etcd/data/new_member_envs
2023-02-24 10:31:49.737471 W | pkg/flags: unrecognized environment variable ETCD_DAEMON_USER=etcd
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-02-24 10:31:49.737949 I | etcdmain: etcd Version: 3.4.15
2023-02-24 10:31:49.737976 I | etcdmain: Git SHA: aa7126864
2023-02-24 10:31:49.737999 I | etcdmain: Go Version: go1.12.17
2023-02-24 10:31:49.738471 I | etcdmain: Go OS/Arch: linux/amd64
2023-02-24 10:31:49.738577 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2023-02-24 10:31:49.739185 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-02-24 10:31:49.740650 I | embed: name = mayastor-etcd-2
2023-02-24 10:31:49.740757 I | embed: data dir = /bitnami/etcd/data
2023-02-24 10:31:49.740785 I | embed: member dir = /bitnami/etcd/data/member
2023-02-24 10:31:49.740809 I | embed: heartbeat = 100ms
2023-02-24 10:31:49.740832 I | embed: election = 1000ms
2023-02-24 10:31:49.740855 I | embed: snapshot count = 100000
2023-02-24 10:31:49.740898 I | embed: advertise client URLs = http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2379
2023-02-24 10:31:49.741407 I | embed: initial advertise peer URLs = http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
2023-02-24 10:31:49.741527 I | embed: initial cluster =
2023-02-24 10:31:49.746091 I | etcdserver: restarting member 946609bde8186189 in cluster ab0688bf84af917d at commit index 3
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=()
raft2023/02/24 10:31:49 INFO: 946609bde8186189 became follower at term 1
raft2023/02/24 10:31:49 INFO: newRaft 946609bde8186189 [peers: [], term: 1, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
2023-02-24 10:31:49.750237 W | auth: simple token is not cryptographically signed
2023-02-24 10:31:49.762242 I | etcdserver: starting server... [version: 3.4.15, cluster version: to_be_decided]
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=(5200617686277819413)
2023-02-24 10:31:49.769346 I | etcdserver/membership: added member 482c4e3b4b340c15 [http://mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local:2380] to cluster ab0688bf84af917d
2023-02-24 10:31:49.769447 I | rafthttp: starting peer 482c4e3b4b340c15...
2023-02-24 10:31:49.769562 I | rafthttp: started HTTP pipelining with peer 482c4e3b4b340c15
2023-02-24 10:31:49.775541 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.781063 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.782623 I | rafthttp: started peer 482c4e3b4b340c15
2023-02-24 10:31:49.783022 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (stream MsgApp v2 reader)
2023-02-24 10:31:49.783415 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (stream Message reader)
2023-02-24 10:31:49.784072 I | rafthttp: added peer 482c4e3b4b340c15
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=(5200617686277819413 7753051432874073703)
2023-02-24 10:31:49.785183 I | etcdserver/membership: added member 6b985f3365cade67 [http://mayastor-etcd-1.mayastor-etcd-headless.mayastor.svc.cluster.local:2380] to cluster ab0688bf84af917d
2023-02-24 10:31:49.785459 I | rafthttp: starting peer 6b985f3365cade67...
2023-02-24 10:31:49.788107 I | rafthttp: started HTTP pipelining with peer 6b985f3365cade67
2023-02-24 10:31:49.796647 I | rafthttp: started streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.803560 I | rafthttp: started peer 6b985f3365cade67
2023-02-24 10:31:49.804164 I | rafthttp: added peer 6b985f3365cade67
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=(5200617686277819413 7753051432874073703 10693245076485202313)
2023-02-24 10:31:49.809404 I | etcdserver/membership: added member 946609bde8186189 [http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380] to cluster ab0688bf84af917d
2023-02-24 10:31:49.809806 I | embed: listening for peers on [::]:2380
2023-02-24 10:31:49.810204 I | rafthttp: started streaming with peer 6b985f3365cade67 (stream MsgApp v2 reader)
2023-02-24 10:31:49.812135 I | rafthttp: started streaming with peer 6b985f3365cade67 (stream Message reader)
2023-02-24 10:31:49.815422 I | rafthttp: started streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.820103 E | etcdserver: the member has been permanently removed from the cluster
2023-02-24 10:31:49.820380 I | etcdserver: the data-dir used by this member must be removed.
2023-02-24 10:31:49.820698 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.821046 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.821369 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.823356 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.824994 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.825532 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.825980 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.826420 I | etcdserver: aborting publish because server is stopped
2023-02-24 10:31:49.826896 I | rafthttp: stopping peer 482c4e3b4b340c15...
2023-02-24 10:31:49.827779 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.828244 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.828607 I | rafthttp: stopped HTTP pipelining with peer 482c4e3b4b340c15
2023-02-24 10:31:49.829644 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (stream MsgApp v2 reader)
2023-02-24 10:31:49.829754 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (stream Message reader)
2023-02-24 10:31:49.829794 I | rafthttp: stopped peer 482c4e3b4b340c15
2023-02-24 10:31:49.829825 I | rafthttp: stopping peer 6b985f3365cade67...
2023-02-24 10:31:49.829864 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.829904 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.829968 I | rafthttp: stopped HTTP pipelining with peer 6b985f3365cade67
2023-02-24 10:31:49.830022 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (stream MsgApp v2 reader)
2023-02-24 10:31:49.830123 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (stream Message reader)
2023-02-24 10:31:49.830158 I | rafthttp: stopped peer 6b985f3365cade67
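
A note for reading the log above: the raft lines print voter IDs in decimal, while the etcdserver and rafthttp lines print member IDs in hex, so the two are easy to mismatch by eye. The following is a minimal Python sketch for lining them up. The function name `voter_ids_to_hex` is made up for illustration, and the sample line is copied from the log.

```python
import re


def voter_ids_to_hex(raft_line):
    """Extract the decimal voter IDs from a raft 'voters=(...)' line as hex strings."""
    match = re.search(r"voters=\(([^)]*)\)", raft_line)
    if match is None:
        return []
    # IDs inside the parentheses are separated by whitespace; an empty list is "()".
    return [format(int(token), "x") for token in match.group(1).split()]


sample = (
    "raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration "
    "voters=(5200617686277819413 7753051432874073703 10693245076485202313)"
)
print(voter_ids_to_hex(sample))
```

The first ID converts to 482c4e3b4b340c15 (mayastor-etcd-0 in the log) and the last to 946609bde8186189, the local member named in the "restarting member" line.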

OS info (please complete the following information):

Distro: Talos v1.3.5
Kernel version: 5.19.94
MayaStor revision or container image: v1.0.5

tiagolobocastro commented 10 months ago

Sorry, this slipped through the cracks. Did you make any progress here?

ryayon commented 9 months ago

Hello, I had the same kind of issue with Talos Linux: each upgrade (of Talos itself) left the newly upgraded node unable to join, or even be aware of, the 2 remaining nodes (where etcd is installed). After some investigation, I found out that the etcd data was stored on /var, which is wiped unless you upgrade Talos with a certain parameter. Because I want /var to still be cleaned up, I now store it (same for Loki) under /opt. I can see there is nothing installed under /var in your case, but maybe this can help your investigation.
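
To check whether a given data directory would be affected by such a wipe, a quick test is whether its path sits under the wiped prefix. This is only a sketch: `is_under` is a hypothetical helper, `/opt/etcd` is an example path rather than one taken from this thread, and the check is purely textual (absolute POSIX paths, no symlink or mount resolution).

```python
import posixpath


def is_under(path, prefix):
    """Return True if `path` equals `prefix` or lies beneath it (absolute POSIX paths)."""
    path = posixpath.normpath(path)
    prefix = posixpath.normpath(prefix)
    # commonpath compares whole path components, so "/variable" is not under "/var".
    return posixpath.commonpath([path, prefix]) == prefix


print(is_under("/var/lib/etcd", "/var"))   # True
print(is_under("/opt/etcd", "/var"))       # False
print(is_under("/variable/etcd", "/var"))  # False
```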

anthosz commented 2 months ago

Hi,

I just reproduced more or less the same issue:

2 nodes upgraded from 1.7.6 -> 1.8.0: OK

The third one is broken:

10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.809717Z","caller":"etcdserver/server.go:858","msg":"starting etcd server","local-member-id":"XXXX","local-server-version":"3.5.13","cluster-id":"XXXX","cluster-version":"3.5"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.810252Z","caller":"etcdserver/server.go:767","msg":"starting initial election tick advance","election-ticks":10}
10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810302Z","caller":"etcdserver/server.go:1140","msg":"server error","error":"the member has been permanently removed from the cluster"}
10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810344Z","caller":"etcdserver/server.go:1141","msg":"data-dir used by this member must be removed"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.810744Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.811092Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.811403Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
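
The same error text appears in both log formats in this thread: the older plain-text lines from etcd 3.4.15 and the JSON lines above. Below is a minimal Python sketch for pulling it out of a mixed log. The function name `removed_member_events` is hypothetical, and the two sample lines are copied from the logs in this issue.

```python
import json

REMOVED = "the member has been permanently removed from the cluster"


def removed_member_events(lines):
    """Yield (timestamp, message) for each line reporting that the member was removed."""
    for line in lines:
        if REMOVED not in line:
            continue
        brace = line.find("{")
        if brace != -1:
            try:
                record = json.loads(line[brace:])
            except ValueError:
                record = None  # not JSON after all; fall back to plain-text parsing
            if isinstance(record, dict):
                yield record.get("ts", ""), record.get("error", record.get("msg", ""))
                continue
        # Plain-text form: "<date> <time> E | <package>: <message>"
        stamp, sep, rest = line.partition(" E | ")
        if sep:
            yield stamp, rest
        else:
            yield "", line


json_line = (
    '10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810302Z",'
    '"caller":"etcdserver/server.go:1140","msg":"server error",'
    '"error":"the member has been permanently removed from the cluster"}'
)
text_line = (
    "2023-02-24 10:31:49.820103 E | etcdserver: "
    "the member has been permanently removed from the cluster"
)

for stamp, message in removed_member_events([json_line, text_line]):
    print(stamp, "|", message)
```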

How can I reset the etcd data of this node without resetting the node itself?

tiagolobocastro commented 1 month ago

I wonder if this https://github.com/openebs/mayastor-extensions/pull/536 can help here? @datacore-tilangovan could you please take a look here?

openebs / mayastor

Restarted node cannot be added to cluster again #1326