openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that is provisioned from an optimized NVME SPDK backend data storage stack.
Apache License 2.0

Restarted node cannot be added to cluster again #1326

Open Sapp00 opened 1 year ago

Sapp00 commented 1 year ago

Describe the bug: If I remove one node, it cannot rejoin the etcd cluster afterwards, so the data is no longer accessible. The only fix I found was deleting the cluster and reapplying it.

To Reproduce: Restart one node or reapply its OS.

Expected behavior: The node should become available again. I did not find anything in the docs on how to fix this issue.

Screenshots

etcd log:

Defaulted container "etcd" out of: etcd, volume-permissions (init)
etcd 10:31:38.91
etcd 10:31:38.91 Welcome to the Bitnami etcd container
etcd 10:31:38.92 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-etcd
etcd 10:31:38.92 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-etcd/issues
etcd 10:31:38.93
etcd 10:31:38.94 INFO  ==> ** Starting etcd setup **
etcd 10:31:38.99 INFO  ==> Validating settings in ETCD_* env vars..
etcd 10:31:39.00 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 10:31:39.01 INFO  ==> Initializing etcd
etcd 10:31:39.04 INFO  ==> Detected data from previous deployments
etcd 10:31:49.59 INFO  ==> Member ID wasn't properly stored, the member will try to join the cluster by it's own
etcd 10:31:49.61 INFO  ==> ** etcd setup finished! **

etcd 10:31:49.67 INFO  ==> ** Starting etcd **
2023-02-24 10:31:49.732314 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2379
2023-02-24 10:31:49.732558 I | pkg/flags: recognized and used environment variable ETCD_AUTO_TLS=false
2023-02-24 10:31:49.732606 I | pkg/flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=false
2023-02-24 10:31:49.732673 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/bitnami/etcd/data
2023-02-24 10:31:49.733275 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
2023-02-24 10:31:49.733393 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=mayastor-etcd-0=http://mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-1=http://mayastor-etcd-1.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-2=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
2023-02-24 10:31:49.733428 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2023-02-24 10:31:49.733785 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-k8s
2023-02-24 10:31:49.733921 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
2023-02-24 10:31:49.733968 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2023-02-24 10:31:49.733999 I | pkg/flags: recognized and used environment variable ETCD_LOG_LEVEL=info
2023-02-24 10:31:49.734381 I | pkg/flags: recognized and used environment variable ETCD_NAME=mayastor-etcd-2
2023-02-24 10:31:49.734711 I | pkg/flags: recognized and used environment variable ETCD_PEER_AUTO_TLS=false
2023-02-24 10:31:49.735091 W | pkg/flags: unrecognized environment variable ETCD_TRUSTED_CA_FILE=
2023-02-24 10:31:49.735212 W | pkg/flags: unrecognized environment variable ETCD_ON_K8S=yes
2023-02-24 10:31:49.735240 W | pkg/flags: unrecognized environment variable ETCD_SNAPSHOTS_DIR=/snapshots
2023-02-24 10:31:49.735546 W | pkg/flags: unrecognized environment variable ETCD_BIN_DIR=/opt/bitnami/etcd/sbin
2023-02-24 10:31:49.735659 W | pkg/flags: unrecognized environment variable ETCD_VOLUME_DIR=/bitnami/etcd
2023-02-24 10:31:49.735712 W | pkg/flags: unrecognized environment variable ETCD_ROOT_PASSWORD=
2023-02-24 10:31:49.735991 W | pkg/flags: unrecognized environment variable ETCD_CLUSTER_DOMAIN=mayastor-etcd-headless.mayastor.svc.cluster.local
2023-02-24 10:31:49.736101 W | pkg/flags: unrecognized environment variable ETCD_DISASTER_RECOVERY=no
2023-02-24 10:31:49.736131 W | pkg/flags: unrecognized environment variable ETCD_KEY_FILE=
2023-02-24 10:31:49.736440 W | pkg/flags: unrecognized environment variable ETCD_DAEMON_GROUP=etcd
2023-02-24 10:31:49.736550 W | pkg/flags: unrecognized environment variable ETCD_START_FROM_SNAPSHOT=no
2023-02-24 10:31:49.736584 W | pkg/flags: unrecognized environment variable ETCD_INIT_SNAPSHOT_FILENAME=
2023-02-24 10:31:49.736896 W | pkg/flags: unrecognized environment variable ETCD_INIT_SNAPSHOTS_DIR=/init-snapshot
2023-02-24 10:31:49.737011 W | pkg/flags: unrecognized environment variable ETCD_BASE_DIR=/opt/bitnami/etcd
2023-02-24 10:31:49.737328 W | pkg/flags: unrecognized environment variable ETCD_CERT_FILE=
2023-02-24 10:31:49.737440 W | pkg/flags: unrecognized environment variable ETCD_NEW_MEMBERS_ENV_FILE=/bitnami/etcd/data/new_member_envs
2023-02-24 10:31:49.737471 W | pkg/flags: unrecognized environment variable ETCD_DAEMON_USER=etcd
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-02-24 10:31:49.737949 I | etcdmain: etcd Version: 3.4.15
2023-02-24 10:31:49.737976 I | etcdmain: Git SHA: aa7126864
2023-02-24 10:31:49.737999 I | etcdmain: Go Version: go1.12.17
2023-02-24 10:31:49.738471 I | etcdmain: Go OS/Arch: linux/amd64
2023-02-24 10:31:49.738577 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2023-02-24 10:31:49.739185 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2023-02-24 10:31:49.740650 I | embed: name = mayastor-etcd-2
2023-02-24 10:31:49.740757 I | embed: data dir = /bitnami/etcd/data
2023-02-24 10:31:49.740785 I | embed: member dir = /bitnami/etcd/data/member
2023-02-24 10:31:49.740809 I | embed: heartbeat = 100ms
2023-02-24 10:31:49.740832 I | embed: election = 1000ms
2023-02-24 10:31:49.740855 I | embed: snapshot count = 100000
2023-02-24 10:31:49.740898 I | embed: advertise client URLs = http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2379
2023-02-24 10:31:49.741407 I | embed: initial advertise peer URLs = http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
2023-02-24 10:31:49.741527 I | embed: initial cluster =
2023-02-24 10:31:49.746091 I | etcdserver: restarting member 946609bde8186189 in cluster ab0688bf84af917d at commit index 3
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=()
raft2023/02/24 10:31:49 INFO: 946609bde8186189 became follower at term 1
raft2023/02/24 10:31:49 INFO: newRaft 946609bde8186189 [peers: [], term: 1, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
2023-02-24 10:31:49.750237 W | auth: simple token is not cryptographically signed
2023-02-24 10:31:49.762242 I | etcdserver: starting server... [version: 3.4.15, cluster version: to_be_decided]
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=(5200617686277819413)
2023-02-24 10:31:49.769346 I | etcdserver/membership: added member 482c4e3b4b340c15 [http://mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local:2380] to cluster ab0688bf84af917d
2023-02-24 10:31:49.769447 I | rafthttp: starting peer 482c4e3b4b340c15...
2023-02-24 10:31:49.769562 I | rafthttp: started HTTP pipelining with peer 482c4e3b4b340c15
2023-02-24 10:31:49.775541 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.781063 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.782623 I | rafthttp: started peer 482c4e3b4b340c15
2023-02-24 10:31:49.783022 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (stream MsgApp v2 reader)
2023-02-24 10:31:49.783415 I | rafthttp: started streaming with peer 482c4e3b4b340c15 (stream Message reader)
2023-02-24 10:31:49.784072 I | rafthttp: added peer 482c4e3b4b340c15
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=(5200617686277819413 7753051432874073703)
2023-02-24 10:31:49.785183 I | etcdserver/membership: added member 6b985f3365cade67 [http://mayastor-etcd-1.mayastor-etcd-headless.mayastor.svc.cluster.local:2380] to cluster ab0688bf84af917d
2023-02-24 10:31:49.785459 I | rafthttp: starting peer 6b985f3365cade67...
2023-02-24 10:31:49.788107 I | rafthttp: started HTTP pipelining with peer 6b985f3365cade67
2023-02-24 10:31:49.796647 I | rafthttp: started streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.803560 I | rafthttp: started peer 6b985f3365cade67
2023-02-24 10:31:49.804164 I | rafthttp: added peer 6b985f3365cade67
raft2023/02/24 10:31:49 INFO: 946609bde8186189 switched to configuration voters=(5200617686277819413 7753051432874073703 10693245076485202313)
2023-02-24 10:31:49.809404 I | etcdserver/membership: added member 946609bde8186189 [http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380] to cluster ab0688bf84af917d
2023-02-24 10:31:49.809806 I | embed: listening for peers on [::]:2380
2023-02-24 10:31:49.810204 I | rafthttp: started streaming with peer 6b985f3365cade67 (stream MsgApp v2 reader)
2023-02-24 10:31:49.812135 I | rafthttp: started streaming with peer 6b985f3365cade67 (stream Message reader)
2023-02-24 10:31:49.815422 I | rafthttp: started streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.820103 E | etcdserver: the member has been permanently removed from the cluster
2023-02-24 10:31:49.820380 I | etcdserver: the data-dir used by this member must be removed.
2023-02-24 10:31:49.820698 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.821046 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.821369 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.823356 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.824994 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.825532 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.825980 E | etcdserver: publish error: etcdserver: request cancelled
2023-02-24 10:31:49.826420 I | etcdserver: aborting publish because server is stopped
2023-02-24 10:31:49.826896 I | rafthttp: stopping peer 482c4e3b4b340c15...
2023-02-24 10:31:49.827779 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.828244 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (writer)
2023-02-24 10:31:49.828607 I | rafthttp: stopped HTTP pipelining with peer 482c4e3b4b340c15
2023-02-24 10:31:49.829644 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (stream MsgApp v2 reader)
2023-02-24 10:31:49.829754 I | rafthttp: stopped streaming with peer 482c4e3b4b340c15 (stream Message reader)
2023-02-24 10:31:49.829794 I | rafthttp: stopped peer 482c4e3b4b340c15
2023-02-24 10:31:49.829825 I | rafthttp: stopping peer 6b985f3365cade67...
2023-02-24 10:31:49.829864 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.829904 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (writer)
2023-02-24 10:31:49.829968 I | rafthttp: stopped HTTP pipelining with peer 6b985f3365cade67
2023-02-24 10:31:49.830022 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (stream MsgApp v2 reader)
2023-02-24 10:31:49.830123 I | rafthttp: stopped streaming with peer 6b985f3365cade67 (stream Message reader)
2023-02-24 10:31:49.830158 I | rafthttp: stopped peer 6b985f3365cade67
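For anyone triaging a similar log, the two lines that matter are the "restarting member ... in cluster ..." line and the "permanently removed" error. Below is a minimal Python sketch that pulls those out of the etcd 3.4 text log format shown above. The function name `summarize_etcd_log` is hypothetical, and the patterns are copied from this log only, so they may not match other etcd versions.

```python
import re

# Patterns copied from the log above (etcd 3.4 text log format).
# Assumption: other etcd versions may word these messages differently.
RESTART_RE = re.compile(r"restarting member (\w+) in cluster (\w+)")
REMOVED_MSG = "the member has been permanently removed from the cluster"


def summarize_etcd_log(lines):
    """Return the member ID, cluster ID, and whether the member was removed."""
    summary = {"member_id": None, "cluster_id": None, "removed": False}
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            summary["member_id"], summary["cluster_id"] = match.groups()
        if REMOVED_MSG in line:
            summary["removed"] = True
    return summary


# Two lines taken verbatim from the log in this issue.
sample = [
    "2023-02-24 10:31:49.746091 I | etcdserver: restarting member "
    "946609bde8186189 in cluster ab0688bf84af917d at commit index 3",
    "2023-02-24 10:31:49.820103 E | etcdserver: the member has been "
    "permanently removed from the cluster",
]
print(summarize_etcd_log(sample))
```

If `removed` is true, etcd itself states that the data-dir used by this member must be removed before the node can come back.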

OS info (please complete the following information):

tiagolobocastro commented 10 months ago

Sorry, this slipped through the cracks. Did you make any progress here?

ryayon commented 9 months ago

Hello, I had the same kind of issue with Talos Linux: each upgrade of Talos itself left the newly upgraded node unable to join, or even see, the 2 remaining nodes where etcd is installed. After some investigation, I found that the etcd data was stored on /var, which is wiped unless you upgrade Talos with a certain parameter. Because I still want /var to be cleaned up, I now store it (and Loki's data as well) under /opt. I can see that nothing is installed under /var in your case, but maybe this helps your investigation.
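One quick way to tell whether a node's etcd state survived an upgrade is to look for the member directories on disk. Here is a minimal Python sketch, assuming the layout seen in the logs in this thread (`member/wal` and `member/snap` under the data dir). The function name `has_member_state` is hypothetical, and the demo uses a temporary directory rather than a real etcd path.

```python
import tempfile
from pathlib import Path


def has_member_state(data_dir):
    """True if both member/wal and member/snap exist under data_dir."""
    member = Path(data_dir) / "member"
    return (member / "wal").is_dir() and (member / "snap").is_dir()


# Demo on a throwaway directory: first empty (as after a wipe),
# then with the two subdirectories created.
with tempfile.TemporaryDirectory() as tmp:
    wiped = has_member_state(tmp)
    (Path(tmp) / "member" / "wal").mkdir(parents=True)
    (Path(tmp) / "member" / "snap").mkdir(parents=True)
    intact = has_member_state(tmp)

print("after wipe:", wiped, "| with member dirs:", intact)
```

On a real node you would pass the actual data directory, for example the `/bitnami/etcd/data` path from the first log in this issue.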

anthosz commented 2 months ago

Hi,

I just reproduced more or less the same issue:

2 nodes upgraded from 1.7.6 -> 1.8.0: OK

The third one is broken:

10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.809717Z","caller":"etcdserver/server.go:858","msg":"starting etcd server","local-member-id":"XXXX","local-server-version":"3.5.13","cluster-id":"XXXX","cluster-version":"3.5"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.810252Z","caller":"etcdserver/server.go:767","msg":"starting initial election tick advance","election-ticks":10}
10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810302Z","caller":"etcdserver/server.go:1140","msg":"server error","error":"the member has been permanently removed from the cluster"}
10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810344Z","caller":"etcdserver/server.go:1141","msg":"data-dir used by this member must be removed"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.810744Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.811092Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.811403Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}

How can I reset the etcd data on this node without resetting the node itself?
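For reference, the generic etcd procedure for bringing back a removed member is: check the member list from a healthy member, remove the stale entry if it is still listed, clear the old member data on the broken node, add the node again, and start it with the initial cluster state set to `existing`. The Python sketch below only builds that command list and runs nothing. The function name `rejoin_plan`, the member name `node-3`, and the peer URL are hypothetical placeholders, and whether Talos or Mayastor tooling wraps or replaces these steps is not stated in this thread.

```python
def rejoin_plan(member_id, member_name, peer_url, data_dir):
    """Build (but do not run) the generic steps to re-add an etcd member.

    The etcdctl commands are meant to be run against a healthy member.
    """
    return [
        # 1. See which members the cluster currently knows about.
        ["etcdctl", "member", "list"],
        # 2. Only needed if the stale member ID is still listed.
        ["etcdctl", "member", "remove", member_id],
        # 3. On the broken node: clear the old member state, as the
        #    "data-dir used by this member must be removed" message asks.
        ["rm", "-rf", f"{data_dir}/member"],
        # 4. Register the node again; afterwards start etcd on it with
        #    the initial cluster state set to "existing".
        ["etcdctl", "member", "add", member_name, f"--peer-urls={peer_url}"],
    ]


# Placeholder values: "XXXX" mirrors the redacted ID in the log above,
# /var/lib/etcd comes from that log; name and URL are assumptions.
plan = rejoin_plan("XXXX", "node-3", "https://10.0.0.26:2380", "/var/lib/etcd")
for step in plan:
    print(" ".join(step))
```

Printing the plan first, instead of executing it, leaves room to confirm each step against your own deployment, since step 3 deletes data.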

tiagolobocastro commented 1 month ago

I wonder if this https://github.com/openebs/mayastor-extensions/pull/536 can help here? @datacore-tilangovan could you please take a look here?