Open Sapp00 opened 1 year ago
Sorry, this slipped through the cracks. Did you make any progress here?
Hello,
I had the same kind of issue with Talos Linux where each upgrade (of Talos itself) made the newly upgraded node unable to join or be aware of the 2 remaining nodes (where etcd is installed).
After some investigation, I found out that the etcd data was stored under /var, which is wiped if you don't upgrade Talos with a certain parameter.
Because I want /var to still be cleaned up, I now store it (same for loki) under /opt.
I can see there is nothing installed under /var in your case, but maybe this can help in your investigation.
Hi,
I just reproduced more or less the same issue:
2 nodes upgraded from 1.7.6 -> 1.8.0: OK
The third one is broken:
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.809717Z","caller":"etcdserver/server.go:858","msg":"starting etcd server","local-member-id":"XXXX","local-server-version":"3.5.13","cluster-id":"XXXX","cluster-version":"3.5"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.810252Z","caller":"etcdserver/server.go:767","msg":"starting initial election tick advance","election-ticks":10}
10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810302Z","caller":"etcdserver/server.go:1140","msg":"server error","error":"the member has been permanently removed from the cluster"}
10.0.0.26: {"level":"warn","ts":"2024-09-28T17:17:03.810344Z","caller":"etcdserver/server.go:1141","msg":"data-dir used by this member must be removed"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.810744Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.811092Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
10.0.0.26: {"level":"info","ts":"2024-09-28T17:17:03.811403Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/var/lib/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
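The log lines above are JSON records prefixed with the node address. As a minimal sketch for triaging a longer capture, the following picks out the nodes whose etcd reports that it was removed from the cluster. The function names are hypothetical, and treating these two messages as the signal of a removed member is an assumption based only on the excerpt above.

```python
import json

# Messages copied from the log excerpt above. Treating them as the
# signal of a removed member is an assumption, not documented behavior.
REMOVED_MARKERS = (
    "the member has been permanently removed from the cluster",
    "data-dir used by this member must be removed",
)


def parse_line(line):
    """Split '<node>: <json>' into (node, record); return None on no match."""
    node, sep, payload = line.partition(": ")
    if not sep:
        return None
    try:
        record = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    return node, record


def removed_member_nodes(lines):
    """Return the sorted list of nodes whose logs contain a removal marker."""
    nodes = set()
    for line in lines:
        parsed = parse_line(line.strip())
        if parsed is None:
            continue
        node, record = parsed
        # The marker may appear in either the "error" or the "msg" field.
        fields = (str(record.get("error", "")), str(record.get("msg", "")))
        if any(marker in field for marker in REMOVED_MARKERS for field in fields):
            nodes.add(node)
    return sorted(nodes)
```

Feeding it the saved log lines, however they were collected, returns the addresses of the affected nodes; lines that are not in the `<node>: <json>` shape are skipped.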
How can I reset the etcd data of this node without resetting the node itself?
I wonder if this https://github.com/openebs/mayastor-extensions/pull/536 can help here? @datacore-tilangovan could you please take a look here?
Describe the bug
If I remove one node, it cannot rejoin the etcd cluster, so the data is no longer accessible. The only fix was deleting the cluster and reapplying it.
To Reproduce
Restart one node or reapply the OS.
Expected behavior
The node should be available again. I did not find any further information in the docs on how to fix this issue.
Screenshots
etcd log:
OS info (please complete the following information):