sorintlab / stolon

PostgreSQL cloud native High Availability and more.

sentinel pod no leader and restart begin OK #875

Closed jackin853 closed 2 years ago

jackin853 commented 2 years ago

What happened: At present the cluster consists of one keeper, one proxy, and one sentinel. I restarted the server today; the sentinel did not elect a leader and its status remained false. I don't know why the status stayed false. When I ran "kubectl delete" on the sentinel pod, it restarted automatically and everything was OK. I don't understand why the sentinel pod doesn't keep retrying instead of just leaving its state set to false.
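For reference, the recovery step was roughly the following (the pod name is taken from the status prompt below; add -n <namespace> if the pod is not in the default namespace):

kubectl delete pod stolon-sentinel-w9np5

Kubernetes then recreates the pod and everything is OK again.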

Inside the sentinel pod I executed stolonctl status:

root@stolon-sentinel-w9np5:/# stolonctl status --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=http://matrix-node2:2379,http://matrix-node3:2379,http://matrix-node1:2379
=== Active sentinels ===

ID      LEADER
9fb958b5    false

=== Active proxies ===

ID
40ceac30

=== Keepers ===

No keepers available

No cluster available

However, the etcd cluster had been unreachable for a while, and the sentinel pod reported the following errors. I don't know whether this is related:

2022-05-09T13:48:33.085+0800    INFO    cmd/sentinel.go:2000    sentinel uid    {"uid": "9fb958b5"}
2022-05-09T13:48:33.106+0800    INFO    cmd/sentinel.go:82  Trying to acquire sentinels leadership
2022-05-09T13:48:48.107+0800    ERROR   cmd/sentinel.go:1843    error retrieving cluster data   {"error": "client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://matrix-node3:2379 exceeded header timeout\n; error #1: client: endpoint http://matrix-node1:2379 exceeded header timeout\n; error #2: client: endpoint http://matrix-node2:2379 exceeded header timeout\n"}
2022-05-09T13:49:08.108+0800    ERROR   cmd/sentinel.go:1843    error retrieving cluster data   {"error": "client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://matrix-node3:2379 exceeded header timeout\n; error #1: client: endpoint http://matrix-node1:2379 exceeded header timeout\n; error #2: client: endpoint http://matrix-node2:2379 exceeded header timeout\n"}
2022-05-09T13:49:28.109+0800    ERROR   cmd/sentinel.go:1843    error retrieving cluster data   {"error": "client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://matrix-node3:2379 exceeded header timeout\n; error #1: client: endpoint http://matrix-node1:2379 exceeded header timeout\n; error #2: client: endpoint http://matrix-node2:2379 exceeded header timeout\n"}
sgotti commented 2 years ago

@jackin853 From the logs, the etcd cluster wasn't accessible, so the sentinels cannot perform the election or anything else. You should try to understand why the sentinel cannot communicate with etcd. I don't see a stolon-related bug here.
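A quick way to check reachability from inside the sentinel pod (assuming curl and/or etcdctl are available in the image; otherwise run the same checks from the node) is something like:

curl -sf http://matrix-node1:2379/health
curl -sf http://matrix-node2:2379/health
curl -sf http://matrix-node3:2379/health
ETCDCTL_API=3 etcdctl --endpoints=http://matrix-node1:2379,http://matrix-node2:2379,http://matrix-node3:2379 endpoint health

If these time out only from inside the pod, the problem is more likely network/DNS between the pod and the etcd nodes than anything in stolon.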

jackin853 commented 2 years ago

@sgotti My earlier description may not have been accurate. etcd became unavailable and was restored later. What I want to know is: after etcd was restored, shouldn't the sentinel have continued trying to elect a leader? Even though there is only one sentinel pod, that pod stayed in a kind of suspended animation.

sgotti commented 2 years ago

@jackin853 So the issue you're describing is that the etcd cluster became unavailable, then became available again, but the sentinel never acquired the leadership? If so:

jackin853 commented 2 years ago

@sgotti Current etcd version:

[root@hgsa1 ~]# etcdctl --version
etcdctl version: 3.5.1
API version: 2

This seems to have been a one-off occurrence and I can't reproduce the issue. Does using etcdv2 versus etcdv3 have any effect on the sentinel component? I'm also curious why the sentinel ended up in this suspended state and what causes the problem. Can it be avoided by using etcdv3? Sorry, not enough logs were saved when the issue occurred. After the problem occurred, the customer asked us to restore service as soon as possible, so I restarted the sentinel pod and everything recovered.
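(For context, every component and stolonctl currently run with --store-backend=etcd, which I understand is the v2 API. If etcdv3 is the recommended backend, I assume the switch would look something like the command below for each component, and that the existing cluster data would not carry over automatically between the v2 and v3 keyspaces?)

stolonctl status --cluster-name=kube-stolon --store-backend=etcdv3 --store-endpoints=http://matrix-node1:2379,http://matrix-node2:2379,http://matrix-node3:2379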