Open deepdivenow opened 4 years ago
I did some investigation. All components of the stolon cluster (sentinel/keeper/proxy) fail when any etcd cluster node fails. This happens when STORE_ENDPOINTS is set to the list of all etcd nodes; if STORE_ENDPOINTS is set to a single node, the cluster works fine. The etcdv3 library used in stolon 0.15.0 creates the etcd client (etcdclientv3.New) with a HealthBalancer by default; this balancer checks the etcd cluster status every 3 seconds.
The etcdv3 library in stolon 0.16.0 doesn't use the health-check balancer and by default doesn't check the etcd cluster status. I think additional options need to be set when creating the etcdv3 client, for example like this in internal/store/kvBackend.go:
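config := etcdclientv3.Config{
	Endpoints: addrs,
	TLS:       tlsConfig,
	// Ping the server after this interval to check that the transport is alive.
	DialKeepAliveTime: time.Second * 15,
	// Close the connection if the keepalive probe gets no response in this time.
	DialKeepAliveTimeout: time.Second * 5,
}
c, err := etcdclientv3.New(config)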
What happened: This problem does not exist on stolon versions 0.10.0 or 0.15.0.
Environment: 3 CentOS 7 or 8 nodes, each with sentinel/keeper/proxy/etcd installed:
[root@centos17 ~]# stolonctl status
=== Active sentinels ===
ID        LEADER
718f47de  false
79a71967  true
b17ab76b  false
=== Active proxies ===
ID
5207501e
974cf4dc
b41ff2ef
=== Keepers ===
UID         HEALTHY  PG LISTENADDRESS    PG HEALTHY  PG WANTEDGENERATION  PG CURRENTGENERATION
pgcentos17  true     192.168.1.117:5432  true        5                    5
pgcentos18  true     192.168.1.118:5432  true        2                    2
pgcentos19  true     192.168.1.119:5432  true        5                    5
=== Cluster Info ===
Master Keeper: pgcentos19
===== Keepers/DB tree =====
pgcentos19 (master)
├─pgcentos18
└─pgcentos17
On node failure (master node), all other sentinels remain in this state:
[root@centos17 ~]# stolonctl status
=== Active sentinels ===
No active sentinels
=== Active proxies ===
No active proxies
=== Keepers ===
UID         HEALTHY  PG LISTENADDRESS    PG HEALTHY  PG WANTEDGENERATION  PG CURRENTGENERATION
pgcentos17  true     192.168.1.117:5432  true        5                    5
pgcentos18  true     192.168.1.118:5432  true        2                    2
pgcentos19  true     192.168.1.119:5432  true        5                    5
=== Cluster Info ===
Master Keeper: pgcentos19
===== Keepers/DB tree =====
pgcentos19 (master)
├─pgcentos18
└─pgcentos17
Jul 31 08:31:08 centos17 stolon-sentinel[21844]: 2020-07-31T08:31:08.797-0400 INFO cmd/sentinel.go:1964 sentinel uid {"uid": "79a71967"}
Jul 31 08:31:08 centos17 stolon-sentinel[21844]: 2020-07-31T08:31:08.798-0400 INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
Jul 31 08:31:26 centos17 stolon-sentinel[21844]: 2020-07-31T08:31:26.496-0400 INFO cmd/sentinel.go:89 sentinel leadership acquired
Jul 31 08:32:39 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:32:39.733-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:32:39 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:39.733-0400 ERROR cmd/sentinel.go:1807 error retrieving cluster data {"error": "context deadline exceeded"}
Jul 31 08:32:49 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:32:49.743-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:32:49 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:49.743-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:32:53 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:53.926-0400 INFO cmd/sentinel.go:94 sentinel leadership lost
Jul 31 08:32:59 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:32:59.755-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:32:59 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:59.755-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:33:09 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:33:09.767-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:33:09 centos17 stolon-sentinel[21844]: 2020-07-31T08:33:09.768-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:33:19 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:33:19.780-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:33:19 centos17 stolon-sentinel[21844]: 2020-07-31T08:33:19.780-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:33:29 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:33:29.791-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:33:29 centos17 stolon-sentinel[21844]: 2020-07-31T08:33:29.791-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:33:39 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:33:39.803-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:33:39 centos17 stolon-sentinel[21844]: 2020-07-31T08:33:39.803-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:33:49 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:33:49.815-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:33:49 centos17 stolon-sentinel[21844]: 2020-07-31T08:33:49.815-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
Jul 31 08:33:59 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:33:59.827-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
Jul 31 08:33:59 centos17 stolon-sentinel[21844]: 2020-07-31T08:33:59.827-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
What you expected to happen: Failover works as it does in older stolon versions (0.10.0, 0.15.0).
How to reproduce it (as minimally and precisely as possible): Install and configure stolon release v0.16.0 with etcd v3.3/v3.4 using API v3. Shut down the master keeper node (1 of 3 instances, each running sentinel/keeper/proxy/etcd). My configs: http://kislyak.com/ff/stolon_failover_bug.tgz
Anything else we need to know?:
Environment: