sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

sentinel doesn't fail over when a node fails #794

Open deepdivenow opened 4 years ago

deepdivenow commented 4 years ago

What happened: This problem does not exist on stolon versions 0.10.0 or 0.15.0.

Environment: three CentOS 7 or CentOS 8 nodes, each with sentinel/keeper/proxy/etcd installed. Cluster status before the failure:

    [root@centos17 ~]# stolonctl status

    === Active sentinels ===

    ID        LEADER
    718f47de  false
    79a71967  true
    b17ab76b  false

    === Active proxies ===

    ID
    5207501e
    974cf4dc
    b41ff2ef

    === Keepers ===

    UID         HEALTHY  PG LISTENADDRESS    PG HEALTHY  PG WANTEDGENERATION  PG CURRENTGENERATION
    pgcentos17  true     192.168.1.117:5432  true        5                    5
    pgcentos18  true     192.168.1.118:5432  true        2                    2
    pgcentos19  true     192.168.1.119:5432  true        5                    5

    === Cluster Info ===
    Master Keeper: pgcentos19

    ===== Keepers/DB tree =====

    pgcentos19 (master)
    ├─pgcentos18
    └─pgcentos17

When the master node fails, all other sentinels stay stuck in this state:

    [root@centos17 ~]# stolonctl status

    === Active sentinels ===

    No active sentinels

    === Active proxies ===

    No active proxies

    === Keepers ===

    UID         HEALTHY  PG LISTENADDRESS    PG HEALTHY  PG WANTEDGENERATION  PG CURRENTGENERATION
    pgcentos17  true     192.168.1.117:5432  true        5                    5
    pgcentos18  true     192.168.1.118:5432  true        2                    2
    pgcentos19  true     192.168.1.119:5432  true        5                    5

    === Cluster Info ===
    Master Keeper: pgcentos19

    ===== Keepers/DB tree =====

    pgcentos19 (master)
    ├─pgcentos18
    └─pgcentos17

    Jul 31 08:31:08 centos17 stolon-sentinel[21844]: 2020-07-31T08:31:08.797-0400 INFO cmd/sentinel.go:1964 sentinel uid {"uid": "79a71967"}
    Jul 31 08:31:08 centos17 stolon-sentinel[21844]: 2020-07-31T08:31:08.798-0400 INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
    Jul 31 08:31:26 centos17 stolon-sentinel[21844]: 2020-07-31T08:31:26.496-0400 INFO cmd/sentinel.go:89 sentinel leadership acquired
    Jul 31 08:32:39 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:32:39.733-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
    Jul 31 08:32:39 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:39.733-0400 ERROR cmd/sentinel.go:1807 error retrieving cluster data {"error": "context deadline exceeded"}
    Jul 31 08:32:49 centos17 stolon-sentinel[21844]: {"level":"warn","ts":"2020-07-31T08:32:49.743-0400","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-65e92e5b-07bf-47c3-b800>
    Jul 31 08:32:49 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:49.743-0400 ERROR cmd/sentinel.go:1844 cannot update sentinel info {"error": "context deadline exceeded"}
    Jul 31 08:32:53 centos17 stolon-sentinel[21844]: 2020-07-31T08:32:53.926-0400 INFO cmd/sentinel.go:94 sentinel leadership lost

(The same "retrying of unary invoker failed" warning and "cannot update sentinel info ... context deadline exceeded" error pair then repeats every 10 seconds, from 08:32:59 through 08:33:59.)

What you expected to happen: failover works fine, as in the older stolon versions 0.10.0 / 0.15.0.

How to reproduce it (as minimally and precisely as possible): install/configure stolon release v0.16.0 with etcd v3.3/v3.4 using API v3, then shut down the keeper master node (1 of 3 instances, each running sentinel/keeper/proxy/etcd). My configs: http://kislyak.com/ff/stolon_failover_bug.tgz

Anything else we need to know?:


deepdivenow commented 4 years ago

I did some investigating. All stolon cluster components (sentinel/keeper/proxy) fail when any etcd cluster node fails, if STORE_ENDPOINTS is set to the list of all etcd nodes; if STORE_ENDPOINTS is set to a single node, the cluster works fine. The etcdv3 library used in stolon 0.15.0 creates the etcd client (etcdclientv3.New) with a HealthBalancer by default, and this balancer checks etcd cluster status every 3 seconds.
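To make the balancer's role concrete, here is a minimal, self-contained sketch of the idea behind the old health-check balancer: on every tick, probe each configured endpoint and route requests only to the ones that answer. This is illustrative only; the function and probe names are made up and this is not the actual etcd clientv3 code.

```go
package main

import "fmt"

// healthyEndpoints mimics, in miniature, what the HealthBalancer did on each
// 3-second tick: probe every configured endpoint and keep only the healthy
// ones, so requests never get routed to a dead etcd node.
func healthyEndpoints(endpoints []string, probe func(string) bool) []string {
	healthy := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if probe(ep) {
			healthy = append(healthy, ep)
		}
	}
	return healthy
}

func main() {
	eps := []string{"192.168.1.117:2379", "192.168.1.118:2379", "192.168.1.119:2379"}
	// Pretend .119 (the failed master node) no longer answers probes.
	up := func(ep string) bool { return ep != "192.168.1.119:2379" }
	fmt.Println(healthyEndpoints(eps, up))
	// prints [192.168.1.117:2379 192.168.1.118:2379]
}
```

Without such periodic pruning, a client that happens to be pinned to the failed endpoint keeps retrying it until every request hits "context deadline exceeded", which matches the sentinel logs above.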

The etcdv3 library in stolon 0.16.0 does not use the health-check balancer, and by default it does not check etcd cluster status at all. I think additional options need to be set when creating the etcdv3 client, for example like this in internal/store/kvBackend.go:


        config := etcdclientv3.Config{
            Endpoints: addrs,
            TLS:       tlsConfig,
            // With keepalives enabled, the client pings each gRPC connection
            // every 15s and drops it if a ping gets no answer within 5s, so a
            // dead etcd endpoint is detected instead of requests hanging on it.
            DialKeepAliveTime:    time.Second * 15,
            DialKeepAliveTimeout: time.Second * 5,
        }

        c, err := etcdclientv3.New(config)