sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

consul doesn't expire proxy keys #397

Closed · nh2 closed this issue 6 years ago

nh2 commented 6 years ago

Submission type

Environment

NixOS

Stolon version

0.7.0

Additional environment information if useful to understand the bug

My stolon cluster with synchronous replication went into a stuck state; the Consul contents and service logs below show the details.

Consul state

Here is the entire consul kv get -recurse:

stolon/cluster/stolon-cluster/clusterdata:{"FormatVersion":1,"Cluster":{"uid":"eb69e690","generation":1,"changeTime":"2017-08-03T00:17:30.741021113Z","spec":{"sleepInterval":"1s","requestTimeout":"1s","failInterval":"1s","synchronousReplication":true,"minSynchronousStandbys":1,"maxSynchronousStandbys":2,"additionalWalSenders":null,"usePgrewind":true,"initMode":"new","pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null},"status":{"phase":"normal","master":"68afe44d"}},"Keepers":{"node_1":{"uid":"node_1","generation":1,"changeTime":"0001-01-01T00:00:00Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2017-12-30T15:53:09.777366127Z","bootUUID":"fb509513-48a4-486f-8117-52ee2a5428fb"}},"node_2":{"uid":"node_2","generation":1,"changeTime":"0001-01-01T00:00:00Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2017-12-30T15:53:09.777298942Z","bootUUID":"32b1c88b-14a8-4075-b374-19c94d16e75a"}},"node_3":{"uid":"node_3","generation":1,"changeTime":"0001-01-01T00:00:00Z","spec":{},"status":{"lastHealthyTime":"2017-12-30T15:38:52.41998612Z","bootUUID":"eee3d63d-8ce0-42c3-b59d-08839ce4ea20"}}},"DBs":{"03d0d281":{"uid":"03d0d281","generation":2,"changeTime":"2017-12-30T03:48:31.461779244Z","spec":{"keeperUID":"node_2","requestTimeout":"1s","maxStandbys":20,"synchronousReplication":true,"usePgrewind":true,"additionalWalSenders":5,"initMode":"none","pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"ff0af56a"},"followers":[],"synchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.0.0.2","port":"5432","systemdID":"6449835973375088046","timelineID":41,"xLogPos":4010267952,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target 
specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target 
specified"}],"pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":[],"olderWalFile":"0000002900000000000000EE"}},"68afe44d":{"uid":"68afe44d","generation":5,"changeTime":"2017-12-30T15:38:54.416639501Z","spec":{"keeperUID":"node_1","requestTimeout":"1s","maxStandbys":20,"synchronousReplication":true,"usePgrewind":true,"additionalWalSenders":5,"initMode":"none","pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null,"role":"master","followers":[],"synchronousStandbys":["ff0af56a"]},"status":{"healthy":true,"currentGeneration":5,"listenAddress":"10.0.0.1","port":"5432","systemdID":"6449835973375088046","timelineID":42,"xLogPos":4010268000,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target 
specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target 
specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"},{"timelineID":41,"switchPoint":4010267280,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["ff0af56a"],"olderWalFile":"0000002700000000000000EA"}},"ff0af56a":{"uid":"ff0af56a","generation":30,"changeTime":"2017-12-30T15:38:54.416705191Z","spec":{"keeperUID":"node_3","requestTimeout":"1s","maxStandbys":20,"synchronousReplication":true,"usePgrewind":true,"additionalWalSenders":5,"initMode":"none","pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null,"role":"master","followers":[],"synchronousStandbys":["03d0d281","68afe44d"]},"status":{"currentGeneration":29,"listenAddress":"10.0.0.3","port":"5432","systemdID":"6449835973375088046","timelineID":41,"xLogPos":4010263432,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target 
specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["03d0d281","68afe44d"],"olderWalFile":"0000002800000000000000EC"}}},"Proxy":{"generation":294,"changeTime":"2017-12-30T15:38:54.416496127Z","spec":{"enabledProxies":["9fd707d7","da5fd3f9","f9cea2d0"]},"status":{}}}
stolon/cluster/stolon-cluster/keepers/info/node_1:{"infoUID":"4a62361b","uid":"node_1","clusterUID":"eb69e690","bootUUID":"fb509513-48a4-486f-8117-52ee2a5428fb","postgresState":{"uid":"68afe44d","generation":5,"listenAddress":"10.0.0.1","port":"5432","healthy":true,"systemID":"6449835973375088046","timelineID":42,"xLogPos":4010268000,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target 
specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"},{"timelineID":41,"switchPoint":4010267280,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["ff0af56a"],"olderWalFile":"0000002700000000000000EA"}}
stolon/cluster/stolon-cluster/keepers/info/node_2:{"infoUID":"50a960f0","uid":"node_2","clusterUID":"eb69e690","bootUUID":"32b1c88b-14a8-4075-b374-19c94d16e75a","postgresState":{"uid":"03d0d281","generation":2,"listenAddress":"10.0.0.2","port":"5432","healthy":true,"systemID":"6449835973375088046","timelineID":41,"xLogPos":4010267952,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target 
specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":[],"olderWalFile":"0000002900000000000000EE"}}
stolon/cluster/stolon-cluster/proxies/info/9fd707d7:{"UID":"9fd707d7","Generation":293}
stolon/cluster/stolon-cluster/proxies/info/da5fd3f9:{"UID":"da5fd3f9","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/f9cea2d0:{"UID":"f9cea2d0","Generation":294}
stolon/cluster/stolon-cluster/sentinel-leader:2240a711
stolon/cluster/stolon-cluster/sentinels/info/2240a711:{"UID":"2240a711"}
stolon/cluster/stolon-cluster/sentinels/info/4f490a5e:{"UID":"4f490a5e"}
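The stuck state is visible in the dump above: the proxy spec is at generation 294 with three enabled proxies, but one of them (9fd707d7) still reports generation 293, so the sentinel keeps logging "waiting for proxies to close connections to old master". A minimal sketch of that generation check (variable names are mine, values copied from the dump; this is an illustration of the apparent wait condition, not stolon's actual code):

```python
import json

# Values taken from the `consul kv get -recurse` output above.
proxy_spec = json.loads(
    '{"generation": 294, '
    '"enabledProxies": ["9fd707d7", "da5fd3f9", "f9cea2d0"]}')
proxy_info = {"9fd707d7": 293, "da5fd3f9": 294, "f9cea2d0": 294}

# A proxy has acknowledged the new master only once it reports the
# current proxy spec generation; any lagging proxy blocks progress.
stale = [uid for uid in proxy_spec["enabledProxies"]
         if proxy_info.get(uid, -1) != proxy_spec["generation"]]
print(stale)  # ['9fd707d7']
```

Since the proxies/info key for 9fd707d7 never expires from Consul, the stale entry keeps the cluster blocked indefinitely.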

node-1

# systemctl status stolon-sentinel stolon-keeper stolon-proxy | cat
● stolon-sentinel.service
   Loaded: loaded (/nix/store/m3bp4q8abxmbiqb99hsc4drwhzbxda28-unit-stolon-sentinel.service/stolon-sentinel.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2017-12-29 20:34:19 UTC; 19h ago
 Main PID: 1338 (stolon-sentinel)
    Tasks: 13 (limit: 4915)
   CGroup: /system.slice/stolon-sentinel.service
           └─1338 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-sentinel --cluster-name stolon-cluster --store-backend=consul --initial-cluster-spec /etc/stolon/stolon-cluster.initial-cluster-spec

Dec 30 15:56:18 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:18.882Z        ERROR        sentinel/sentinel.go:287        no keeper info available        {"db": "ff0af56a", "keeper": "node_3"}
Dec 30 15:56:18 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:18.885Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 15:56:20 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:20.758Z        ERROR        sentinel/sentinel.go:287        no keeper info available        {"db": "ff0af56a", "keeper": "node_3"}
Dec 30 15:56:20 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:20.760Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 15:56:22 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:22.299Z        ERROR        sentinel/sentinel.go:287        no keeper info available        {"db": "ff0af56a", "keeper": "node_3"}
Dec 30 15:56:22 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:22.302Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 15:56:23 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:23.703Z        ERROR        sentinel/sentinel.go:287        no keeper info available        {"db": "ff0af56a", "keeper": "node_3"}
Dec 30 15:56:23 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:23.705Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 15:56:25 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:25.025Z        ERROR        sentinel/sentinel.go:287        no keeper info available        {"db": "ff0af56a", "keeper": "node_3"}
Dec 30 15:56:25 node-1 stolon-sentinel[1338]: 2017-12-30T15:56:25.028Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master

● stolon-keeper.service
   Loaded: loaded (/nix/store/nv54kip600rdnrfd4s4mj7hi9nv9brb7-unit-stolon-keeper.service/stolon-keeper.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2017-12-29 20:34:19 UTC; 19h ago
 Main PID: 1334 (stolon-keeper)
    Tasks: 31 (limit: 4915)
   CGroup: /system.slice/stolon-keeper.service
           ├─ 1334 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-keeper --cluster-name stolon-cluster --store-backend=consul --uid node_1 --data-dir /var/lib/postgres/node_1 --pg-listen-address=10.0.0.1 --pg-bin-path=/nix/store/ps50b6b8fm0z9nkgpaz1f0ifq00125yr-postgresql-9.6.5/bin --pg-su-username=postgres --pg-su-passwordfile=/etc/stolon/stolon-cluster.password --pg-repl-username=postgres --pg-repl-passwordfile=/etc/stolon/stolon-cluster.password
           ├─ 8978 /nix/store/ps50b6b8fm0z9nkgpaz1f0ifq00125yr-postgresql-9.6.5/bin/postgres -D /var/lib/postgres/node_1/postgres -c unix_socket_directories=/tmp
           ├─ 8979 postgres: logger process   
           ├─ 8981 postgres: checkpointer process   
           ├─ 8982 postgres: writer process   
           ├─ 8983 postgres: stats collector process   
           ├─16410 postgres: postgres postgres [local] idle
           ├─16411 postgres: postgres postgres [local] idle
           ├─16412 postgres: postgres postgres [local] idle
           ├─16413 postgres: postgres postgres [local] idle
           ├─16414 postgres: postgres postgres [local] idle
           ├─16415 postgres: postgres postgres [local] idle
           ├─16416 postgres: postgres postgres [local] idle
           ├─16417 postgres: postgres postgres [local] idle
           ├─16418 postgres: postgres postgres [local] idle
           ├─16419 postgres: postgres postgres [local] idle
           ├─16421 postgres: wal sender process postgres [local] idle
           ├─29304 postgres: wal writer process   
           └─29305 postgres: autovacuum launcher process   

Dec 30 15:56:22 node-1 stolon-keeper[1334]: 2017-12-30T15:56:22.676Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:56:22 node-1 stolon-keeper[1334]: 2017-12-30T15:56:22.676Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Dec 30 15:56:23 node-1 stolon-keeper[1334]: 2017-12-30T15:56:23.802Z        INFO        keeper/keeper.go:1314        our db requested role is master
Dec 30 15:56:23 node-1 stolon-keeper[1334]: 2017-12-30T15:56:23.802Z        INFO        keeper/keeper.go:1334        already master
Dec 30 15:56:23 node-1 stolon-keeper[1334]: 2017-12-30T15:56:23.804Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:56:23 node-1 stolon-keeper[1334]: 2017-12-30T15:56:23.804Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Dec 30 15:56:24 node-1 stolon-keeper[1334]: 2017-12-30T15:56:24.982Z        INFO        keeper/keeper.go:1314        our db requested role is master
Dec 30 15:56:24 node-1 stolon-keeper[1334]: 2017-12-30T15:56:24.982Z        INFO        keeper/keeper.go:1334        already master
Dec 30 15:56:24 node-1 stolon-keeper[1334]: 2017-12-30T15:56:24.984Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:56:24 node-1 stolon-keeper[1334]: 2017-12-30T15:56:24.984Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed

● stolon-proxy.service
   Loaded: loaded (/nix/store/wnr63ngqb845irpii6128m0v8f2347xn-unit-stolon-proxy.service/stolon-proxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2017-12-29 20:34:19 UTC; 19h ago
 Main PID: 1335 (stolon-proxy)
    Tasks: 10 (limit: 4915)
   CGroup: /system.slice/stolon-proxy.service
           └─1335 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-proxy --cluster-name stolon-cluster --store-backend=consul --listen-address 0.0.0.0 --port 25432

Dec 30 15:55:35 node-1 stolon-proxy[1335]: 2017-12-30T15:55:35.547Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:41 node-1 stolon-proxy[1335]: 2017-12-30T15:55:41.443Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:46 node-1 stolon-proxy[1335]: 2017-12-30T15:55:46.600Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:51 node-1 stolon-proxy[1335]: 2017-12-30T15:55:51.874Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:57 node-1 stolon-proxy[1335]: 2017-12-30T15:55:57.299Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:02 node-1 stolon-proxy[1335]: 2017-12-30T15:56:02.532Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:07 node-1 stolon-proxy[1335]: 2017-12-30T15:56:07.940Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:13 node-1 stolon-proxy[1335]: 2017-12-30T15:56:13.203Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:18 node-1 stolon-proxy[1335]: 2017-12-30T15:56:18.537Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:24 node-1 stolon-proxy[1335]: 2017-12-30T15:56:24.151Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}

node-2

# systemctl status stolon-sentinel stolon-keeper stolon-proxy | cat
● stolon-sentinel.service
   Loaded: loaded (/nix/store/m3bp4q8abxmbiqb99hsc4drwhzbxda28-unit-stolon-sentinel.service/stolon-sentinel.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-12-20 07:36:54 UTC; 1 weeks 3 days ago
 Main PID: 1349 (stolon-sentinel)
    Tasks: 21 (limit: 4915)
   CGroup: /system.slice/stolon-sentinel.service
           └─1349 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-sentinel --cluster-name stolon-cluster --store-backend=consul --initial-cluster-spec /etc/stolon/stolon-cluster.initial-cluster-spec

Dec 30 15:37:51 node-2 stolon-sentinel[1349]: 2017-12-30T15:37:51.846Z        ERROR        sentinel/sentinel.go:1580        error retrieving cluster data        {"error": "Unexpected response code: 500"}
Dec 30 15:38:00 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:00.111Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 15:38:00 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:00.157Z        ERROR        sentinel/sentinel.go:1580        error retrieving cluster data        {"error": "Unexpected response code: 500"}
Dec 30 15:38:07 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:07.251Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (No cluster leader)"}
Dec 30 15:38:08 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:08.196Z        ERROR        sentinel/sentinel.go:1580        error retrieving cluster data        {"error": "Unexpected response code: 500"}
Dec 30 15:38:16 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:16.296Z        ERROR        sentinel/sentinel.go:1580        error retrieving cluster data        {"error": "Unexpected response code: 500"}
Dec 30 15:38:17 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:17.251Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 15:38:24 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:24.328Z        ERROR        sentinel/sentinel.go:1580        error retrieving cluster data        {"error": "Unexpected response code: 500"}
Dec 30 15:38:24 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:24.400Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (No cluster leader)"}
Dec 30 15:38:34 node-2 stolon-sentinel[1349]: 2017-12-30T15:38:34.400Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership

● stolon-keeper.service
   Loaded: loaded (/nix/store/s99yw78h5nv34v29jcxjs85kglvf1i5l-unit-stolon-keeper.service/stolon-keeper.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-12-20 07:36:54 UTC; 1 weeks 3 days ago
 Main PID: 1351 (stolon-keeper)
    Tasks: 98 (limit: 4915)
   CGroup: /system.slice/stolon-keeper.service
           ├─ 1351 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-keeper --cluster-name stolon-cluster --store-backend=consul --uid node_2 --data-dir /var/lib/postgres/node_2 --pg-listen-address=10.0.0.2 --pg-bin-path=/nix/store/ps50b6b8fm0z9nkgpaz1f0ifq00125yr-postgresql-9.6.5/bin --pg-su-username=postgres --pg-su-passwordfile=/etc/stolon/stolon-cluster.password --pg-repl-username=postgres --pg-repl-passwordfile=/etc/stolon/stolon-cluster.password
           ├─ 9207 postgres: wal receiver process   streaming 0/EF07D530
           ├─ 9979 postgres: postgres postgres [local] idle
           ├─ 9980 postgres: postgres postgres [local] idle
           ├─ 9981 postgres: postgres postgres [local] idle
           ├─ 9982 postgres: postgres postgres [local] idle
           ├─ 9983 postgres: postgres postgres [local] idle
           ├─ 9984 postgres: postgres postgres [local] idle
           ├─ 9985 postgres: postgres postgres [local] idle
           ├─ 9986 postgres: postgres postgres [local] idle
           ├─ 9987 postgres: postgres postgres [local] idle
           ├─ 9988 postgres: postgres postgres [local] idle
           ├─ 9989 postgres: postgres postgres [local] idle
           ├─ 9990 postgres: postgres postgres [local] idle
           ├─ 9991 postgres: postgres postgres [local] idle
           ├─ 9992 postgres: wal sender process postgres [local] idle
           ├─11214 postgres: postgres postgres [local] idle
           ├─11218 postgres: postgres postgres [local] idle
           ├─11221 postgres: postgres postgres [local] idle
           ├─11222 postgres: postgres postgres [local] idle
           ├─11223 postgres: postgres postgres [local] idle
           ├─11224 postgres: postgres postgres [local] idle
           ├─11225 postgres: postgres postgres [local] idle
           ├─11226 postgres: postgres postgres [local] idle
           ├─11227 postgres: postgres postgres [local] idle
           ├─11228 postgres: postgres postgres [local] idle
           ├─11229 postgres: wal sender process postgres [local] idle
           ├─15454 /nix/store/ps50b6b8fm0z9nkgpaz1f0ifq00125yr-postgresql-9.6.5/bin/postgres -D /var/lib/postgres/node_2/postgres -c unix_socket_directories=/tmp
           ├─17148 postgres: logger process   
           ├─17149 postgres: startup process   recovering 0000002900000000000000EF
           ├─18789 postgres: wal sender process postgres [local] idle
           ├─22733 postgres: checkpointer process   
           ├─22734 postgres: writer process   
           ├─22752 postgres: stats collector process   
           ├─22761 postgres: postgres postgres [local] idle
           ├─22792 postgres: postgres postgres [local] idle
           ├─22825 postgres: postgres postgres [local] idle
           ├─22855 postgres: postgres postgres [local] idle
           ├─22888 postgres: postgres postgres [local] idle
           ├─22919 postgres: postgres postgres [local] idle
           ├─22950 postgres: postgres postgres [local] idle
           ├─22981 postgres: postgres postgres [local] idle
           ├─23020 postgres: postgres postgres [local] idle
           ├─23051 postgres: postgres postgres [local] idle
           ├─23084 postgres: postgres postgres [local] idle
           ├─23113 postgres: postgres postgres [local] idle
           ├─23144 postgres: postgres postgres [local] idle
           ├─23177 postgres: postgres postgres [local] idle
           ├─23207 postgres: postgres postgres [local] idle
           ├─23240 postgres: postgres postgres [local] idle
           ├─23269 postgres: postgres postgres [local] idle
           ├─23302 postgres: postgres postgres [local] idle
           ├─23331 postgres: postgres postgres [local] idle
           ├─23364 postgres: postgres postgres [local] idle
           ├─23395 postgres: postgres postgres [local] idle
           ├─23439 postgres: postgres postgres [local] idle
           ├─23470 postgres: postgres postgres [local] idle
           ├─23569 postgres: postgres postgres [local] idle
           ├─27959 postgres: postgres postgres [local] idle
           ├─27960 postgres: postgres postgres [local] idle
           ├─27961 postgres: postgres postgres [local] idle
           ├─27962 postgres: postgres postgres [local] idle
           ├─27963 postgres: postgres postgres [local] idle
           ├─27964 postgres: postgres postgres [local] idle
           ├─27965 postgres: postgres postgres [local] idle
           ├─27966 postgres: postgres postgres [local] idle
           ├─27967 postgres: postgres postgres [local] idle
           ├─27968 postgres: postgres postgres [local] idle
           ├─27969 postgres: postgres postgres [local] idle
           ├─27970 postgres: postgres postgres [local] idle
           ├─27971 postgres: postgres postgres [local] idle
           ├─27973 postgres: wal sender process postgres [local] idle
           ├─29646 postgres: postgres postgres [local] idle
           ├─29647 postgres: postgres postgres [local] idle
           ├─29648 postgres: postgres postgres [local] idle
           ├─29649 postgres: postgres postgres [local] idle
           └─29651 postgres: wal sender process postgres [local] idle

Dec 30 15:56:11 node-2 stolon-keeper[1351]: 2017-12-30T15:56:11.541Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:56:11 node-2 stolon-keeper[1351]: 2017-12-30T15:56:11.541Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Dec 30 15:56:12 node-2 stolon-keeper[1351]: 2017-12-30T15:56:12.690Z        INFO        keeper/keeper.go:1348        our db requested role is standby        {"followedDB": "ff0af56a"}
Dec 30 15:56:12 node-2 stolon-keeper[1351]: 2017-12-30T15:56:12.690Z        INFO        keeper/keeper.go:1367        already standby
Dec 30 15:56:12 node-2 stolon-keeper[1351]: 2017-12-30T15:56:12.690Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:56:12 node-2 stolon-keeper[1351]: 2017-12-30T15:56:12.690Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Dec 30 15:56:14 node-2 stolon-keeper[1351]: 2017-12-30T15:56:14.111Z        INFO        keeper/keeper.go:1348        our db requested role is standby        {"followedDB": "ff0af56a"}
Dec 30 15:56:14 node-2 stolon-keeper[1351]: 2017-12-30T15:56:14.111Z        INFO        keeper/keeper.go:1367        already standby
Dec 30 15:56:14 node-2 stolon-keeper[1351]: 2017-12-30T15:56:14.111Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:56:14 node-2 stolon-keeper[1351]: 2017-12-30T15:56:14.111Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed

● stolon-proxy.service
   Loaded: loaded (/nix/store/wnr63ngqb845irpii6128m0v8f2347xn-unit-stolon-proxy.service/stolon-proxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-12-20 07:36:54 UTC; 1 weeks 3 days ago
 Main PID: 1348 (stolon-proxy)
    Tasks: 19 (limit: 4915)
   CGroup: /system.slice/stolon-proxy.service
           └─1348 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-proxy --cluster-name stolon-cluster --store-backend=consul --listen-address 0.0.0.0 --port 25432

Dec 30 15:55:24 node-2 stolon-proxy[1348]: 2017-12-30T15:55:24.918Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:30 node-2 stolon-proxy[1348]: 2017-12-30T15:55:30.582Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:36 node-2 stolon-proxy[1348]: 2017-12-30T15:55:36.151Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:41 node-2 stolon-proxy[1348]: 2017-12-30T15:55:41.441Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:46 node-2 stolon-proxy[1348]: 2017-12-30T15:55:46.599Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:52 node-2 stolon-proxy[1348]: 2017-12-30T15:55:52.258Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:55:57 node-2 stolon-proxy[1348]: 2017-12-30T15:55:57.508Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:02 node-2 stolon-proxy[1348]: 2017-12-30T15:56:02.690Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:07 node-2 stolon-proxy[1348]: 2017-12-30T15:56:07.939Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:56:13 node-2 stolon-proxy[1348]: 2017-12-30T15:56:13.203Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}

node-3

# systemctl status stolon-sentinel stolon-keeper stolon-proxy | cat
● stolon-sentinel.service
   Loaded: loaded (/nix/store/m3bp4q8abxmbiqb99hsc4drwhzbxda28-unit-stolon-sentinel.service/stolon-sentinel.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-12-20 07:36:54 UTC; 1 weeks 3 days ago
 Main PID: 19884 (stolon-sentinel)
    Tasks: 21 (limit: 4915)
   CGroup: /system.slice/stolon-sentinel.service
           └─19884 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-sentinel --cluster-name stolon-cluster --store-backend=consul --initial-cluster-spec /etc/stolon/stolon-cluster.initial-cluster-spec

Dec 30 15:37:07 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:07.509Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (rpc error: No cluster leader)"}
Dec 30 15:37:17 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:17.510Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 15:37:24 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:24.574Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (rpc error: No cluster leader)"}
Dec 30 15:37:34 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:34.574Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 15:37:41 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:41.771Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (rpc error: No cluster leader)"}
Dec 30 15:37:51 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:51.771Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 15:37:58 node-3 stolon-sentinel[19884]: 2017-12-30T15:37:58.908Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (rpc error: No cluster leader)"}
Dec 30 15:38:08 node-3 stolon-sentinel[19884]: 2017-12-30T15:38:08.908Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 15:38:16 node-3 stolon-sentinel[19884]: 2017-12-30T15:38:16.237Z        ERROR        sentinel/sentinel.go:127        election loop error        {"error": "Unexpected response code: 500 (rpc error: No cluster leader)"}
Dec 30 15:38:26 node-3 stolon-sentinel[19884]: 2017-12-30T15:38:26.237Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership

● stolon-keeper.service
   Loaded: loaded (/nix/store/z1n8br6q18xlx4lqbbsm9dg1ql9hj04x-unit-stolon-keeper.service/stolon-keeper.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-12-20 07:36:54 UTC; 1 weeks 3 days ago
 Main PID: 19880 (stolon-keeper)
    Tasks: 29 (limit: 4915)
   CGroup: /system.slice/stolon-keeper.service
           ├─ 8205 postgres: wal sender process postgres 10.0.0.2(59502) streaming 0/EF07D610
           ├─19880 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-keeper --cluster-name stolon-cluster --store-backend=consul --uid node_3 --data-dir /var/lib/postgres/node_3 --pg-listen-address=10.0.0.3 --pg-bin-path=/nix/store/ps50b6b8fm0z9nkgpaz1f0ifq00125yr-postgresql-9.6.5/bin --pg-su-username=postgres --pg-su-passwordfile=/etc/stolon/stolon-cluster.password --pg-repl-username=postgres --pg-repl-passwordfile=/etc/stolon/stolon-cluster.password
           ├─25507 /nix/store/ps50b6b8fm0z9nkgpaz1f0ifq00125yr-postgresql-9.6.5/bin/postgres -D /var/lib/postgres/node_3/postgres -c unix_socket_directories=/tmp
           ├─25508 postgres: logger process   
           ├─25512 postgres: checkpointer process   
           ├─25513 postgres: writer process   
           ├─25514 postgres: stats collector process   
           ├─26881 postgres: wal writer process   
           └─26882 postgres: autovacuum launcher process   

Dec 30 15:57:15 node-3 stolon-keeper[19880]: 2017-12-30T15:57:15.618Z        INFO        keeper/keeper.go:808        dropping replication slot since db not marked as follower        {"slot": "stolon_03d0d281", "db": "03d0d281"}
Dec 30 15:57:15 node-3 stolon-keeper[19880]: 2017-12-30T15:57:15.619Z        ERROR        keeper/keeper.go:810        failed to drop replication slot        {"slotName": "stolon_03d0d281", "err": "pq: replication slot \"stolon_03d0d281\" is active for PID 8205"}
Dec 30 15:57:15 node-3 stolon-keeper[19880]: 2017-12-30T15:57:15.620Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:57:15 node-3 stolon-keeper[19880]: 2017-12-30T15:57:15.620Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Dec 30 15:57:16 node-3 stolon-keeper[19880]: 2017-12-30T15:57:16.630Z        INFO        keeper/keeper.go:1314        our db requested role is master
Dec 30 15:57:16 node-3 stolon-keeper[19880]: 2017-12-30T15:57:16.630Z        INFO        keeper/keeper.go:1334        already master
Dec 30 15:57:16 node-3 stolon-keeper[19880]: 2017-12-30T15:57:16.634Z        INFO        keeper/keeper.go:808        dropping replication slot since db not marked as follower        {"slot": "stolon_03d0d281", "db": "03d0d281"}
Dec 30 15:57:16 node-3 stolon-keeper[19880]: 2017-12-30T15:57:16.637Z        ERROR        keeper/keeper.go:810        failed to drop replication slot        {"slotName": "stolon_03d0d281", "err": "pq: replication slot \"stolon_03d0d281\" is active for PID 8205"}
Dec 30 15:57:16 node-3 stolon-keeper[19880]: 2017-12-30T15:57:16.638Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 15:57:16 node-3 stolon-keeper[19880]: 2017-12-30T15:57:16.638Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed

● stolon-proxy.service
   Loaded: loaded (/nix/store/wnr63ngqb845irpii6128m0v8f2347xn-unit-stolon-proxy.service/stolon-proxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-12-20 07:36:54 UTC; 1 weeks 3 days ago
 Main PID: 19881 (stolon-proxy)
    Tasks: 19 (limit: 4915)
   CGroup: /system.slice/stolon-proxy.service
           └─19881 /nix/store/7h67r59kxvlfpsnajqyr2l0y4a8rzslc-stolon-0.7.0/bin/stolon-proxy --cluster-name stolon-cluster --store-backend=consul --listen-address 0.0.0.0 --port 25432

Dec 30 15:00:09 node-3 stolon-proxy[19881]: 2017-12-30T15:00:09.272Z        INFO        proxy/proxy.go:256        master address        {"address": "10.0.0.3:5432"}
Dec 30 15:00:12 node-3 stolon-proxy[19881]: 2017-12-30T15:00:12.052Z        INFO        proxy/proxy.go:270        proxying to master address        {"address": "10.0.0.3:5432"}
Dec 30 15:00:17 node-3 stolon-proxy[19881]: 2017-12-30T15:00:17.059Z        INFO        proxy/proxy.go:256        master address        {"address": "10.0.0.3:5432"}
Dec 30 15:00:27 node-3 stolon-proxy[19881]: 2017-12-30T15:00:27.053Z        INFO        proxy/proxy.go:285        check timeout timer fired
Dec 30 15:00:27 node-3 stolon-proxy[19881]: 2017-12-30T15:00:27.053Z        INFO        proxy/proxy.go:172        Stopping listening
Dec 30 15:00:45 node-3 stolon-proxy[19881]: 2017-12-30T15:00:45.514Z        INFO        proxy/proxy.go:270        proxying to master address        {"address": "10.0.0.3:5432"}
Dec 30 15:00:50 node-3 stolon-proxy[19881]: 2017-12-30T15:00:50.519Z        INFO        proxy/proxy.go:143        Starting proxying
Dec 30 15:00:50 node-3 stolon-proxy[19881]: 2017-12-30T15:00:50.524Z        INFO        proxy/proxy.go:256        master address        {"address": "10.0.0.3:5432"}
Dec 30 15:01:00 node-3 stolon-proxy[19881]: 2017-12-30T15:01:00.515Z        INFO        proxy/proxy.go:285        check timeout timer fired
Dec 30 15:01:00 node-3 stolon-proxy[19881]: 2017-12-30T15:01:00.516Z        INFO        proxy/proxy.go:172        Stopping listening

Observations

On node-3, this seems like a problem:

dropping replication slot since db not marked as follower        {"slot": "stolon_03d0d281", "db": "03d0d281"}
 failed to drop replication slot        {"slotName": "stolon_03d0d281", "err": "pq: replication slot \"stolon_03d0d281\" is active for PID 8205"}

Also, there is an Unexpected response code: 500 (No cluster leader) error in the sentinel log on node-2, but note that its timestamp is 30 minutes older than the other timestamps, so Consul has had a leader again for a long time by now.

Is it problematic that the sentinel hasn't output any info for the last 30 minutes?
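The stuck replication slot above can also be cleared by hand. A minimal sketch, assuming superuser access over the keeper's unix socket in /tmp (as in the keeper command lines above); the slot name and PID come from the node-3 logs, the drop_stale_slot helper name is hypothetical, and PSQL is overridable so the commands can be dry-run:

```shell
# Terminate the backend holding a replication slot, then drop the slot.
# PSQL defaults to the keeper's local postgres socket; override to dry-run.
PSQL=${PSQL:-"psql -h /tmp -U postgres -Atc"}

drop_stale_slot() {
  slot=$1
  pid=$2
  # pg_drop_replication_slot fails while a wal sender holds the slot
  # ("is active for PID ..."), so terminate that backend first.
  $PSQL "SELECT pg_terminate_backend($pid);"
  $PSQL "SELECT pg_drop_replication_slot('$slot');"
}

# To inspect which PID holds which slot:
#   $PSQL "SELECT slot_name, active_pid FROM pg_replication_slots;"
# For the situation in the node-3 logs above:
#   drop_stale_slot stolon_03d0d281 8205
```

Since the keeper already retries the drop every loop iteration, terminating the wal sender alone may be enough for it to succeed on its own.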

nh2 commented 6 years ago

Here's stolonctl status:

# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
2240a711  true
4f490a5e  false

=== Active proxies ===

ID
9fd707d7
da5fd3f9
f9cea2d0

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    5     5 
node_2 true  10.0.0.2:5432   true    2     2 
node_3 false 10.0.0.3:5432   false   30      29  

=== Cluster Info ===

Master: node_1

===== Keepers/DB tree =====

node_1 (master)
nh2 commented 6 years ago

After running systemctl restart stolon-keeper on node-3, the problem persists; the only difference is that no more errors are shown:

Dec 30 16:03:09 node-3 stolon-keeper[25911]: 2017-12-30T16:03:09.097Z        INFO        keeper/keeper.go:1314        our db requested role is master
Dec 30 16:03:09 node-3 stolon-keeper[25911]: 2017-12-30T16:03:09.097Z        INFO        keeper/keeper.go:1334        already master
Dec 30 16:03:09 node-3 stolon-keeper[25911]: 2017-12-30T16:03:09.099Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Dec 30 16:03:09 node-3 stolon-keeper[25911]: 2017-12-30T16:03:09.099Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Dec 30 16:03:10 node-3 stolon-keeper[25911]: 2017-12-30T16:03:10.113Z        INFO        keeper/keeper.go:1314        our db requested role is master
Dec 30 16:03:10 node-3 stolon-keeper[25911]: 2017-12-30T16:03:10.113Z        INFO        keeper/keeper.go:1334        already master
Dec 30 16:03:10 node-3 stolon-keeper[25911]: 2017-12-30T16:03:10.119Z        INFO        keeper/keeper.go:1489        postgres parameters not changed

so the pq: replication slot \"stolon_03d0d281\" is active for PID 8205 message is gone.

# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
2240a711  true
4f490a5e  false

=== Active proxies ===

ID
9fd707d7
da5fd3f9
f9cea2d0

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    5     5 
node_2 true  10.0.0.2:5432   true    2     2 
node_3 true  10.0.0.3:5432   true    30      30  

=== Cluster Info ===

Master: node_1

===== Keepers/DB tree =====

node_1 (master)
nh2 commented 6 years ago

It also seems that it would be helpful if the waiting for proxies to close connections to old master message could say which proxies we're waiting for.

sgotti commented 6 years ago

@nh2 I'm not sure what the real issue here is. This is a common state during a new election and should converge once the newly elected master is ready. Is this not happening?

I can see various errors talking to consul ({"error": "Unexpected response code: 500 (No cluster leader)"}), so perhaps the sentinel is having problems updating the cluster data.

Anyway, it would be useful if you could also provide longer leader sentinel and keeper logs, since the systemctl output is only a few lines.
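Since systemctl status only prints a short tail per unit, something like the following captures the full log window. This is a sketch: the collect_stolon_logs helper name is made up, and the time range is taken from the timestamps above; adjust as needed.

```shell
# Dump the full log window for all three stolon units, instead of the
# short tail that `systemctl status` shows. Guarded so it is a no-op on
# machines without journald.
collect_stolon_logs() {
  journalctl --no-pager \
    -u stolon-sentinel -u stolon-keeper -u stolon-proxy \
    --since "2017-12-30 15:00:00" --until "2017-12-30 16:30:00"
}

if command -v journalctl >/dev/null 2>&1; then
  collect_stolon_logs > /tmp/stolon-logs.txt
fi
```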

P.S. I'm on vacation, so I won't be able to give fast answers for the next few days.

It also seems that it would be helpful if waiting for proxies to close connections to old master could say for which proxies we're waiting for.

Please file a new issue.

nh2 commented 6 years ago

At that point we have:

# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
184d0de3  false
2240a711  true
4f490a5e  false

=== Active proxies ===

ID
9fd707d7
a7e79bd1
da5fd3f9
f9cea2d0

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    5     5 
node_2 true  10.0.0.2:5432   true    2     2 
node_3 true  10.0.0.3:5432   true    30      30  

=== Cluster Info ===

Master: node_1

===== Keepers/DB tree =====

node_1 (master)

And the keeper on node-2 is still following {"followedDB": "ff0af56a"} (node-3).

The final stolonctl status is

# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
6daad7a9  true
c2611a6b  false
f3fb3a9a  false

=== Active proxies ===

ID
0c2f474e
276e61af
506859ac
9fd707d7

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    5     5 
node_2 true  10.0.0.2:5432   true    2     2 
node_3 true  10.0.0.3:5432   true    30      30  

=== Cluster Info ===

Master: node_1

===== Keepers/DB tree =====

node_1 (master)

The final consul contents are:

stolon/cluster/stolon-cluster/clusterdata:{"FormatVersion":1,"Cluster":{"uid":"eb69e690","generation":1,"changeTime":"2017-08-03T00:17:30.741021113Z","spec":{"sleepInterval":"1s","requestTimeout":"1s","failInterval":"1s","synchronousReplication":true,"minSynchronousStandbys":1,"maxSynchronousStandbys":2,"additionalWalSenders":null,"usePgrewind":true,"initMode":"new","pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null},"status":{"phase":"normal","master":"68afe44d"}},"Keepers":{"node_1":{"uid":"node_1","generation":1,"changeTime":"0001-01-01T00:00:00Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2017-12-30T16:29:41.394209575Z","bootUUID":"e36ba318-a819-4c22-b10a-a6e26899b945"}},"node_2":{"uid":"node_2","generation":1,"changeTime":"0001-01-01T00:00:00Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2017-12-30T16:29:41.394209636Z","bootUUID":"cc26bfb7-8d26-459c-bb2a-ec66f8bad4b1"}},"node_3":{"uid":"node_3","generation":1,"changeTime":"0001-01-01T00:00:00Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2017-12-30T16:29:41.394209393Z","bootUUID":"5a1b4ef2-cd0e-456c-929f-cd7b0d1ae462"}}},"DBs":{"03d0d281":{"uid":"03d0d281","generation":2,"changeTime":"2017-12-30T03:48:31.461779244Z","spec":{"keeperUID":"node_2","requestTimeout":"1s","maxStandbys":20,"synchronousReplication":true,"usePgrewind":true,"additionalWalSenders":5,"initMode":"none","pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"ff0af56a"},"followers":[],"synchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.0.0.2","port":"5432","systemdID":"6449835973375088046","timelineID":41,"xLogPos":4010268400,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target 
specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target 
specified"}],"pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":[],"olderWalFile":"0000002900000000000000EE"}},"68afe44d":{"uid":"68afe44d","generation":5,"changeTime":"2017-12-30T15:38:54.416639501Z","spec":{"keeperUID":"node_1","requestTimeout":"1s","maxStandbys":20,"synchronousReplication":true,"usePgrewind":true,"additionalWalSenders":5,"initMode":"none","pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null,"role":"master","followers":[],"synchronousStandbys":["ff0af56a"]},"status":{"healthy":true,"currentGeneration":5,"listenAddress":"10.0.0.1","port":"5432","systemdID":"6449835973375088046","timelineID":42,"xLogPos":4010270568,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target 
specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target 
specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"},{"timelineID":41,"switchPoint":4010267280,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["ff0af56a"],"olderWalFile":"0000002700000000000000EA"}},"ff0af56a":{"uid":"ff0af56a","generation":30,"changeTime":"2017-12-30T15:38:54.416705191Z","spec":{"keeperUID":"node_3","requestTimeout":"1s","maxStandbys":20,"synchronousReplication":true,"usePgrewind":true,"additionalWalSenders":5,"initMode":"none","pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"pgHBA":null,"role":"master","followers":[],"synchronousStandbys":["03d0d281","68afe44d"]},"status":{"healthy":true,"currentGeneration":30,"listenAddress":"10.0.0.3","port":"5432","systemdID":"6449835973375088046","timelineID":41,"xLogPos":4010269856,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery 
target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["03d0d281","68afe44d"],"olderWalFile":"0000002800000000000000EC"}}},"Proxy":{"generation":294,"changeTime":"2017-12-30T15:38:54.416496127Z","spec":{"enabledProxies":["9fd707d7","da5fd3f9","f9cea2d0"]},"status":{}}}
stolon/cluster/stolon-cluster/keepers/info/node_1:{"infoUID":"abafa232","uid":"node_1","clusterUID":"eb69e690","bootUUID":"e36ba318-a819-4c22-b10a-a6e26899b945","postgresState":{"uid":"68afe44d","generation":5,"listenAddress":"10.0.0.1","port":"5432","healthy":true,"systemID":"6449835973375088046","timelineID":42,"xLogPos":4010270568,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target 
specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"},{"timelineID":41,"switchPoint":4010267280,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["ff0af56a"],"olderWalFile":"0000002700000000000000EA"}}
stolon/cluster/stolon-cluster/keepers/info/node_2:{"infoUID":"90999412","uid":"node_2","clusterUID":"eb69e690","bootUUID":"cc26bfb7-8d26-459c-bb2a-ec66f8bad4b1","postgresState":{"uid":"03d0d281","generation":2,"listenAddress":"10.0.0.2","port":"5432","healthy":true,"systemID":"6449835973375088046","timelineID":41,"xLogPos":4010268400,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target 
specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":[],"olderWalFile":"0000002900000000000000EE"}}
stolon/cluster/stolon-cluster/keepers/info/node_3:{"infoUID":"2734c6e3","uid":"node_3","clusterUID":"eb69e690","bootUUID":"5a1b4ef2-cd0e-456c-929f-cd7b0d1ae462","postgresState":{"uid":"ff0af56a","generation":30,"listenAddress":"10.0.0.3","port":"5432","healthy":true,"systemID":"6449835973375088046","timelineID":41,"xLogPos":4010269856,"timelinesHistory":[{"timelineID":1,"switchPoint":570333488,"reason":"no recovery target specified"},{"timelineID":2,"switchPoint":2158584072,"reason":"no recovery target specified"},{"timelineID":3,"switchPoint":2159491040,"reason":"no recovery target specified"},{"timelineID":4,"switchPoint":2181144752,"reason":"no recovery target specified"},{"timelineID":5,"switchPoint":2184357288,"reason":"no recovery target specified"},{"timelineID":6,"switchPoint":2340848200,"reason":"no recovery target specified"},{"timelineID":7,"switchPoint":2522268872,"reason":"no recovery target specified"},{"timelineID":8,"switchPoint":2598753128,"reason":"no recovery target specified"},{"timelineID":9,"switchPoint":2619678008,"reason":"no recovery target specified"},{"timelineID":10,"switchPoint":2759027784,"reason":"no recovery target specified"},{"timelineID":11,"switchPoint":2806544384,"reason":"no recovery target specified"},{"timelineID":12,"switchPoint":2848483008,"reason":"no recovery target specified"},{"timelineID":13,"switchPoint":2849850512,"reason":"no recovery target specified"},{"timelineID":14,"switchPoint":2904032696,"reason":"no recovery target specified"},{"timelineID":15,"switchPoint":2905672936,"reason":"no recovery target specified"},{"timelineID":16,"switchPoint":2938352976,"reason":"no recovery target specified"},{"timelineID":17,"switchPoint":3243719280,"reason":"no recovery target specified"},{"timelineID":18,"switchPoint":3292207000,"reason":"no recovery target specified"},{"timelineID":19,"switchPoint":3295216024,"reason":"no recovery target specified"},{"timelineID":20,"switchPoint":3321923184,"reason":"no recovery target 
specified"},{"timelineID":21,"switchPoint":3417177656,"reason":"no recovery target specified"},{"timelineID":22,"switchPoint":3418551288,"reason":"no recovery target specified"},{"timelineID":23,"switchPoint":3420902376,"reason":"no recovery target specified"},{"timelineID":24,"switchPoint":3423162768,"reason":"no recovery target specified"},{"timelineID":25,"switchPoint":3424494696,"reason":"no recovery target specified"},{"timelineID":26,"switchPoint":3585421520,"reason":"no recovery target specified"},{"timelineID":27,"switchPoint":3745852328,"reason":"no recovery target specified"},{"timelineID":28,"switchPoint":3746892416,"reason":"no recovery target specified"},{"timelineID":29,"switchPoint":3775009480,"reason":"no recovery target specified"},{"timelineID":30,"switchPoint":3782922816,"reason":"no recovery target specified"},{"timelineID":31,"switchPoint":3784546448,"reason":"no recovery target specified"},{"timelineID":32,"switchPoint":3785559376,"reason":"no recovery target specified"},{"timelineID":33,"switchPoint":3811014624,"reason":"no recovery target specified"},{"timelineID":34,"switchPoint":3845283944,"reason":"no recovery target specified"},{"timelineID":35,"switchPoint":3864548464,"reason":"no recovery target specified"},{"timelineID":36,"switchPoint":3899969952,"reason":"no recovery target specified"},{"timelineID":37,"switchPoint":3909038816,"reason":"no recovery target specified"},{"timelineID":38,"switchPoint":3910550912,"reason":"no recovery target specified"},{"timelineID":39,"switchPoint":3952725112,"reason":"no recovery target specified"},{"timelineID":40,"switchPoint":3981566016,"reason":"no recovery target specified"}],"pgParameters":{"datestyle":"iso, 
mdy","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","lc_messages":"en_US.UTF-8","lc_monetary":"en_US.UTF-8","lc_numeric":"en_US.UTF-8","lc_time":"en_US.UTF-8","log_min_duration_statement":"10","log_timezone":"UTC","logging_collector":"on","max_connections":"100","shared_buffers":"128MB","timezone":"UTC"},"synchronousStandbys":["03d0d281","68afe44d"],"olderWalFile":"0000002800000000000000EC"}}
stolon/cluster/stolon-cluster/proxies/info/0c2f474e:{"UID":"0c2f474e","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/276e61af:{"UID":"276e61af","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/506859ac:{"UID":"506859ac","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/9fd707d7:{"UID":"9fd707d7","Generation":293}
stolon/cluster/stolon-cluster/sentinel-leader:6daad7a9
stolon/cluster/stolon-cluster/sentinels/info/6daad7a9:{"UID":"6daad7a9"}
stolon/cluster/stolon-cluster/sentinels/info/c2611a6b:{"UID":"c2611a6b"}
stolon/cluster/stolon-cluster/sentinels/info/f3fb3a9a:{"UID":"f3fb3a9a"}
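
As a side note, the two-masters inconsistency is visible directly in the clusterdata above: both 68afe44d and ff0af56a carry "role":"master". A small self-contained sketch (not a stolon tool; it uses jq, as later in this thread, on a trimmed-down copy of the dump's structure) that counts them:

```shell
# Count how many DBs claim the master role in the clusterdata; anything
# above 1 means the stored cluster state is inconsistent.  The heredoc is
# a trimmed-down copy of the dump's structure, so this runs standalone:
masters=$(jq -r '[.DBs[].spec.role | select(. == "master")] | length' <<'EOF'
{"DBs": {
  "03d0d281": {"spec": {"keeperUID": "node_2", "role": "standby"}},
  "68afe44d": {"spec": {"keeperUID": "node_1", "role": "master"}},
  "ff0af56a": {"spec": {"keeperUID": "node_3", "role": "master"}}
}}
EOF
)
echo "DBs with role=master: $masters"
# → DBs with role=master: 2
```

Against a live cluster the same filter can be piped from consul kv get stolon/cluster/stolon-cluster/clusterdata instead of the heredoc.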

I have now restarted all services on all machines and am still stuck.

nh2 commented 6 years ago

I can see various errors talking to consul, e.g. {"error": "Unexpected response code: 500 (No cluster leader)"}, so perhaps the sentinel has problems updating the cluster data.

@sgotti This leader loss was ~30 minutes before the other log outputs though, since then the sentinel has made no further output.
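
For reference, a quick way to check whether consul currently has a raft leader (standard consul CLI, assuming a reachable local agent; these need a running consul cluster, so they are only a sketch here):

```shell
# Show the raft peer set and which server is currently the leader:
consul operator raft list-peers

# On a consul server, the agent's self-report includes "leader = true"
# and the current leader address:
consul info | grep -E 'leader'
```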

nh2 commented 6 years ago

I'm on vacation so I won't be able to give fast answers for the next few days.

OK!

nh2 commented 6 years ago

Anyway, it'll be useful if you can also provide longer leader sentinel and keeper logs, since the systemctl one is only a few lines.

@sgotti OK, here they are:

node-1-keeper-log.txt node-2-keeper-log.txt node-3-keeper-log.txt

node-1-sentinel-log.txt node-2-sentinel-log.txt node-3-sentinel-log.txt

There were some consul failures before, but with those recovered we should still expect stolon to converge, is that right?

nh2 commented 6 years ago

With restarts not helping, I'm now trying to manually fix the clusterdata to get rid of the 2-masters situation.

I first tried this:

# consul kv get stolon/cluster/stolon-cluster/clusterdata > clusterdata.json
# cp clusterdata.json clusterdata-modified.json
# diff -u <(jq . clusterdata.json) <(jq . clusterdata-modified.json)
--- /dev/fd/63  2017-12-30 16:46:06.640387691 +0000
+++ /dev/fd/62  2017-12-30 16:46:06.645387750 +0000
@@ -365,11 +365,9 @@
           "timezone": "UTC"
         },
         "pgHBA": null,
-        "role": "master",
+        "role": "standby",
         "followers": [],
-        "synchronousStandbys": [
-          "ff0af56a"
-        ]
+        "synchronousStandbys": []
       },
       "status": {
         "healthy": true,
# consul kv put stolon/cluster/stolon-cluster/clusterdata @clusterdata-modified.json

This apparently causes the stolon-sentinel on node-3 to crash with panic: runtime error: invalid memory address or nil pointer dereference:

Dec 30 16:44:18 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:18.799Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:44:20 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:20.774Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:44:23 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:23.210Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:44:25 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:25.184Z        ERROR        sentinel/sentinel.go:287        no keeper info available        {"db": "68afe44d", "keeper": "node_1"}
Dec 30 16:44:25 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:25.186Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:28.084Z        INFO        sentinel/sentinel.go:953        master db is failed        {"db": "68afe44d", "keeper": "node_1"}
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: 2017-12-30T16:44:28.084Z        INFO        sentinel/sentinel.go:964        trying to find a new master to replace failed master
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x8eb8de]
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: goroutine 6123 [running]:
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: main.(*Sentinel).dbType(0xc4201a8210, 0xc4202401b0, 0xc420312810, 0x8, 0x2)
Dec 30 16:44:28 node-3 stolon-sentinel[10643]:         /tmp/nix-build-stolon-0.7.0.drv-0/stolon-v0.7.0-src/gopath/src/github.com/sorintlab/stolon/cmd/sentinel/sentinel.go:476 +0xce
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: main.(*Sentinel).validStandbysByStatus(0xc4201a8210, 0xc4202401b0, 0x400, 0x75, 0x0)
Dec 30 16:44:28 node-3 stolon-sentinel[10643]:         /tmp/nix-build-stolon-0.7.0.drv-0/stolon-v0.7.0-src/gopath/src/github.com/sorintlab/stolon/cmd/sentinel/sentinel.go:652 +0x1b1
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: main.(*Sentinel).findBestStandbys(0xc4201a8210, 0xc4202401b0, 0xc42016c750, 0x614626, 0xc4200a93f0, 0x0)
Dec 30 16:44:28 node-3 stolon-sentinel[10643]:         /tmp/nix-build-stolon-0.7.0.drv-0/stolon-v0.7.0-src/gopath/src/github.com/sorintlab/stolon/cmd/sentinel/sentinel.go:676 +0x50
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: main.(*Sentinel).findBestNewMasters(0xc4201a8210, 0xc4202401b0, 0xc42016c750, 0x0, 0x0, 0x0)
Dec 30 16:44:28 node-3 stolon-sentinel[10643]:         /tmp/nix-build-stolon-0.7.0.drv-0/stolon-v0.7.0-src/gopath/src/github.com/sorintlab/stolon/cmd/sentinel/sentinel.go:700 +0x5d
Dec 30 16:44:28 node-3 stolon-sentinel[10643]: main.(*Sentinel).updateCluster(0xc4201a8210, 0xc42055b560, 0xc420587dc0, 0x4, 0x4, 0x1, 0x28, 0x9e8580)
Dec 30 16:44:28 node-3 systemd[1]: stolon-sentinel.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 30 16:44:28 node-3 systemd[1]: stolon-sentinel.service: Unit entered failed state.
Dec 30 16:44:28 node-3 systemd[1]: stolon-sentinel.service: Failed with result 'exit-code'.
Dec 30 16:44:28 node-3 systemd[1]: stolon-sentinel.service: Service hold-off time over, scheduling restart.
Dec 30 16:44:28 node-3 systemd[1]: Stopped stolon-sentinel.service.
Dec 30 16:44:28 node-3 systemd[1]: Started stolon-sentinel.service.
Dec 30 16:44:28 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:28.937Z        INFO        sentinel/sentinel.go:1724        sentinel uid        {"uid": "809baed5"}
Dec 30 16:44:28 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:28.937Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 16:44:45 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:45.880Z        INFO        sentinel/sentinel.go:114        sentinel leadership acquired
Dec 30 16:44:45 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:45.924Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:44:47 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:47.665Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:49.404Z        INFO        sentinel/sentinel.go:953        master db is failed        {"db": "68afe44d", "keeper": "node_1"}
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:49.405Z        INFO        sentinel/sentinel.go:964        trying to find a new master to replace failed master
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: 2017-12-30T16:44:49.405Z        INFO        sentinel/sentinel.go:408        followed instance timeline forked before our current state        {"followedTimeline": 42, "followedXlogpos": 4010267280, "timeline": 41, "xlogpos": 4010268400}
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x8eb8de]
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: goroutine 104 [running]:
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: main.(*Sentinel).dbType(0xc4201a6210, 0xc420142930, 0xc4205047a0, 0x8, 0x2)
Dec 30 16:44:49 node-3 stolon-sentinel[28212]:         /tmp/nix-build-stolon-0.7.0.drv-0/stolon-v0.7.0-src/gopath/src/github.com/sorintlab/stolon/cmd/sentinel/sentinel.go:476 +0xce
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: main.(*Sentinel).validStandbysByStatus(0xc4201a6210, 0xc420142930, 0x400, 0x75, 0x0)
Dec 30 16:44:49 node-3 stolon-sentinel[28212]:         /tmp/nix-build-stolon-0.7.0.drv-0/stolon-v0.7.0-src/gopath/src/github.com/sorintlab/stolon/cmd/sentinel/sentinel.go:652 +0x1b1
Dec 30 16:44:49 node-3 stolon-sentinel[28212]: main.(*Sentinel).findBestStandbys(0xc4201a6210, 0xc420142930, 0xc42025e1a0, 0x614626, 0xc4201a6000, 0x0)
Dec 30 16:44:49 node-3 systemd[1]: stolon-sentinel.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 30 16:44:49 node-3 systemd[1]: stolon-sentinel.service: Unit entered failed state.
Dec 30 16:44:49 node-3 systemd[1]: stolon-sentinel.service: Failed with result 'exit-code'.
Dec 30 16:44:49 node-3 systemd[1]: stolon-sentinel.service: Service hold-off time over, scheduling restart.
Dec 30 16:44:49 node-3 systemd[1]: Stopped stolon-sentinel.service.
Dec 30 16:44:50 node-3 systemd[1]: Started stolon-sentinel.service.
nh2 commented 6 years ago

Tried now with:

# diff -u <(jq . clusterdata.json) <(jq . clusterdata-modified.json)
--- /dev/fd/63  2017-12-30 16:55:25.309068684 +0000
+++ /dev/fd/62  2017-12-30 16:55:25.313068732 +0000
@@ -365,11 +365,13 @@
           "timezone": "UTC"
         },
         "pgHBA": null,
-        "role": "master",
+        "role": "standby",
+        "followConfig": {
+          "type": "internal",
+          "dbuid": "ff0af56a"
+        },
         "followers": [],
-        "synchronousStandbys": [
-          "ff0af56a"
-        ]
+        "synchronousStandbys": null
       },
       "status": {
         "healthy": true,

[root@node-3:~]# systemctl restart stolon-keeper

But node-3 logs still say

Dec 30 16:57:23 node-3 stolon-keeper[6016]: 2017-12-30T16:57:23.976Z        INFO        keeper/keeper.go:1314        our db requested role is master
Dec 30 16:57:23 node-3 stolon-keeper[6016]: 2017-12-30T16:57:23.976Z        INFO        keeper/keeper.go:1334        already master

even though the clusterdata says there's only one master now, so I'm not sure why it reports our db requested role is master.

sgotti commented 6 years ago

@nh2 changing the cluster data by hand is a bad idea (also because it wasn't wrong). It'd be better if you reinitialize the cluster with initmode existing.

I think that the issue is the sentinel waiting for proxies but the proxy key has a TTL and looks like it isn't expiring (consul issue?). You could try removing the proxy keys by hand.

I'll take a look at the logs in the next days.

nh2 commented 6 years ago

It'd be better if you reinitialize the cluster with initmode existing.

I've wiped the clusterdata now and recreated the DB contents from a backup I took just before that (luckily my DB in this case is small enough that I can restore it very quickly).

sgotti commented 6 years ago

@nh2 I took a look at the logs. The main events are these:

Dec 30 15:38:54 node-1 stolon-sentinel[1338]: 2017-12-30T15:38:54.415Z        INFO        sentinel/sentinel.go:953        master db is failed        {"db": "ff0af56a", "keeper": "node_3"}
Dec 30 15:38:54 node-1 stolon-sentinel[1338]: 2017-12-30T15:38:54.415Z        INFO        sentinel/sentinel.go:964        trying to find a new master to replace failed master
Dec 30 15:38:54 node-1 stolon-sentinel[1338]: 2017-12-30T15:38:54.416Z        INFO        sentinel/sentinel.go:995        electing db as the new master        {"db": "68afe44d", "keeper": "node_1"}

here the node_1 sentinel (the leader sentinel) didn't receive updates from the node_3 keeper and elected the node_1 keeper as the new master keeper.

Then it started waiting for the proxies to close connections. This continued even after you restarted all the sentinels:

Dec 30 16:10:53 node-3 systemd[1]: Stopping stolon-sentinel.service...
Dec 30 16:10:53 node-3 systemd[1]: Stopped stolon-sentinel.service.
Dec 30 16:10:55 node-3 systemd[1]: Started stolon-sentinel.service.
Dec 30 16:10:55 node-3 stolon-sentinel[32181]: 2017-12-30T16:10:55.011Z        INFO        sentinel/sentinel.go:1724        sentinel uid        {"uid": "184d0de3"}
Dec 30 16:10:55 node-3 stolon-sentinel[32181]: 2017-12-30T16:10:55.011Z        INFO        sentinel/sentinel.go:107        Trying to acquire sentinels leadership
Dec 30 16:22:04 node-3 stolon-sentinel[32181]: 2017-12-30T16:22:04.373Z        INFO        sentinel/sentinel.go:114        sentinel leadership acquired
Dec 30 16:22:05 node-3 stolon-sentinel[32181]: 2017-12-30T16:22:05.295Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:22:08 node-3 stolon-sentinel[32181]: 2017-12-30T16:22:08.452Z        ERROR        sentinel/sentinel.go:293        no db state available        {"db": "03d0d281"}
Dec 30 16:22:08 node-3 stolon-sentinel[32181]: 2017-12-30T16:22:08.461Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:22:09 node-3 stolon-sentinel[32181]: 2017-12-30T16:22:09.956Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master
Dec 30 16:22:11 node-3 stolon-sentinel[32181]: 2017-12-30T16:22:11.656Z        INFO        sentinel/sentinel.go:1058        waiting for proxies to close connections to old master

the new sentinel leader on node3 continued waiting for proxies to close connections.

main problem

Here, I think, is the main issue: the proxies should read from the store and update their state; this state key has a TTL, so if it's not updated it should expire after 30 seconds. The sentinel waits for the updated state, or for the key to disappear, before continuing to the next state (to be sure that no clients are still connected to the old master).

I don't have the proxies' logs, but it looks like for some reason they weren't updating their state while the key stayed alive (do you still have the proxies' logs?). Since we rely on the store to respect the TTL, it looks like something wasn't working correctly inside consul. I know that etcd renews its leases on a leader change (in the etcd v3 API); perhaps consul was behaving in a similar way, continuously electing a new leader (due to the heavy consul load) and so always renewing the keys' TTLs without ever letting them expire?
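The wait-until-converged-or-expired logic described here can be sketched roughly like this. This is a minimal Python simulation, not stolon's actual sentinel code; the function names, the polling interval, and the timeout are all hypothetical:

```python
import time

def all_proxies_converged(proxy_infos, current_generation):
    """The sentinel can move on only when every live proxy key reports
    the current proxy generation; expired keys simply vanish from the
    store and so drop out of proxy_infos."""
    return all(info["Generation"] >= current_generation
               for info in proxy_infos.values())

def wait_for_proxies(get_proxy_infos, current_generation,
                     poll_interval=1.0, timeout=60.0):
    """Poll the store until all proxy keys converge or vanish.
    If a stale key is never expired by the store (the consul behavior
    suspected in this issue), the real sentinel waits forever -- the
    timeout exists only to bound this sketch."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all_proxies_converged(get_proxy_infos(), current_generation):
            return True
        time.sleep(poll_interval)
    return False
```

With one proxy stuck at generation 293 while the cluster is at 294, this loop never returns True, which matches the repeated "waiting for proxies to close connections to old master" log lines above.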

workaround

Assuming there's no bug in the proxies' state updates, the problem could lie in your current heavy load plus consul's behavior (keys not expiring).

A workaround could have been to manually delete the proxy keys from consul and see if the sentinel went ahead.

current cluster data

About your cluster data and the issue title "Two nodes think they are master": this is a normal state after a new master election. The old master is kept as primary so that, if the newly elected master fails to promote, the previous one can be re-elected when it comes back. This state is the correct one because the sentinel, due to the problem above, was waiting for the proxies' keys to be updated or to expire.


But node-3 logs still say

Dec 30 16:57:23 node-3 stolon-keeper[6016]: 2017-12-30T16:57:23.976Z INFO keeper/keeper.go:1314 our db requested role is master

even though the clusterdata says that there's only 1 master now so I'm not sure why it says our db requested role is master.

This is symptomatic of something strange. If you are sure that the clusterdata in consul says that node3 should be a standby, then a possible explanation is that the consul agent on the node is returning stale data even though we are asking for consistent, non-stale reads: https://github.com/docker/libkv/blob/master/store/consul/consul.go#L172-L175

nh2 commented 6 years ago

@sgotti Thanks for your detailed analysis.

consul

It seems likely to me that the issue is as you describe, with Consul not expiring the TTL.

I've found a couple of issues with Consul not expiring TTLs as you'd expect in the past, such as https://github.com/hashicorp/consul/issues/3569.

The non-expiring-TTL-due-to-Consul-leader-switch scenario under high load also seems likely to me. As mentioned in Consul's docs on sessions:

The contract of a TTL is that it represents a lower bound for invalidation; that is, Consul will not expire the session before the TTL is reached, but it is allowed to delay the expiration past the TTL. The TTL is renewed on session creation, on session renew, and on leader failover.

And as mentioned in the Consul PR that implemented TTLs:

Consul will take the conservative approach of internally multiplying the TTL by 2x
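Putting the two quotes together: with the TTL as a lower bound and an internal ~2x multiplier, a session may legitimately survive well past its nominal TTL. A toy illustration (the multiplier is the value from the PR quote; real expiry can be later still, e.g. across leader failovers, which renew the TTL):

```python
def consul_expiry_window(ttl_seconds, multiplier=2):
    """Return the (earliest, nominal latest) expiry times for a Consul
    session TTL: Consul never invalidates before ttl_seconds, and its
    conservative internal 2x multiplier means up to twice that under
    normal operation."""
    return (ttl_seconds, ttl_seconds * multiplier)
```

So even a healthy consul could keep a 30-second proxy key alive for up to a minute; the behavior in this issue goes beyond that, with keys not expiring at all.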

logs

do you still have the proxies' logs?

node-1-proxy-log.txt.gz node-2-proxy-log.txt.gz node-3-proxy-log.txt.gz

Please let me know if the time ranges for these logs are good, otherwise I can extend them.

two masters

About your cluster data and the issue title Two nodes think they are master this is a normal state after a new master election

Ah, I see. You correctly identified my assumption that stolon would only ever indicate one node as master at any given time. Your explanation makes sense to me.

Can you explain a bit more, though: even though there can be times when 2 nodes are indicated as postgres replication masters, Stolon will still ensure that the proxies only allow clients to connect to one of them, to ensure consistency. Is that right?

In that case, where in the clusterdata do I have to look to see which master that is?

clusterdata

This is symptomatic of something strange, if you are sure that the clusterdata in consul says that node3 should be a standby

What do you mean by "if you are sure"? I pasted some of the consul kv contents, which should show exactly what the clusterdata is -- or do you mean clusterdata from a time that I have not posted?

Regarding stale data, I have run consul monitor -log-level=debug for some time, and | sort | uniq on the output. I see that stolon makes these requests to consul:

GET /v1/kv/stolon/cluster/stolon-cluster/clusterdata?consistent=&wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/keepers/info/node_3?consistent=&index=139912068&wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/keepers/info/node_3?wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/keepers/info?recurse=&wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/proxies/info/67c20067?wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/proxies/info?recurse=&wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/sentinel-leader?consistent=&index=138483838&wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/sentinels/info/d464999c?consistent=&index=139912070&wait=10000ms
GET /v1/kv/stolon/cluster/stolon-cluster/sentinels/info/d464999c?wait=10000ms

As we can see, not all invocations run in consistent mode. I don't know if that's OK or not. Do you think it is OK for the requests without consistent to be that way?

As per the consul docs, the requests that don't set consistent use default mode:

default - If not specified, the default is strongly consistent in almost all cases. However, there is a small window in which a new leader may be elected during which the old leader may service stale values. The trade-off is fast reads but potentially stale values. The condition resulting in stale reads is hard to trigger, and most clients should not need to worry about this case. Also, note that this race condition only applies to reads, not writes.

I don't buy the "hard to trigger" and "most clients should not need to worry" parts given my own uses of Consul, especially given how easily disk seek contention can force a leader election, so it seems we should treat default the same way as stale.
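For illustration, here is a minimal standard-library sketch of how a client could build a consistent-mode KV read URL like the ones logged above. The helper name and base address are assumptions for this sketch; actually issuing the request of course requires a running agent:

```python
from urllib.parse import urlencode

def consul_kv_url(key, consistent=True, wait_ms=10000,
                  base="http://localhost:8500"):
    """Build a Consul KV read URL. The mere presence of the 'consistent'
    query parameter forwards the read to the leader and verifies
    leadership, avoiding the stale window that 'default' mode allows
    during a leader election."""
    params = {"wait": f"{wait_ms}ms"}
    if consistent:
        params["consistent"] = ""  # flag parameter: presence enables the mode
    return f"{base}/v1/kv/{key}?{urlencode(params)}"
```

A consistent read then looks like `GET /v1/kv/stolon/cluster/stolon-cluster/clusterdata?wait=10000ms&consistent=`, matching the clusterdata request in the monitor output above, while the proxies/info reads shown there lack the parameter.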

sgotti commented 6 years ago

Please let me know if the time ranges for these logs are good, otherwise I can extend them.

Dec 30 15:39:19 node-1 stolon-proxy[1335]: 2017-12-30T15:39:19.142Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}
Dec 30 15:39:24 node-1 stolon-proxy[1335]: 2017-12-30T15:39:24.579Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}

[...]

Dec 30 16:21:43 node-1 stolon-proxy[26918]: 2017-12-30T16:21:43.776Z        INFO        proxy/proxy.go:241        no db object available, closing connections to master        {"db": ""}

From the logs it looks like they read that they shouldn't proxy to any node, and they should have updated their status accordingly. The problem is that one of them appears to have a state in consul that is one generation behind (293 instead of 294), and it's probably the one the sentinel is waiting for:

stolon/cluster/stolon-cluster/proxies/info/9fd707d7:{"UID":"9fd707d7","Generation":293}
stolon/cluster/stolon-cluster/proxies/info/da5fd3f9:{"UID":"da5fd3f9","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/f9cea2d0:{"UID":"f9cea2d0","Generation":294}

But from the proxies' logs it looks like they're correctly updating their state (no errors). So either they are reading an old proxy generation from the cluster data (stale data) or the key is coming from a fourth proxy (unfortunately the logs don't report the proxy uid, which is generated at every process start).

Another strange thing is that even after you restarted all the proxies, it's still there:

stolon/cluster/stolon-cluster/proxies/info/0c2f474e:{"UID":"0c2f474e","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/276e61af:{"UID":"276e61af","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/506859ac:{"UID":"506859ac","Generation":294}
stolon/cluster/stolon-cluster/proxies/info/9fd707d7:{"UID":"9fd707d7","Generation":293}

looks like consul isn't expiring it (or it's a fourth proxy).

Can you explain a bit more though: Even though there can be times where 2 nodes are indicated as postgres replication masters, Stolon will still ensure that proxies allow only one of them to be connected to by clients to ensure consistency, is that right?

Right.

In that case, where in the clusterdata do I have to look to see which master that is?

In the cluster object status:

{                                                                                      
  "FormatVersion": 1,                                                                  
  "Cluster": {                                                                         
    "uid": "eb69e690",                                                                  
[...]
    "status": {                                                                        
      "phase": "normal",                                                               
      "master": "68afe44d"   <<< Current master                       
    }                                                                                  
  },                                                                                   
  "Keepers": {  
[...]

What do you mean by "if you are sure"?

I mean if you are sure that the modified cluster data was correctly written to consul and not lost for whatever reason.

As we can see, not all invocations run in consistent mode. I don't know if that's OK or not. Do you think it is OK for those requests that don't have consistent to be that way?

Good catch, that's definitely a bug in libkv: the List method doesn't set consistent like the Get method does. The List method is only used to fetch keeper/proxy/sentinel state in stolonctl and from the sentinel. Fortunately it's not used to get the cluster data. But it has to be fixed. Given the libkv project's status I'll open a PR, but I'm not sure if/when it'll be merged (if it isn't, I'll have to use a fork to update stolon's vendored dependencies).

nh2 commented 6 years ago

looks like consul isn't expiring it (or it's a fourth proxy).

This I remember from the time: even though I used only 3 machines with stolon-proxy managed by systemd (so no machine should have more than 1 stolon-proxy running at any given time), at some point after a restart a fourth proxy entry appeared in the consul KV store.

The 4th proxy already appears in my consul dump from https://github.com/sorintlab/stolon/issues/397#issuecomment-354554901 but not in the one further up in the issue description.

Should the 4th proxy not have appeared? What method does stolon use to identify proxies across restarts? Maybe this, too, can be attributed to consul keeping some keys around for a very long time.

sgotti commented 6 years ago

Should the 4th proxy not have appeared?

The 4th proxy was only a guess, but since you say you don't have other proxies and it disappeared later, it was probably a key that took more time to expire. Though I don't understand why it was reporting a proxy generation of 293 instead of 294.

What method does stolon use to identify proxies across restarts? Maybe this, too, can be attributed to consul keeping some keys for very long.

Every time a proxy process starts it generates a unique id, so at every start it's effectively a new proxy. Keys of old proxies in the store should then expire after the TTL (we could also improve the proxy to remove its key at shutdown).
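That lifecycle (fresh UID per process start, plus the suggested improvement of deleting the key at clean shutdown) could be sketched against an in-memory stand-in for the store like this. All names here are hypothetical, and a real proxy would additionally keep refreshing a TTL-bound session:

```python
import uuid

class ProxyRegistration:
    """Simulates a proxy's store key: a fresh UID at every process
    start, a key written under proxies/info/, and an explicit delete
    at clean shutdown so old keys don't linger for a full TTL."""

    def __init__(self, store):
        self.store = store
        self.uid = uuid.uuid4().hex[:8]  # new identity on every start

    def start(self, generation):
        # register this proxy's state under its per-start UID
        self.store[f"proxies/info/{self.uid}"] = {
            "UID": self.uid, "Generation": generation,
        }

    def shutdown(self):
        # the suggested improvement: remove the key instead of
        # relying on the store's TTL expiry
        self.store.pop(f"proxies/info/{self.uid}", None)
```

Without the explicit shutdown delete, every restart leaves a key behind that only the store's TTL can clean up, which is exactly why a "fourth proxy" entry can show up when that expiry misbehaves.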

nh2 commented 6 years ago

@sgotti Right now I had the same problem again (interestingly roughly 1 month after last time).

This time I found the issue much faster. Like last time, consul kv put test hellooo just hangs, so consul times out any PUT request, without any recent errors in the consul logs.

As a result, stolonctl status just shows vastly outdated data.

Restarting consul on all nodes immediately fixes the problem.


I'm thinking it would be very useful if stolonctl status could itself give the user some indication of how recent the printed data is, and maybe even print a warning when the data Stolon stores in Consul has a timestamp that is much older than the machine's current time.

This should make it much easier to debug these issues, and it would make it obvious when it's not Stolon's fault at all that things aren't working. With stolonctl status showing outdated info and thus claiming that everything is working fine when in fact it isn't (e.g. showing proxies as working when they are in fact not accepting connections), it is easy to think that Stolon has a bug here.
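For illustration, such a staleness warning could be as simple as comparing a stored timestamp against the local clock. This is a hedged sketch of the idea, not anything stolonctl actually does; the function name and the 5-minute threshold are arbitrary:

```python
from datetime import datetime, timedelta, timezone

def staleness_warning(last_update, now=None, threshold=timedelta(minutes=5)):
    """Return a warning string if the stored data is older than the
    threshold, else None. A status command could print this alongside
    its normal output so outdated store data is obvious at a glance."""
    now = now or datetime.now(timezone.utc)
    age = now - last_update
    if age > threshold:
        return (f"WARNING: store data is {int(age.total_seconds())}s old; "
                f"the store may be unhealthy")
    return None
```

The timestamps to compare against are already in the clusterdata (e.g. the lastHealthyTime fields shown in the dump at the top of this issue).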

What do you think about this idea?

nh2 commented 6 years ago

I just got into this issue again with consul hanging.

This time I also learned that consul info, and the endpoint it uses, curl http://localhost:8500/v1/agent/self, are both hanging indefinitely.

Judging from that, it might be this consul issue: https://github.com/hashicorp/consul/issues/3700

nh2 commented 6 years ago

I tried to obtain debug goroutine stacktraces from consul as done in https://github.com/hashicorp/consul/issues/3700, but I think I didn't manage to do so in a comparable way because enable_debug was off (the default) in my consul config.

nh2 commented 6 years ago

After restarting consul and stolon this time, the stolonctl status output went strange again, with zombie entries like we observed before.

After restart of consul, with the consul DB writable again, this was the output:

[root@node-1:~]# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
15865f93  false
66eeb7d9  true
d0e95f7b  false

=== Active proxies ===

ID
655e44b3
7484c5dc
d28a3b6d

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    2     2 
node_2 true  10.0.0.2:5432   true    4     4 
node_3 true  10.0.0.3:5432   true    23      23  

=== Cluster Info ===

Master: node_2

===== Keepers/DB tree =====

node_2 (master)

Both node_2 and node_3 had "already master" in their logs.

At that point, stolon-sentinel and stolon-proxy were dead on node 1.

After systemctl start stolon-sentinel stolon-proxy on node 1, it changed to:

[root@node-1:~]# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
15865f93  false
58f9824f  false
66eeb7d9  true
d0e95f7b  false

=== Active proxies ===

ID
4d49ce9c
655e44b3
7484c5dc
d28a3b6d

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    2     2 
node_2 true  10.0.0.2:5432   true    4     4 
node_3 true  10.0.0.3:5432   true    23      23  

=== Cluster Info ===

Master: node_2

===== Keepers/DB tree =====

node_2 (master)

So here we have the situation again where consul is likely not expiring keys: there are 4 entries for sentinels and proxies even though my cluster has only 3 nodes, and no node can run more than one sentinel/proxy/keeper process, since those are managed by systemd.

nh2 commented 6 years ago

Even after a reboot of all machines involved, stolon did not recover from only node_2 (master) being listed in the tree.

sgotti commented 6 years ago

@nh2 If the sentinel is waiting forever for the proxies to close connections, then it's consul not expiring the key of an old proxy. You should gather some debug info (if any) and then just remove the key to make your cluster work again.

BTW, in current master, #427 changed the logic to not rely on the store TTL.

nh2 commented 6 years ago

It'd be better if you reinitialize the cluster with initmode existing.

@sgotti I'm trying to fix the cluster this way this time. I'm using

stolonctl init --cluster-name stolon-cluster --store-backend=consul '{ "initMode": "existing", "existingConfig": { "keeperUID": "node_2" } }'

and it did something, but nodes 1 and 3 did not rejoin the cluster:

[root@node-1:~]# stolonctl status --cluster-name stolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
02f0af0a  true
0c940ed9  false
92f15801  false
d0e95f7b  false

=== Active proxies ===

ID
0a473d70
0fcd3383
432b5083
d28a3b6d

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  (no db assigned)              
node_2 true  10.0.0.2:5432   true    2     2 
node_3 true  (no db assigned)              

=== Cluster Info ===

Master: node_2

===== Keepers/DB tree =====

node_2 (master)

After this, nodes 1 and 3 report no db assigned in the logs, like:

Feb 13 13:58:33 node-1 stolon-keeper[1024]: 2018-02-13T13:58:33.452Z        INFO        keeper/keeper.go:1367        already standby
Feb 13 13:58:33 node-1 stolon-keeper[1024]: 2018-02-13T13:58:33.453Z        INFO        keeper/keeper.go:1489        postgres parameters not changed
Feb 13 13:58:33 node-1 stolon-keeper[1024]: 2018-02-13T13:58:33.453Z        INFO        keeper/keeper.go:1498        postgres hba entries not changed
Feb 13 13:58:35 node-1 stolon-keeper[1024]: 2018-02-13T13:58:35.631Z        INFO        keeper/keeper.go:870        our keeper data is not available, waiting for it to appear
Feb 13 13:58:40 node-1 stolon-keeper[1024]: 2018-02-13T13:58:40.635Z        INFO        keeper/keeper.go:870        our keeper data is not available, waiting for it to appear
Feb 13 13:58:45 node-1 stolon-keeper[1024]: 2018-02-13T13:58:45.638Z        INFO        keeper/keeper.go:876        no db assigned
Feb 13 13:58:50 node-1 stolon-keeper[1024]: 2018-02-13T13:58:50.639Z        INFO        keeper/keeper.go:876        no db assigned
[snip]
Feb 13 14:01:15 node-1 stolon-keeper[1024]: 2018-02-13T14:01:15.814Z        INFO        keeper/keeper.go:876        no db assigned
Feb 13 14:01:20 node-1 stolon-keeper[1024]: 2018-02-13T14:01:20.850Z        INFO        keeper/keeper.go:876        no db assigned
Feb 13 14:01:25 node-1 stolon-keeper[1024]: 2018-02-13T14:01:25.852Z        INFO        keeper/keeper.go:876        no db assigned
Feb 13 14:01:28 node-1 systemd[1]: Stopping stolon-keeper.service...
Feb 13 14:01:28 node-1 stolon-keeper[1024]: 2018-02-13T14:01:28.262Z        INFO        postgresql/postgresql.go:264        stopping database
Feb 13 14:01:29 node-1 stolon-keeper[1024]: waiting for server to shut down.... done
Feb 13 14:01:29 node-1 stolon-keeper[1024]: server stopped
Feb 13 14:01:29 node-1 systemd[1]: Stopped stolon-keeper.service.
Feb 13 14:01:29 node-1 systemd[1]: Started stolon-keeper.service.
Feb 13 14:01:29 node-1 stolon-keeper[7185]: warning: superuser name and replication user name are the same. Different users are suggested.
Feb 13 14:01:29 node-1 stolon-keeper[7185]: 2018-02-13T14:01:29.613Z        INFO        keeper/keeper.go:1672        exclusive lock on data dir taken
Feb 13 14:01:29 node-1 stolon-keeper[7185]: 2018-02-13T14:01:29.613Z        INFO        keeper/keeper.go:442        keeper uid        {"uid": "node_1"}
Feb 13 14:01:29 node-1 stolon-keeper[7185]: 2018-02-13T14:01:29.659Z        INFO        postgresql/postgresql.go:264        stopping database
Feb 13 14:01:29 node-1 stolon-keeper[7185]: pg_ctl: PID file "/var/lib/postgres/node_1/postgres/postmaster.pid" does not exist
Feb 13 14:01:29 node-1 stolon-keeper[7185]: Is server running?
Feb 13 14:01:29 node-1 stolon-keeper[7185]: 2018-02-13T14:01:29.660Z        ERROR        keeper/keeper.go:560        cannot get configured pg parameters        {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Feb 13 14:01:29 node-1 stolon-keeper[7185]: 2018-02-13T14:01:29.779Z        INFO        keeper/keeper.go:876        no db assigned
Feb 13 14:01:32 node-1 stolon-keeper[7185]: 2018-02-13T14:01:32.161Z        ERROR        keeper/keeper.go:560        cannot get configured pg parameters        {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Feb 13 14:01:34 node-1 stolon-keeper[7185]: 2018-02-13T14:01:34.661Z        ERROR        keeper/keeper.go:560        cannot get configured pg parameters        {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
Feb 13 14:01:34 node-1 stolon-keeper[7185]: 2018-02-13T14:01:34.781Z        INFO        keeper/keeper.go:876        no db assigned

Did I use existing correctly? What is happening in this case?

sgotti commented 6 years ago

@nh2

Why didn't you just remove the proxy key that consul failed to expire?

I need the sentinel logs to see what's happening.

PS gitter is probably better for support like this...

nh2 commented 6 years ago

OK, with support from @sgotti on Gitter, I have this time recovered from the issue without having to wipe the DB:

After running the stolonctl init with existing above, I only had to remove the stale proxy key:

$ consul kv get -recurse | grep proxies
stolon/cluster/stolon-cluster/proxies/info/0a473d70:{"UID":"0a473d70","Generation":0}
stolon/cluster/stolon-cluster/proxies/info/0fcd3383:{"UID":"0fcd3383","Generation":0}
stolon/cluster/stolon-cluster/proxies/info/432b5083:{"UID":"432b5083","Generation":0}
stolon/cluster/stolon-cluster/proxies/info/d28a3b6d:{"UID":"d28a3b6d","Generation":140}

The stale proxy key is the last entry; I also confirmed this by checking the UIDs against journalctl -u stolon-proxy -e | grep 'proxy uid' on each node.

With one

consul kv delete stolon/cluster/stolon-cluster/proxies/info/d28a3b6d

the system recovered to

===== Keepers/DB tree =====

node_2 (master)
├─node_3
└─node_1
sgotti commented 6 years ago

@nh2 I changed the issue title.

Since this issue looks related to consul's bad behavior, I'd like to close it.

With #427, for other reasons, we no longer rely only on the key TTL; the sentinel does its own internal timing to remove stale proxies.

nh2 commented 6 years ago

@sgotti That sounds fair to me.

Do you think it would make sense to adjust some of the log messages though to make them more explicit?

E.g. to expand or replace "already master" with something that distinguishes it from what the actual master is (e.g. (master) in the Keepers/DB tree), so that users like me are less easily confused and don't incorrectly assume that two nodes logging "already master" is a buggy state.

sgotti commented 6 years ago

@nh2 I'm open to any improvement, please open new issues or PRs for every proposed change. Closing.