Cluster down after all k8s nodes rebooted

What happened: I have a new bare-metal k8s cluster on my desk. Due to a power failure it rebooted twice. I didn't check the stolon status after the first reboot; if it is important, I can extract the logs from Elastic. After the second reboot the stolon cluster didn't come up.

I was able to bring it up by deleting the sentinel pods one by one. This behaviour is strange and should be addressed.

Honestly, I have no idea why it broke or why it recovered. Since I have a few production installations of stolon, I'm very interested in tracking down the cause so it can be fixed.

Helm vars

#   helm install stolon stable/stolon --version 1.6.1 --namespace db -f deploy/helm/stolon-postgresql.yaml
image:
  repository: 127.0.0.1:5000/stolon
  tag: "v0.16.0-pg11"
pgParameters:
  max_connections: "800"
  shared_buffers: 5GB
  effective_cache_size: 15GB
  maintenance_work_mem: 1280MB
  checkpoint_completion_target: "0.9"
  wal_buffers: 16MB
  default_statistics_target: "100"
  random_page_cost: "1.1"
  effective_io_concurrency: "300"
  work_mem: 3276kB
  min_wal_size: 2GB
  max_wal_size: 8GB
  max_worker_processes: "8"
  max_parallel_workers_per_gather: "4"
  max_parallel_workers: "8"
  max_parallel_maintenance_workers: "4"
persistence:
  storageClassName: local-storage
  size: 20G
keeper:
  resources:
    requests:
      cpu: 2000m
      memory: 2Gi
superuserSecret:
  name: superuser-cred
replicationSecret:
  name: replication-cred

serviceMonitor:
  enabled: true

The image 127.0.0.1:5000/stolon is built with this Dockerfile:

FROM sorintlab/stolon:v0.16.0-pg11
RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8

Failure state

kubectl -n db get pod

NAME                               READY   STATUS    RESTARTS   AGE
stolon-keeper-0                    1/1     Running   4          3d18h
stolon-keeper-1                    1/1     Running   1          3h6m
stolon-proxy-756cb878f-j6v9d       1/1     Running   1          3h11m
stolon-proxy-756cb878f-vc2vz       1/1     Running   1          3h11m
stolon-sentinel-85969d666d-bjpt7   1/1     Running   1          3h11m
stolon-sentinel-85969d666d-s2k68   1/1     Running   1          3h11m

stolon-keeper-0

2020-10-30T08:36:13.599Z    WARN    cmd/keeper.go:182   password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...   {"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z    WARN    cmd/keeper.go:182   password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...   {"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:36:13.600Z    INFO    cmd/keeper.go:2039  exclusive lock on data dir taken
2020-10-30T08:36:13.605Z    INFO    cmd/keeper.go:525   keeper uid  {"uid": "keeper0"}
2020-10-30T08:36:13.642Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:36:13.653Z    INFO    cmd/keeper.go:1047  our db boot UID is different than the cluster data one, waiting for it to be updated    {"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}

stolon-keeper-1

2020-10-30T08:31:59.938Z    WARN    cmd/keeper.go:182   password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...   {"file": "/etc/secrets/stolon-replication-cred/pg_repl_password", "mode": "01000000777"}
2020-10-30T08:31:59.947Z    WARN    cmd/keeper.go:182   password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...   {"file": "/etc/secrets/stolon-superuser-cred/pg_su_password", "mode": "01000000777"}
2020-10-30T08:31:59.948Z    INFO    cmd/keeper.go:2039  exclusive lock on data dir taken
2020-10-30T08:31:59.960Z    INFO    cmd/keeper.go:525   keeper uid  {"uid": "keeper1"}
2020-10-30T08:32:04.963Z    ERROR   cmd/keeper.go:765   error retrieving cluster data   {"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:04.974Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:07.474Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:09.973Z    ERROR   cmd/keeper.go:995   error retrieving cluster data   {"error": "failed to get latest version of configmap: Get https://10.12.0.1:443/api/v1/namespaces/db/configmaps/stolon-cluster-stolon?timeout=5s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:09.975Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:10.058Z    ERROR   cmd/keeper.go:824   failed to update keeper info    {"error": "update failed: failed to get latest version of pod: Get https://10.12.0.1:443/api/v1/namespaces/db/pods/stolon-keeper-1?timeout=5s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
2020-10-30T08:32:12.476Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:14.977Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T08:32:15.008Z    INFO    cmd/keeper.go:1047  our db boot UID is different than the cluster data one, waiting for it to be updated    {"bootUUID": "d686cc6e-f513-4dba-a8c1-84a98759018e", "clusterBootUUID": "4a2b3274-b174-4733-817d-9c45d79f6479"}
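
Both keepers end on the same message: the bootUUID they generated at startup differs from the one recorded in the cluster data, and they wait for the cluster data to be updated. A minimal sketch for pulling the two values out of keeper log lines to confirm the mismatch; the function name `boot_uuid_pairs` is hypothetical, and the pattern assumes the log format shown above.

```shell
# Hypothetical helper: reads keeper log lines on stdin and prints the
# keeper-side and cluster-data-side boot UUIDs from every
# "boot UID is different" message. Other lines are ignored.
boot_uuid_pairs() {
  sed -n 's/.*"bootUUID": *"\([^"]*\)", *"clusterBootUUID": *"\([^"]*\)".*/keeper=\1 cluster=\2/p'
}
```

It could be fed with something like `kubectl -n db logs stolon-keeper-0 | boot_uuid_pairs | sort -u`.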

For the sentinels I only have the last 3 log lines from my terminal, but if it is important I can extract the startup messages from Elastic.

kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-bjpt7

E1030 10:45:50.732489       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:45:53.992Z    INFO    cmd/sentinel.go:1830    no cluster data available, waiting for it to appear
E1030 10:45:54.538881       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)

kubectl -n db logs --tail=3 stolon-sentinel-85969d666d-s2k68

2020-10-30T10:46:02.512Z    INFO    cmd/sentinel.go:1830    no cluster data available, waiting for it to appear
E1030 10:46:05.094403       1 leaderelection.go:335] error initially creating leader election record: the server could not find the requested resource (post configmaps)
2020-10-30T10:46:07.514Z    INFO    cmd/sentinel.go:1830    no cluster data available, waiting for it to appear

kubectl -n db logs --tail=3 stolon-proxy-756cb878f-j6v9d

2020-10-30T10:46:17.615Z    INFO    cmd/proxy.go:203    no clusterdata available, closing connections to master
2020-10-30T10:46:22.616Z    INFO    cmd/proxy.go:203    no clusterdata available, closing connections to master
2020-10-30T10:46:27.618Z    INFO    cmd/proxy.go:203    no clusterdata available, closing connections to master

root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap status

failed to get leader election record: the server could not find the requested resource (get configmaps stolon-cluster-stolon)

root@stolon-sentinel-85969d666d-s2k68:/# stolonctl --cluster-name stolon --store-backend kubernetes --kube-resource-kind configmap clusterdata read

nil cluster data: <nil>

kubectl -n db get configmaps stolon-cluster-stolon -o yaml

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"4a5ed418","leaseDurationSeconds":15,"acquireTime":"2020-10-30T07:32:14Z","renewTime":"2020-10-30T08:22:29Z","leaderTransitions":4}'
    stolon-clusterdata: '{"formatVersion":1,"changeTime":"2020-10-30T08:22:28.33612147Z","cluster":{"uid":"99e55388","generation":1,"changeTime":"2020-10-30T07:33:18.825516594Z","spec":{"additionalWalSenders":null,"additionalMasterReplicationSlots":null,"initMode":"new","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"automaticPgRestart":null},"status":{"phase":"normal","master":"0db3798e"}},"keepers":{"keeper0":{"uid":"keeper0","generation":1,"changeTime":"2020-10-30T08:22:28.33659473Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T08:22:28.323974995Z","bootUUID":"bda6e01c-139e-4be4-96fa-4fbe003c9db3","postgresBinaryVersion":{"Maj":11,"Min":7}}},"keeper1":{"uid":"keeper1","generation":1,"changeTime":"2020-10-30T08:22:28.336588896Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T08:22:28.323975862Z","bootUUID":"4a2b3274-b174-4733-817d-9c45d79f6479","postgresBinaryVersion":{"Maj":11,"Min":7}}}},"dbs":{"0db3798e":{"uid":"0db3798e","generation":4,"changeTime":"2020-10-30T08:22:28.336414945Z","spec":{"keeperUID":"keeper0","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"master","followers":["f1fd6966"],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":4,"listenAddress":"10.12.73.74","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":692549936,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000020000000000000021"}},"f1fd6966":{"uid":"f1fd6966","generation":2,"changeTime":"2020-10-30T08:22:28.336579423Z","spec":{"keeperUID":"keeper1","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"0db3798e"},"followers":[],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.12.99.199","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":692549936,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000030000000000000028"}}},"proxy":{"generation":13,"changeTime":"2020-10-30T07:33:23.96407502Z","spec":{"masterDbUid":"0db3798e","enabledProxies":["029c5918","5ce0e683"]},"status":{}}}'
  creationTimestamp: "2020-10-16T07:38:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
          f:stolon-clusterdata: {}
    manager: stolon-sentinel
    operation: Update
    time: "2020-10-30T08:22:29Z"
  name: stolon-cluster-stolon
  namespace: db
  resourceVersion: "14867071"
  selfLink: /api/v1/namespaces/db/configmaps/stolon-cluster-stolon
  uid: 13d4175d-7930-4632-8c77-0ab583f94e33
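
One detail in this dump: the leader annotation was last renewed at 08:22:29 with a 15-second lease, while the sentinel logs above are from 10:45, so the lease had long expired and no sentinel had taken it over. A minimal sketch for reading holder and renew time from that annotation; the function name `leader_lease_info` is hypothetical, and the pattern assumes the field order shown in the annotation above.

```shell
# Hypothetical helper: reads the JSON value of the
# control-plane.alpha.kubernetes.io/leader annotation on stdin and
# prints the current holder and the last renew time.
leader_lease_info() {
  sed -n 's/.*"holderIdentity":"\([^"]*\)".*"renewTime":"\([^"]*\)".*/holder=\1 renewed=\2/p'
}
```

The annotation itself can be fetched with `kubectl -n db get configmap stolon-cluster-stolon -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'` and piped into the function.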

(Blind) recovery steps

1) delete one sentinel

kubectl -n db delete pod stolon-sentinel-85969d666d-bjpt7

pod "stolon-sentinel-85969d666d-bjpt7" deleted

kubectl -n db logs stolon-sentinel-85969d666d-dtqc5

2020-10-30T11:11:41.213Z    INFO    cmd/sentinel.go:1964    sentinel uid    {"uid": "012634c9"}
2020-10-30T11:11:41.229Z    INFO    cmd/sentinel.go:82  Trying to acquire sentinels leadership
I1030 11:11:41.231909       1 leaderelection.go:242] attempting to acquire leader lease  db/stolon-cluster-stolon...
I1030 11:11:58.717346       1 leaderelection.go:252] successfully acquired lease db/stolon-cluster-stolon
2020-10-30T11:11:58.717Z    INFO    cmd/sentinel.go:89  sentinel leadership acquired
E1030 11:12:06.857376       1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on configmaps "stolon-cluster-stolon": the object has been modified; please apply your changes to the latest version and try again

kubectl -n db logs stolon-sentinel-85969d666d-s2k68 --tail (no changes)

2) delete second sentinel

kubectl -n db delete pod stolon-sentinel-85969d666d-s2k68

pod "stolon-sentinel-85969d666d-s2k68" deleted

kubectl -n db logs stolon-sentinel-85969d666d-pd2vx

2020-10-30T11:12:41.266Z    INFO    cmd/sentinel.go:1964    sentinel uid    {"uid": "ed08f2bc"}
2020-10-30T11:12:41.275Z    INFO    cmd/sentinel.go:82  Trying to acquire sentinels leadership
I1030 11:12:41.275991       1 leaderelection.go:242] attempting to acquire leader lease  db/stolon-cluster-stolon...
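
The two deletions above can be sketched as a loop, in case someone needs to repeat the workaround. This is only what I did by hand, not a recommended fix; the function name `sentinel_pods` is hypothetical, the `stolon-sentinel-` prefix is taken from the pod listing above, and the 30-second pause is an arbitrary choice.

```shell
# Hypothetical helper: reads `kubectl get pod` output on stdin and
# prints only the names of the sentinel pods, one per line.
sentinel_pods() {
  awk '$1 ~ /^stolon-sentinel-/ { print $1 }'
}

# Possible usage (deletes the sentinels one at a time):
#   kubectl -n db get pod | sentinel_pods | while read -r pod; do
#     kubectl -n db delete pod "$pod"
#     sleep 30
#   done
```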

Partial logs of stolon-keeper-0

2020-10-30T11:12:06.438Z    INFO    cmd/keeper.go:1047  our db boot UID is different than the cluster data one, waiting for it to be updated    {"bootUUID": "69b3fc3f-7914-4a4e-8d3b-afdf443c6d18", "clusterBootUUID": "bda6e01c-139e-4be4-96fa-4fbe003c9db3"}
2020-10-30T11:12:08.668Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.169Z    ERROR   cmd/keeper.go:673   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2020-10-30T11:12:11.511Z    INFO    cmd/keeper.go:1457  our db requested role is master
2020-10-30T11:12:11.542Z    INFO    postgresql/postgresql.go:319    starting database
2020-10-30 11:12:11.591 UTC [3761] LOG:  listening on IPv4 address "10.12.73.96", port 5432
2020-10-30 11:12:11.594 UTC [3761] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-30 11:12:11.844 UTC [3762] LOG:  database system was interrupted; last known up at 2020-10-30 08:22:10 UTC
2020-10-30 11:12:11.851 UTC [3764] FATAL:  the database system is starting up
2020-10-30 11:12:11.861 UTC [3763] FATAL:  the database system is starting up
2020-10-30 11:12:11.872 UTC [3765] FATAL:  the database system is starting up
2020-10-30 11:12:11.881 UTC [3766] FATAL:  the database system is starting up
2020-10-30 11:12:11.969 UTC [3762] LOG:  database system was not properly shut down; automatic recovery in progress
2020-10-30 11:12:11.974 UTC [3762] LOG:  redo starts at 0/29475B58
2020-10-30 11:12:11.975 UTC [3762] LOG:  invalid record length at 0/29477D18: wanted 24, got 0
2020-10-30 11:12:11.975 UTC [3762] LOG:  redo done at 0/29477CC0
2020-10-30 11:12:11.975 UTC [3762] LOG:  last completed transaction was at log time 2020-10-30 08:22:28.71837+00
2020-10-30 11:12:11.992 UTC [3767] FATAL:  the database system is starting up
2020-10-30 11:12:12.037 UTC [3761] LOG:  database system is ready to accept connections
2020-10-30T11:12:12.079Z    INFO    cmd/keeper.go:1493  already master
2020-10-30T11:12:12.111Z    INFO    cmd/keeper.go:1626  postgres parameters not changed

kubectl -n db get configmaps stolon-cluster-stolon -o yaml

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"91aa7090","leaseDurationSeconds":15,"acquireTime":"2020-10-30T11:15:41Z","renewTime":"2020-10-30T11:36:54Z","leaderTransitions":6}'
    stolon-clusterdata: '{"formatVersion":1,"changeTime":"2020-10-30T11:36:52.818078244Z","cluster":{"uid":"99e55388","generation":1,"changeTime":"2020-10-30T07:33:18.825516594Z","spec":{"additionalWalSenders":null,"additionalMasterReplicationSlots":null,"initMode":"new","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"automaticPgRestart":null},"status":{"phase":"normal","master":"0db3798e"}},"keepers":{"keeper0":{"uid":"keeper0","generation":1,"changeTime":"2020-10-30T11:36:52.818190214Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T11:36:52.811756433Z","bootUUID":"69b3fc3f-7914-4a4e-8d3b-afdf443c6d18","postgresBinaryVersion":{"Maj":11,"Min":7}}},"keeper1":{"uid":"keeper1","generation":1,"changeTime":"2020-10-30T11:36:52.818192157Z","spec":{},"status":{"healthy":true,"lastHealthyTime":"2020-10-30T11:36:52.811757286Z","bootUUID":"d686cc6e-f513-4dba-a8c1-84a98759018e","postgresBinaryVersion":{"Maj":11,"Min":7}}}},"dbs":{"0db3798e":{"uid":"0db3798e","generation":4,"changeTime":"2020-10-30T11:36:52.818162694Z","spec":{"keeperUID":"keeper0","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"master","followers":["f1fd6966"],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":4,"listenAddress":"10.12.73.96","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":696785736,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000020000000000000021"}},"f1fd6966":{"uid":"f1fd6966","generation":2,"changeTime":"2020-10-30T11:36:52.818188051Z","spec":{"keeperUID":"keeper1","requestTimeout":"10s","maxStandbys":20,"additionalWalSenders":5,"additionalReplicationSlots":null,"initMode":"none","pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"pgHBA":null,"role":"standby","followConfig":{"type":"internal","dbuid":"0db3798e"},"followers":[],"synchronousStandbys":null,"externalSynchronousStandbys":null},"status":{"healthy":true,"currentGeneration":2,"listenAddress":"10.12.99.218","port":"5432","systemdID":"6884128581333794863","timelineID":3,"xLogPos":696785736,"timelinesHistory":[{"timelineID":1,"switchPoint":434393736,"reason":"no
      recovery target specified"},{"timelineID":2,"switchPoint":668609472,"reason":"no
      recovery target specified"}],"pgParameters":{"checkpoint_completion_target":"0.9","datestyle":"iso,
      mdy","default_statistics_target":"100","default_text_search_config":"pg_catalog.english","dynamic_shared_memory_type":"posix","effective_cache_size":"15GB","effective_io_concurrency":"300","lc_messages":"en_US.utf8","lc_monetary":"en_US.utf8","lc_numeric":"en_US.utf8","lc_time":"en_US.utf8","log_timezone":"Etc/UTC","maintenance_work_mem":"1280MB","max_connections":"800","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_wal_size":"8GB","max_worker_processes":"8","min_wal_size":"2GB","random_page_cost":"1.1","shared_buffers":"5GB","ssl":"off","timezone":"Etc/UTC","wal_buffers":"16MB","wal_level":"replica","work_mem":"3276kB"},"synchronousStandbys":null,"olderWalFile":"000000030000000000000028"}}},"proxy":{"generation":17,"changeTime":"2020-10-30T11:16:15.29551675Z","spec":{"masterDbUid":"0db3798e","enabledProxies":["52052d64"]},"status":{}}}'
  creationTimestamp: "2020-10-16T07:38:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
          f:stolon-clusterdata: {}
    manager: stolon-sentinel
    operation: Update
    time: "2020-10-30T11:36:54Z"
  name: stolon-cluster-stolon
  namespace: db
  resourceVersion: "14940223"
  selfLink: /api/v1/namespaces/db/configmaps/stolon-cluster-stolon
  uid: 13d4175d-7930-4632-8c77-0ab583f94e33
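
After the recovery, the cluster data carries the bootUUIDs that the keepers had been reporting in their logs (69b3fc3f-... for keeper0, d686cc6e-... for keeper1). A minimal sketch for listing the bootUUIDs stored in the cluster data, so they can be compared with the keeper logs; the function name `clusterdata_boot_uuids` is hypothetical, and it assumes the compact JSON layout of the `stolon-clusterdata` annotation shown above.

```shell
# Hypothetical helper: reads the stolon-clusterdata JSON on stdin and
# prints every bootUUID value it contains, in order of appearance.
clusterdata_boot_uuids() {
  grep -o '"bootUUID":"[^"]*"' | cut -d '"' -f 4
}
```

The input would come from something like `kubectl -n db get configmap stolon-cluster-stolon -o jsonpath='{.metadata.annotations.stolon-clusterdata}'`.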

What you expected to happen: The cluster should survive a sudden reboot without losses.

How to reproduce it (as minimally and precisely as possible): Play around with the power source ¯\_(ツ)_/¯

Anything else we need to know?:

Environment:

Stolon version: v0.16.0-pg11
Stolon running environment (if useful to understand the bug): k8s baremetal, installed with Helm
Others: I just noticed that my Helm chart is version 1.6.1 while my actual stolon version is v0.16.0

sorintlab / stolon

Cluster down after all k8s nodes rebooted #806