nh2 opened 4 years ago
As a side note, I also have some requests that might have made the immediate investigation faster for us:
- Could `stolonctl status` show when a node knows it has an issue? That does not seem to be the case currently.
- Could `stolonctl status` show the DB UUIDs next to the keepers?
- Could `stolonctl status` show the host name next to the UUIDs?
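A manual workaround that roughly covers the second point, as a sketch only: the `clusterdata` subcommand and the JSON field paths are from memory and may differ between Stolon versions, and the cluster name and store flags are placeholders.

```
# Sketch: map each DB UID to its keeper by reading the cluster data from the store.
# Field names are from memory and may differ between Stolon versions.
stolonctl --cluster-name stolon-cluster --store-backend consul clusterdata \
  | jq -r '.dbs | to_entries[] | "\(.key) -> keeper \(.value.spec.keeperUID)"'
```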
From the 3 attached sentinel logs, I believe this excerpt from node-3 might be most relevant (but you can judge that better):
2020-04-24T02:49:12.780Z ERROR cmd/sentinel.go:114 election loop error {"error": "Put \"http://127.0.0.1:8500/v1/session/create?wait=5000ms\": dial tcp 127.0.0.1:8500: connect: connection refused"}
2020-04-24T02:49:20.153Z ERROR cmd/sentinel.go:1735 error retrieving cluster data {"error": "Unexpected response code: 500"}
2020-04-24T02:49:22.781Z INFO cmd/sentinel.go:94 Trying to acquire sentinels leadership
2020-04-24T02:49:22.788Z INFO cmd/sentinel.go:101 sentinel leadership acquired
2020-04-24T02:49:26.287Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "727e89a6", "db": "261d3df3", "keeper": "node_1"}
2020-04-24T02:49:26.287Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "43769c6e", "db": "50db7617", "keeper": "node_2"}
2020-04-24T02:49:26.287Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "dbc050dc", "db": "e25ff760", "keeper": "node_3"}
2020-04-24T02:49:28.356Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "dbc050dc", "db": "e25ff760", "keeper": "node_3"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "261d3df3"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "50db7617"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "261d3df3", "keeper": "node_1"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1326 synchronousStandbys changed {"masterDB": "e25ff760", "prevSynchronousStandbys": {"261d3df3":{},"50db7617":{}}, "synchronousStandbys": {"261d3df3":{}}}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:441 followed instance timeline forked before our current state {"followedTimeline": 99, "followedXlogpos": 187987722016, "timeline": 98, "xlogpos": 187987722344}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1161 removing invalid db {"db": "261d3df3", "keeper": "node_1"}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1222 removing non existent db from synchronousStandbys {"masterDB": "e25ff760", "db": "261d3df3"}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1326 synchronousStandbys changed {"masterDB": "e25ff760", "prevSynchronousStandbys": {"261d3df3":{}}, "synchronousStandbys": {}}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1428 added new standby db {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:49:36.549Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:38.603Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:40.629Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:42.655Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:44.675Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:46.697Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:48.726Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:50.756Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:52.781Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:54.809Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:56.845Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:58.904Z INFO cmd/sentinel.go:1272 adding new synchronous standby in good state trying to reach MaxSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
2020-04-24T02:49:58.904Z INFO cmd/sentinel.go:1315 merging current and previous synchronous standbys {"masterDB": "e25ff760", "prevSynchronousStandbys": {}, "synchronousStandbys": {"941861ea":{}}}
2020-04-24T02:49:58.904Z INFO cmd/sentinel.go:1326 synchronousStandbys changed {"masterDB": "e25ff760", "prevSynchronousStandbys": {}, "synchronousStandbys": {"941861ea":{}}}
2020-04-24T02:51:16.114Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:18.168Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:20.222Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:22.272Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:25.171Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:25.171Z WARN cmd/sentinel.go:271 no keeper info available {"db": "e25ff760", "keeper": "node_3"}
2020-04-24T02:51:25.176Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "941861ea"}
2020-04-24T02:51:25.176Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
2020-04-24T02:55:53.324Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:55:55.357Z ERROR cmd/sentinel.go:1786 failed to get proxies info {"error": "unexpected end of JSON input"}
2020-04-24T02:56:33.918Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:56:35.976Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:04.337Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:06.413Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:08.461Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:10.511Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:12.564Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:12.568Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "941861ea"}
2020-04-24T02:58:12.568Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:14.595Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:14.599Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "941861ea"}
2020-04-24T02:58:14.599Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
This happened on our production cluster of 3 DBs with sync replication, running Stolon 0.10 with cherry-picked workarounds for #494 (we are preparing to upgrade to the latest Stolon but need to be very careful with it). What follows is my current state of knowledge:
What happened:
The cluster degenerated into a state where a master was being written to while the other two nodes did not follow it, for a week. After that time, one of the other DBs got promoted to master carrying 6-day-old data, so a lot of things our customers had created since then disappeared.
Mitigation:
We managed to avoid any harm by quickly identifying the problem and restoring the data from the underlying Postgres DB that still had the newer data, eventually ensuring that no customer data was lost.
We wrote a postmortem for it and are sharing the Stolon details here, in the hope that we can determine the underlying issue and what can be done about it.
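For readers who end up in a similar spot, a rough sketch of this kind of rescue; paths, port and database name are placeholders, and this is not our exact procedure:

```
# Sketch only: start the Postgres instance that still holds the newer data
# standalone on a side port (with the keeper stopped, so Stolon cannot touch it),
# then dump what is missing from the promoted stale master.
pg_ctl start -D /path/to/keeper/postgres -o "-p 5433"
pg_dump -p 5433 --format=custom mydb > /backup/mydb-rescued.dump
# Review the dump and merge the missing rows back into the live cluster by hand;
# a blind pg_restore would clash with rows written after the failover.
```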
My analysis so far
From the `stolon-keeper` logs, I identified 2 relevant events in time. In the first event, one machine becomes the sole master of the cluster, with all replication to the other nodes stopping. In the second event, another machine with stale data becomes master.
The 3 machines involved are called `node-1`, `node-2` and `node-3`. The consul errors mentioned are due to a short disruption of network connectivity that Stolon should likely be able to handle. Below is a short summary of what I concluded from the logs (log files attached to verify).

Event 1: 2020-04-24T02:49:00.625Z
- `43769c6e` (node_2) -> consul error, rewind, diverged WAL, following `e25ff760` (node_3)
- `{"db": "43769c6e", "cdDB": "50db7617"}` -> db stays down
- `43769c6e` (node_2) -> consul error, starting DB -> master
- `needed synchronous_standby_names changed {"prevSyncStandbyNames": "stolon_261d3df3", "syncStandbyNames": "stolonfakestandby"}`
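The `stolonfakestandby` value above is visible from Postgres itself, so a check along these lines (connection details are placeholders) would have shown immediately that the master had no real synchronous standby attached:

```
# Sketch: run against the current master.
psql -h node-3 -U postgres -c "SHOW synchronous_standby_names;"
# 'stolonfakestandby' is a placeholder standby name that no real keeper is
# expected to use, i.e. no actual synchronous standby exists.
psql -h node-3 -U postgres -c \
  "SELECT application_name, state, sync_state FROM pg_stat_replication;"
# Zero rows here, while the cluster is supposed to be synchronous, means nothing
# is streaming from this master at all.
```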
Event 2: 2020-04-30T14:54:02.021Z
- `e25ff760` (node_3) -> `db UID different` -> following `43769c6e` (node_2) -> connection parameters changed, restarting DB -> following `43769c6e` (node_2)
- `dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory` since Event 1 -> `db UID different`, starting DB -> master
- `needed synchronous_standby_names changed {"prevSyncStandbyNames": "2 (stolon_727e89a6,stolon_dbc050dc)", "syncStandbyNames": "stolon_dbc050dc"}` (`dbc050dc` is node_3)
- `db UID different` -> `cannot move from master role to standby role` (to follow `43769c6e`, node_2)
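When timelines have forked like this, comparing the data directories directly shows which copy is actually ahead; a sketch, with placeholder paths:

```
# Sketch: compare the timeline and checkpoint position of each node's data directory.
for dir in /data/node-1/postgres /data/node-2/postgres /data/node-3/postgres; do
  echo "== $dir"
  pg_controldata "$dir" | grep -E "TimeLineID|checkpoint location"
done
# Diverged timelines confirm the fork; the checkpoint locations give a rough idea
# of which copy contains the most recent writes (node-3 in our case).
```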
Relevant log extracts
event-1-node-1-keeper.txt event-1-node-1-proxy.txt event-1-node-2-keeper.txt event-1-node-2-proxy.txt event-1-node-3-keeper.txt event-1-node-3-proxy.txt event-2-node-1-keeper.txt event-2-node-1-proxy.txt event-2-node-2-keeper.txt event-2-node-2-proxy.txt event-2-node-3-keeper.txt event-2-node-3-proxy.txt node-1-sentinel.txt node-2-sentinel.txt node-3-sentinel.txt
Summary
- Event 1: `node-3` became master with no standbys mirroring it (`syncStandbyNames` == `stolonfakestandby`). `node-2` was down from here, `node-1` claimed to follow the master `node-3`.
- Event 2: `node-2` starts its DB again and becomes master (with old contents!). `node-1` follows it. `node-3` is told to follow it but cannot (which is what allowed us to rescue the data).

Important questions
- How could `node-2` become master even though it had old contents?

Thank you!