nh2 opened 4 years ago
As a side note, I also have some requests that might have made the immediate investigation faster for us:
- Could `stolonctl status` show when a node knows it has an issue? That does not seem to be the case currently.
- Could `stolonctl status` show the DB UUIDs next to the keepers?
- Could `stolonctl status` show the host name next to the UUIDs?
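A manual workaround that roughly covers the second point, as a sketch only: the `clusterdata` subcommand and the JSON field paths are from memory and may differ between Stolon versions, and the cluster name and store flags are placeholders.

```
# Sketch: map each DB UID to its keeper by reading the cluster data from the store.
# Field names are from memory and may differ between Stolon versions.
stolonctl --cluster-name stolon-cluster --store-backend consul clusterdata \
  | jq -r '.dbs | to_entries[] | "\(.key) -> keeper \(.value.spec.keeperUID)"'
```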
From the 3 attached sentinel logs, I believe this excerpt from node-3 might be most relevant (but you can judge that better):
2020-04-24T02:49:12.780Z ERROR cmd/sentinel.go:114 election loop error {"error": "Put \"http://127.0.0.1:8500/v1/session/create?wait=5000ms\": dial tcp 127.0.0.1:8500: connect: connection refused"}
2020-04-24T02:49:20.153Z ERROR cmd/sentinel.go:1735 error retrieving cluster data {"error": "Unexpected response code: 500"}
2020-04-24T02:49:22.781Z INFO cmd/sentinel.go:94 Trying to acquire sentinels leadership
2020-04-24T02:49:22.788Z INFO cmd/sentinel.go:101 sentinel leadership acquired
2020-04-24T02:49:26.287Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "727e89a6", "db": "261d3df3", "keeper": "node_1"}
2020-04-24T02:49:26.287Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "43769c6e", "db": "50db7617", "keeper": "node_2"}
2020-04-24T02:49:26.287Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "dbc050dc", "db": "e25ff760", "keeper": "node_3"}
2020-04-24T02:49:28.356Z WARN cmd/sentinel.go:282 received db state for unexpected db uid {"receivedDB": "dbc050dc", "db": "e25ff760", "keeper": "node_3"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "261d3df3"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "50db7617"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "261d3df3", "keeper": "node_1"}
2020-04-24T02:49:28.369Z INFO cmd/sentinel.go:1326 synchronousStandbys changed {"masterDB": "e25ff760", "prevSynchronousStandbys": {"261d3df3":{},"50db7617":{}}, "synchronousStandbys": {"261d3df3":{}}}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:441 followed instance timeline forked before our current state {"followedTimeline": 99, "followedXlogpos": 187987722016, "timeline": 98, "xlogpos": 187987722344}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1161 removing invalid db {"db": "261d3df3", "keeper": "node_1"}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1222 removing non existent db from synchronousStandbys {"masterDB": "e25ff760", "db": "261d3df3"}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1326 synchronousStandbys changed {"masterDB": "e25ff760", "prevSynchronousStandbys": {"261d3df3":{}}, "synchronousStandbys": {}}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:32.461Z INFO cmd/sentinel.go:1428 added new standby db {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:49:36.549Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:38.603Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:40.629Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:42.655Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:44.675Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:46.697Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:48.726Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:50.756Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:52.781Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:54.809Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:56.845Z INFO cmd/sentinel.go:1333 using a fake synchronous standby since there are not enough real standbys available {"masterDB": "e25ff760", "required": 1}
2020-04-24T02:49:58.904Z INFO cmd/sentinel.go:1272 adding new synchronous standby in good state trying to reach MaxSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
2020-04-24T02:49:58.904Z INFO cmd/sentinel.go:1315 merging current and previous synchronous standbys {"masterDB": "e25ff760", "prevSynchronousStandbys": {}, "synchronousStandbys": {"941861ea":{}}}
2020-04-24T02:49:58.904Z INFO cmd/sentinel.go:1326 synchronousStandbys changed {"masterDB": "e25ff760", "prevSynchronousStandbys": {}, "synchronousStandbys": {"941861ea":{}}}
2020-04-24T02:51:16.114Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:18.168Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:20.222Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:22.272Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:25.171Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:51:25.171Z WARN cmd/sentinel.go:271 no keeper info available {"db": "e25ff760", "keeper": "node_3"}
2020-04-24T02:51:25.176Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "941861ea"}
2020-04-24T02:51:25.176Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
2020-04-24T02:55:53.324Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:55:55.357Z ERROR cmd/sentinel.go:1786 failed to get proxies info {"error": "unexpected end of JSON input"}
2020-04-24T02:56:33.918Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:56:35.976Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:04.337Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:06.413Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:08.461Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:10.511Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:12.564Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:12.568Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "941861ea"}
2020-04-24T02:58:12.568Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:14.595Z WARN cmd/sentinel.go:271 no keeper info available {"db": "941861ea", "keeper": "node_1"}
2020-04-24T02:58:14.599Z INFO cmd/sentinel.go:1234 removing failed synchronous standby {"masterDB": "e25ff760", "db": "941861ea"}
2020-04-24T02:58:14.599Z INFO cmd/sentinel.go:1292 adding previous synchronous standby to reach MinSynchronousStandbys {"masterDB": "e25ff760", "synchronousStandbyDB": "941861ea", "keeper": "node_1"}
This happened on our production cluster of 3 DBs with sync replication, running Stolon 0.10 with cherry-picked workarounds for #494 (we are preparing to upgrade to the latest Stolon but need to be very careful with it). What follows is my current state of knowledge:
What happened:
The cluster degenerated into a state where a master was being written to while the other two nodes did not follow it, for a week. After that time, one of the other DBs got promoted to master carrying 6-day-old data, so a lot of things our customers had created since then disappeared.
Mitigation:
We managed to avoid any harm by quickly identifying the problem and restoring the data from the underlying Postgres DB that still had the newer data, eventually ensuring that no customer data was lost.
We wrote a postmortem for it and are sharing the Stolon details here, in the hope that we can determine the underlying issue and what can be done about it.
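For readers who end up in a similar spot, a rough sketch of this kind of rescue; paths, port and database name are placeholders, and this is not our exact procedure:

```
# Sketch only: start the Postgres instance that still holds the newer data
# standalone on a side port (with the keeper stopped, so Stolon cannot touch it),
# then dump what is missing from the promoted stale master.
pg_ctl start -D /path/to/keeper/postgres -o "-p 5433"
pg_dump -p 5433 --format=custom mydb > /backup/mydb-rescued.dump
# Review the dump and merge the missing rows back into the live cluster by hand;
# a blind pg_restore would clash with rows written after the failover.
```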
My analysis so far
From the `stolon-keeper` logs, I identified 2 relevant events in time. In the first event, one machine becomes the sole master of the cluster, with all replication to the other nodes stopping. In the second event, another machine with stale data becomes master.
The 3 machines involved are called `node-1`, `node-2` and `node-3`. The consul errors mentioned are due to a short disruption of network connectivity that Stolon should likely be able to handle. Below is a short summary of what I concluded from the logs (log files attached to verify).

Event 1: 2020-04-24T02:49:00.625Z
- `43769c6e` (node_2) -> consul error, rewind, diverged WAL, following `e25ff760` (node_3)
- `{"db": "43769c6e", "cdDB": "50db7617"}` -> db stays down
- `43769c6e` (node_2) -> consul error, starting DB -> master
- `needed synchronous_standby_names changed {"prevSyncStandbyNames": "stolon_261d3df3", "syncStandbyNames": "stolonfakestandby"}`
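The `stolonfakestandby` value above is visible from Postgres itself, so a check along these lines (connection details are placeholders) would have shown immediately that the master had no real synchronous standby attached:

```
# Sketch: run against the current master.
psql -h node-3 -U postgres -c "SHOW synchronous_standby_names;"
# 'stolonfakestandby' is a placeholder standby name that no real keeper is
# expected to use, i.e. no actual synchronous standby exists.
psql -h node-3 -U postgres -c \
  "SELECT application_name, state, sync_state FROM pg_stat_replication;"
# Zero rows here, while the cluster is supposed to be synchronous, means nothing
# is streaming from this master at all.
```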
Event 2: 2020-04-30T14:54:02.021Z
- `e25ff760` (node_3) -> `db UID different` -> following `43769c6e` (node_2) -> connection parameters changed, restarting DB -> following `43769c6e` (node_2)
- `dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory` since Event 1 -> `db UID different`, starting DB -> master
- `needed synchronous_standby_names changed {"prevSyncStandbyNames": "2 (stolon_727e89a6,stolon_dbc050dc)", "syncStandbyNames": "stolon_dbc050dc"}` (`dbc050dc` is node_3)
- `db UID different` -> `cannot move from master role to standby role` (to follow `43769c6e`, node_2)
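When timelines have forked like this, comparing the data directories directly shows which copy is actually ahead; a sketch, with placeholder paths:

```
# Sketch: compare the timeline and checkpoint position of each node's data directory.
for dir in /data/node-1/postgres /data/node-2/postgres /data/node-3/postgres; do
  echo "== $dir"
  pg_controldata "$dir" | grep -E "TimeLineID|checkpoint location"
done
# Diverged timelines confirm the fork; the checkpoint locations give a rough idea
# of which copy contains the most recent writes (node-3 in our case).
```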
Relevant log extracts
event-1-node-1-keeper.txt event-1-node-1-proxy.txt event-1-node-2-keeper.txt event-1-node-2-proxy.txt event-1-node-3-keeper.txt event-1-node-3-proxy.txt event-2-node-1-keeper.txt event-2-node-1-proxy.txt event-2-node-2-keeper.txt event-2-node-2-proxy.txt event-2-node-3-keeper.txt event-2-node-3-proxy.txt node-1-sentinel.txt node-2-sentinel.txt node-3-sentinel.txt
Summary
- Event 1: `node-3` became master with no standbys mirroring it (`syncStandbyNames` == `stolonfakestandby`). `node-2` was down from here, `node-1` claimed to follow the master `node-3`.
- Event 2: `node-2` starts its DB again and becomes master (with old contents!). `node-1` follows it. `node-3` is told to follow it but cannot (which is what allowed us to rescue the data).

Important questions
- How could `node-2` become master even though it had old contents?

Thank you!