sorintlab / stolon

PostgreSQL cloud native High Availability and more.

Support for switchover (controlled failover) #235

Open kagkarlsson opened 7 years ago

kagkarlsson commented 7 years ago

As far as I can see, stolon does not support a switchover operation, i.e. a controlled failover where there are strong guarantees that no data will be lost, even when using asynchronous replication. It could, for example, be used for database minor-version upgrades or OS updates.

Are there any plans for supporting this in the future?

theothermike commented 7 years ago

Agreed, this would be a great feature for upgrading hardware and software, rather than 'kubectl delete pod' or however you'd force it now

sgotti commented 7 years ago

@kagkarlsson the main question is how to do this.

Right now this works with synchronous replication because it's one of its guarantees. With asynchronous replication there's no guarantee that no data will be lost. To achieve this you have to stop every write transaction on the master and make sure that the standby you are switching to is synced. To do this, assuming that all the transactions flow through the stolon proxy, you would have to shut down the proxies (or tell them to stop accepting new connections and close the current ones).

This could be implemented (and we'll happily accept and review a PR).

But instead of doing all of this, which will also cause more downtime, why not just temporarily enable synchronous replication and stop the current master? This should already be possible right now: https://github.com/sorintlab/stolon/blob/master/doc/syncrepl.md. Thoughts?
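
In practice it would look roughly like this (untested sketch; the cluster name and store backend are placeholders, adjust them and the flags to your setup and stolon version):

```
# 1. temporarily enable synchronous replication (cluster spec patch)
stolonctl --cluster-name mycluster --store-backend etcd \
  update --patch '{ "synchronousReplication" : true }'

# 2. wait until the standby you want to switch to is reported as a synchronous
#    standby (see the pg_stat_replication checks later in this thread), then stop
#    the current master keeper, e.g. kubectl delete pod <master-keeper-pod> or
#    stop the stolon-keeper process; the sentinel will promote the synchronous
#    standby without losing committed transactions

# 3. once the new master is up, disable synchronous replication again
stolonctl --cluster-name mycluster --store-backend etcd \
  update --patch '{ "synchronousReplication" : false }'
```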

albertvaka commented 7 years ago

@sgotti That's a nice solution! Maybe it should be documented somewhere as the recommended procedure until it can be automatically done with a command.

sgotti commented 7 years ago

@albertvaka Would you like to create a PR to document this?

kagkarlsson commented 7 years ago

This sounds like an acceptable solution @sgotti. When posting the issue I was not aware of that possibility in Postgres. Are current transactions also made synchronous after the reload? Or will it only affect new transactions?

sgotti commented 7 years ago

Are current transactions also made synchronous after the reload? Or will it only affect new transactions?

Yeah, the synchronous replication happens when issuing a commit. For example, if you don't have synchronous replication enabled, BEGIN a new transaction, do some work, then enable synchronous replication (wait for the master to have configured and applied its synchronous_standby_names parameter) and execute the COMMIT, it'll be a synchronous commit. I just tried it right now with postgres 9.6.
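
For reference, the experiment looks roughly like this on a plain 9.6 primary/standby pair (untested sketch; the standby name is a placeholder, and in a stolon cluster the keeper manages this parameter for you when synchronousReplication is enabled):

```
# session A: open a transaction while replication is still asynchronous
psql testdb
#   testdb=# BEGIN;
#   testdb=# INSERT INTO t VALUES (1);   -- don't COMMIT yet

# session B, on the master: enable synchronous replication and reload
psql -c "ALTER SYSTEM SET synchronous_standby_names = 'standby1'"   # placeholder name
psql -c "SELECT pg_reload_conf()"

# session A again: this COMMIT now waits for the synchronous standby,
# even though the transaction was started before the reload
#   testdb=# COMMIT;
```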

(BTW, while connections are usually pooled and so long living, transactions should be short lived.)

If you are interested in how postgres sync repl works, take a look at its docs: https://www.postgresql.org/docs/9.6/static/warm-standby.html#SYNCHRONOUS-REPLICATION

and be aware of the explained limitations, since sync repl is obtained by first committing the transaction to the master WAL and then waiting for the WAL to be replicated and applied by the standby.
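
For instance, before relying on this for a switchover you can check on the master that the standby you're going to switch to is actually reported as synchronous (a quick check, not stolon specific):

```
# on the current master: sync_state should be 'sync' for the standby you are
# going to switch to ('potential' or 'async' means it is not synchronous yet)
psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication"
```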

kagkarlsson commented 7 years ago

When they say

If primary restarts while commits are waiting for acknowledgement, those waiting transactions will be marked fully committed once the primary database recovers.

I suppose they are talking about an abrupt restart of the primary, not a graceful shutdown which would be done in a switchover scenario. I assume the primary waits for acknowledgements from the standbys before shutting down completely.

c2h5oh commented 7 years ago

I need this ASAP, so I'm working on implementing it. I've opened a pull request for the underlying proxy library to add the missing logic: https://github.com/sorintlab/pollon/pull/6

It's going to be roughly an equivalent of PgBouncer session pool mode:

- pause: accept incoming connections, but don't connect to the destination; optionally close connections opened before the pause after a timeout
- update configuration
- resume: start opening connections to the destination again

As long as you keep the connection close timeout longer than the max connection lifetime on the client it should be completely seamless.

Stretch goal: option to wait for replication lag to drop to 0 before resume.
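
For the stretch goal, a possible way to measure that lag on a 9.6 master (sketch; on newer versions the functions and columns use wal/lsn names):

```
# bytes of WAL the standby still has to replay; 0 means fully caught up
psql -c "SELECT application_name,
                pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
         FROM pg_stat_replication"
```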

sgotti commented 7 years ago

@c2h5oh glad you want to work on this. IMHO I still think that the solution proposed here (temporarily enable sync repl) is one of the safest and fastest. Is there something in that solution that won't work for you?

About your idea:

pause: accept incoming connections, but don't connect to destination; optionally close connections opened before pause after timeout

Switching master will always forcibly close connections to the old master.

update configuration
resume: start opening connections to destination again

Not sure why you want to pause and resume. Why can't you just stop accepting connections, like is done now when a new master is elected? The client will just receive a connection refused until the proxy is instructed with a new master. Accepting a connection and waiting to proxy it to the db instance will make the client wait instead of receiving a connection refused, but I see it as bad behavior (it's the client that has to handle connection errors and we shouldn't fake it) and this will introduce two different proxy behaviors.

c2h5oh commented 7 years ago

I'm essentially replicating PgBouncer functionality.

Switching master will always forcibly close connections to the old master.

The changes I've made to pollon (backwards compatible - I only added functions, see the pull request) give you an option to let those connections finish, with an optional timeout to put a cap on how long this is going to take and to handle clients with lazy disconnect for pooled connections (I'm looking at you, golang's database/sql).

Not sure why you want to pause and resume. Why can't you just stop accepting connections, like is done now when a new master is elected? The client will just receive a connection refused until the proxy is instructed with a new master.

My goal is to provide a switchover that, when it works right, is completely transparent to clients and externally manifests as nothing more than a slower connect time - with a sufficiently low master election time it allows clients to continue as if nothing happened, even dumb clients lacking proper connection error recovery.

One could also argue that in many situations this might be the preferred option vs hammering the proxy with reconnect attempts - the client is not cluster-status aware so it will just blindly retry; the stolon proxy, on the other hand, can let the connections through as soon as a new master is elected.

Accepting a connection and waiting to proxy it to the db instance will make the client wait instead of receiving a connection refused, but I see it as bad behavior (it's the client that has to handle connection errors and we shouldn't fake it) and this will introduce two different proxy behaviors.

Like with my pollon pull request I have no intention of changing the default behaviour - I'll be making it optional.

sgotti commented 7 years ago

I'm essentially replicating PgBouncer functionality.

I'm not sure it makes sense to replicate that functionality. PgBouncer has a very different purpose than the stolon proxy: it acts as a layer 7 proxy that talks the postgresql protocol and is used as an intermediate connection pool for clients that don't pool connections, while the stolon proxy is just a layer 4 proxy born with the intent of making clients go to the right master (avoiding clients connecting to, or keeping connections to, an old partitioned master).

When a new master is elected the sentinel will tell the proxies to close all the connections and stop accepting new ones, and when the new master reports itself as promoted the sentinel will tell the proxies to accept connections and proxy them to the new master.

Keeping a connection to an old master open for some time while a new master has already been promoted will cause very bad behaviors: for example, clients could keep writing to the old master and those writes would never reach the new one.

My goal is to provide a switchover that, when it works right, is completely transparent to clients and externally manifests as nothing more than a slower connect time - with a sufficiently low master election time it allows clients to continue as if nothing happened, even dumb clients lacking proper connection error recovery.

When a new master is elected there's the need to immediately close connections to the old one, so there's no way to make it transparent: the connections to the old master MUST be closed.

Keeping them open, as explained above, will defeat one of the main purposes of stolon: data consistency.

So it cannot be transparent: the client needs to reconnect, and the proxy will route those new connections to the new master.

But also making the stolon proxy a layer 7 proxy that talks the postgresql protocol, decoupling clients from the db (adding a lot of logic behind it, like closing prepared statements, saving and cleaning session data, etc...), will not solve the problem. You'll always have to close connections to the old master, so nothing will really change to make it transparent.

IMHO the current behavior is the only possible one, independently of how a client behaves.

c2h5oh commented 7 years ago

When a new master is elected there's the need to immediately close connections to the old one, so there's no way to make it transparent: the connections to the old master MUST be closed.

For a manual switchover this is not mutually exclusive (a rough sketch of the sequence follows the list):

  1. Pause connections with timeout (give connections 1-2s to finish)
  2. Enable sync replication
  3. Wait for both replication lag to get to 0 and connection timeout
  4. Elect new master
  5. Resume
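
A rough sketch of that sequence (the pause/resume commands are hypothetical and depend on the proposed proxy changes; the rest uses existing stolonctl and psql commands, with placeholder cluster name and store backend):

```
# 1. pause the proxies (hypothetical, needs the proposed proxy changes)
#    e.g. stolonctl proxy-pause --close-timeout 2s
# 2. enable synchronous replication (existing cluster spec option)
stolonctl --cluster-name mycluster --store-backend etcd \
  update --patch '{ "synchronousReplication" : true }'
# 3. wait until the standby has replayed everything (run against the master
#    directly, not through the paused proxy)
until [ "$(psql -At -c "SELECT COALESCE(max(pg_xlog_location_diff(pg_current_xlog_location(), replay_location)), 1) FROM pg_stat_replication")" = "0" ]; do
  sleep 1
done
# 4. stop the current master keeper so the sentinel promotes the synced standby,
#    e.g. kubectl delete pod <master-keeper-pod>
# 5. resume the proxies (hypothetical)
#    e.g. stolonctl proxy-resume
```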

But also making the stolon proxy a layer 7 proxy that talks the postgresql protocol, decoupling clients from the db (adding a lot of logic behind it, like closing prepared statements, saving and cleaning session data, etc...), will not solve the problem. You'll always have to close connections to the old master, so nothing will really change to make it transparent.

This is actually unrelated. Prepared statement cleanup is only performed when the client disconnects, but PgBouncer keeps the connection alive in the pool for future reuse.

sgotti commented 7 years ago

For a manual switchover this is not mutually exclusive:

1. Pause connections with timeout (give connections 1-2s to finish)
2. Enable sync replication
3. Wait for both replication lag to get to 0 and connection timeout
4. Elect new master
5. Resume

Not sure if I get the complete idea without looking at the implementation. Sorry if I missed this before, but now it looks like automating this with the addition of a small pause (I can't completely understand how this could be useful with pooled clients) and the proxy accepting incoming connections but keeping them queued until the new master is ready.

Given the declarative nature of the cluster specification I'm not sure how to declare this "switchover" operation inside it. In addition it'll become quite complicated to manage all the state changes of this workflow and recover from all the possible errors.

I'd rather see this as something on top of the sentinel logic, like an external tool that orchestrates this workflow (which could become a stolonctl command).

c2h5oh commented 7 years ago

Not sure if I get the complete idea without looking at the implementation. Sorry if I missed this before, but now it looks like automating this with the addition of a small pause (I can't completely understand how this could be useful with pooled clients) and the proxy accepting incoming connections but keeping them queued until the new master is ready.

Yes, that's exactly it.

Given the declarative nature of the cluster specification I'm not sure how to declare this "switchover" operation inside it. In addition it'll become quite complicated to manage all the state changes of this workflow and recover from all the possible errors.

Step 1: Declare cluster paused - all proxies should start queuing new connections
Step 2a: Elect new master from sync replicas if there is one
Step 2b: If there are no sync replicas, make one sync and once replication lag is 0 elect it new master
Step 3: Declare cluster active - all proxies should start connecting queued connections to the new master
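
Purely to illustrate the shape of it (the paused field below is hypothetical; nothing like it exists in the cluster spec today):

```
# hypothetical spec field; stolon has no such option today
stolonctl --cluster-name mycluster --store-backend etcd \
  update --patch '{ "proxiesPaused": true }'
# ... the sentinel elects / syncs the new master, then:
stolonctl --cluster-name mycluster --store-backend etcd \
  update --patch '{ "proxiesPaused": false }'
```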

To answer how this is useful to clients: