hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Consider providing pluggable pg_basebackup alternative #832

Closed redbaron closed 2 years ago

redbaron commented 2 years ago

It would be great if pg_basebackup were replaceable. Currently the only way to bring up a new follower is pg_basebackup, which works, but alternatives exist. Restoring from a backup and a WAL archive is a good way to set up a new follower node quickly and without putting load on the primary, but there seems to be no way to run such a restore during a pg_auto_failover state transition.

Searching through the issues, it seems that the idea of generic hooks was shot down, which is a shame, but I can understand why. Are you open to having a hook just to prepare PGDATA on a non-master node, whether it is new or a former master being re-added to the cluster? In my case, ideally it would run after pg_rewind fails, but moving everything PGDATA-related into the hook and providing a default hook that runs pg_rewind followed by pg_basebackup would be fine as well.
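For illustration only, a default hook of the kind proposed above might look roughly like the Python sketch below. pg_auto_failover has no such hook; the function names, the `runner` parameter, and the optional `fallback` command are all hypothetical. The sketch tries pg_rewind first and falls back to pg_basebackup (or to a caller-supplied restore command) only when the rewind fails.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def rewind_cmd(pgdata: str, source: str) -> List[str]:
    # pg_rewind resynchronizes an existing data directory with the source server.
    return ["pg_rewind", f"--target-pgdata={pgdata}", f"--source-server={source}"]


def basebackup_cmd(pgdata: str, source: str) -> List[str]:
    # pg_basebackup takes a fresh copy and writes the standby configuration.
    # Note: it needs an empty target directory; clearing or moving aside the
    # old PGDATA is deliberately left out of this sketch.
    return [
        "pg_basebackup",
        f"--pgdata={pgdata}",
        f"--dbname={source}",
        "--wal-method=stream",
        "--write-recovery-conf",
    ]


def prepare_pgdata(
    pgdata: str,
    source: str,
    runner: Runner = run,
    fallback: Optional[Sequence[str]] = None,
) -> str:
    """Hypothetical hook: return the name of the tool that prepared PGDATA."""
    if runner(rewind_cmd(pgdata, source)) == 0:
        return "pg_rewind"
    # Rewind failed: use the replacement command if one was given,
    # otherwise fall back to pg_basebackup.
    cmd = list(fallback) if fallback else basebackup_cmd(pgdata, source)
    if runner(cmd) != 0:
        raise RuntimeError(f"{cmd[0]} failed to prepare {pgdata}")
    return cmd[0]
```

Passing a restore-from-archive command as `fallback` is where a pg_basebackup alternative would plug in under this assumed design.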

DimCitus commented 2 years ago

Hi @redbaron, thanks for your interest in pg_auto_failover. For the first-time initialization of a node, you can populate PGDATA any way you want and then register the node using the pg_autoctl create postgres command. See https://pg-auto-failover.readthedocs.io/en/master/ref/pg_autoctl_create_postgres.html#description for more information about that approach.
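As a rough sketch of that two-step approach (restore PGDATA first, then register the node), something like the following Python wrapper could work. The helper names, the monitor URI, and the choice of `--auth trust` with `--ssl-self-signed` are placeholders assumed for illustration; the restore command is whatever backup tool you use, and the linked documentation remains the reference for which existing PGDATA directories pg_autoctl accepts.

```python
import subprocess
from typing import Callable, List, Sequence


def register_cmd(pgdata: str, monitor_uri: str) -> List[str]:
    # Registers the node with the monitor, using the PGDATA given here.
    # The auth and SSL settings below are assumptions; adjust to your setup.
    return [
        "pg_autoctl", "create", "postgres",
        "--pgdata", pgdata,
        "--monitor", monitor_uri,
        "--auth", "trust",
        "--ssl-self-signed",
    ]


def bootstrap_node(
    pgdata: str,
    monitor_uri: str,
    restore_cmd: Sequence[str],
    runner: Callable[[Sequence[str]], object] = subprocess.check_call,
) -> List[List[str]]:
    """Hypothetical helper: restore PGDATA, then register the node."""
    restore = list(restore_cmd)
    runner(restore)   # step 1: populate PGDATA from a backup
    register = register_cmd(pgdata, monitor_uri)
    runner(register)  # step 2: hand the directory over to pg_autoctl
    return [restore, register]
```

With the default runner, a failing restore raises `subprocess.CalledProcessError` before the registration step is attempted, so a half-restored directory is never registered.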

Then, for a failover where pg_rewind fails, we need integration with WAL archiving. We could go the hooks/plugin route, but I believe we should implement a proper WAL archiving architecture and make that setup easy in pg_auto_failover. See https://github.com/citusdata/pg_auto_failover/pull/819, where I have started a WAL-G integration. The first PR of the series covers only WAL files; a follow-up PR will cover base backups and using WAL-G to fetch the whole PGDATA.

I'm leaning towards closing this issue now, because one of the two cases is already supported and the general approach is a work in progress. Feel free to re-open with details if you feel differently, of course.