Hi @gfphoenix78; thanks for the proposal. I can only imagine your proposal making sense right now when all the Postgres instances are running on the same host, where you can start all processes locally and can use the PGDATA of any node as a local directory.
I fail to see how that makes sense when each Postgres instance is running on a separate node. In particular your step 2: where should that be running from? Do you want to start the primary from the monitor node itself?
@DimCitus Thank you for your reply. Let me describe it in more detail.
In our cluster, the PG instances are deployed on separate hosts. Each host runs an SSH server and the hosts trust each other, so they can run commands remotely via ssh. The workflow becomes:

1. Start the monitor with the pgdata of the monitor.
2. Query the monitor for the host and pgdata of the primary.
3. ssh dba@primary_host pg_autoctl run --pgdata primary_pgdata

In short, we may start the cluster by only providing the host, port, and pgdata of the monitor. The monitor keeps track of which node is the real/active primary after a promotion.
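To make that orchestration concrete, here is a minimal sketch of such a startup script. It is not how pg_auto_failover works today: it assumes the proposed pgdata column exists in pgautofailover.node, and the host names, paths, role name, and the reportedstate column used below are assumptions for illustration.

```bash
#!/bin/bash
# Sketch only: assumes the PROPOSED pgdata column on pgautofailover.node, which does
# not exist in pg_auto_failover today. Host names and paths are example values.
set -euo pipefail

MONITOR_HOST=monitor.example.com   # example value
MONITOR_PGDATA=/data/monitor       # example value

# 1. Start the monitor on its own host (in production this would be a systemd service).
ssh dba@"$MONITOR_HOST" "nohup pg_autoctl run --pgdata $MONITOR_PGDATA > monitor.log 2>&1 &"

# 2. Ask the monitor which node is currently the primary and where its PGDATA lives.
#    The reportedstate column name and the autoctl_node role are assumptions to
#    double-check against your installation.
read -r PRIMARY_HOST PRIMARY_PGDATA < <(
  psql "host=$MONITOR_HOST dbname=pg_auto_failover user=autoctl_node" -At -F ' ' -c \
    "SELECT nodehost, pgdata FROM pgautofailover.node WHERE reportedstate = 'primary'"
)

# 3. Start the primary first, then the secondaries, each with its own PGDATA.
ssh dba@"$PRIMARY_HOST" "nohup pg_autoctl run --pgdata $PRIMARY_PGDATA > node.log 2>&1 &"
```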
Thanks for adding the details. I see, you're writing your own orchestration mechanism and you're using SSH to communicate between nodes and start remote commands. While that might be a good way to implement your infrastructure, this is out of scope for pg_auto_failover: we only use the Postgres protocol and the Postgres tools. No need to open ssh in between nodes for pg_auto_failover to work.
Hi @DimCitus,
I'm a little late to the party.
Capturing the pgdata directory for each node in any autofailover cluster does have the benefit of being able to bring up a data node from the monitor itself, or from any other node (after querying the monitor to get all the relevant information about the data node), without having to query the data node in a non-standard way.
Currently, there is no central repository for this information - a pg_autofailover operator would have to rely on some form of non-standard storage for that information on the data node itself and a facility to query it. That may or may not be easy in certain environments (e.g. this might be easy in a k8s like environment where we would have a facility to grab PGDATA's value from a pod's environment).
Since the pgdata directory is something that must be provided at node registration time, it would be easy to store this information. There is certainly no chance that the value can drift out of sync with reality (as a node's pgdata can't change).
I think having this info in the node table, and having it exposed via pg_autoctl state commands, would go one step towards simplifying operations. It would give an operator the complete coordinates of a data node, especially if it is not live at the moment: (nodeid, nodehost, nodeport, pgdata). The coordinates can be useful at any time for running other cluster management / observability operations on a remote node's PGDATA.
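For comparison, here is a hedged sketch of what an operator can already pull from the monitor today. The nodeid, nodehost, and nodeport columns are the ones named in this thread; the connection details are example values, and pgdata is exactly the piece that is missing.

```bash
# Coordinates available from the monitor today (connection string values are examples);
# the proposed pgdata column would complete the (nodeid, nodehost, nodeport, pgdata) tuple.
psql "host=monitor.example.com dbname=pg_auto_failover" -c \
  "SELECT nodeid, nodehost, nodeport FROM pgautofailover.node;"
```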
Hi @soumyadeep2007, exposing pgdata via pg_autoctl status commands seems strange, because pg_autoctl requires pgdata as its input argument, is it correct?
Hi, yes @gfphoenix78, this is correct. At the moment the whole design of pg_autoctl follows the design of Postgres itself in that respect: you must know PGDATA to use any Postgres server-side command, and the rest is found within PGDATA.
At the moment, for instance, pg_autoctl uses PGDATA to compute the path of the pg_autoctl configuration file and state file, reads the monitor URI from the configuration file, and reads the node id and group id from the state file. How are you proposing that we find/manage node/group ids to be able to talk to the monitor and retrieve PGDATA? What if the monitor is down, can we proceed and run Postgres locally anyway with pg_autoctl?
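As a quick illustration of that PGDATA-driven design; the subcommands below are believed to be current pg_autoctl commands, but verify them against your pg_auto_failover version, and the path is an example:

```bash
PGDATA=/data/node1   # example path

# The pg_autoctl configuration and state file locations are computed from PGDATA.
pg_autoctl show file --pgdata "$PGDATA"

# The monitor URI is then read from that configuration file.
pg_autoctl config get pg_autoctl.monitor --pgdata "$PGDATA"
```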
> How are you proposing that we find/manage node/group ids to be able to talk to the monitor and retrieve PGDATA? What if the monitor is down, can we proceed and run Postgres locally anyway with pg_autoctl?
Starting the cluster from the monitor is the preferred way. If the monitor is down, the workflow above raises an error and we need to explicitly tell the startup script not to query info from its monitor until the monitor is up or reset.
> Starting the cluster from the monitor is the preferred way. If the monitor is down, the workflow above raises an error and we need to explicitly tell the startup script not to query info from its monitor until the monitor is up or reset.
That proposal looks like a step backwards to me. If we add service dependencies, we end up with more cases where nothing works, compared to properly handling errors and having a retry loop and other mechanisms to make our approach robust, as we have now.
I'm sorry, but I just don't understand how handling nodes' PGDATA on the monitor would help improve anything we have. And I understand that you want a central way to manage PGDATA, but for this to be handled by pg_auto_failover means a complete re-design of an important part of the tool. I don't see the benefits of doing that as being worth the effort. If you think the effort is worthwhile, please be my guest and try it. Then we will see the result and might have a chat about whether this should be the new way for pg_auto_failover.
> Hi @soumyadeep2007, exposing pgdata via pg_autoctl status commands seems strange, because pg_autoctl requires pgdata as its input argument, is it correct?
Yes, for this scheme to be successful, all the operator (human/program) would need to know is the monitor's pgdata.
> At the moment, for instance, pg_autoctl uses PGDATA to compute the path of the pg_autoctl configuration file and state file, reads the monitor URI from the configuration file, and reads the node id and group id from the state file. How are you proposing that we find/manage node/group ids to be able to talk to the monitor and retrieve PGDATA? What if the monitor is down, can we proceed and run Postgres locally anyway with pg_autoctl?
With the knowledge of what the monitor's pgdata is, we can get the pgdata dirs of the entire cluster (with the state command). Then the operator would be able to parse the output (let's assume JSON for convenience); for example, by looking at the output the operator can determine which node was the primary and can start up that node first.
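A minimal sketch of that operator-side parsing, assuming the monitor's pgdata is known. Note that pg_autoctl show state --json exists today, but the pgdata field is the proposed addition, and the exact JSON key names used here are assumptions:

```bash
MONITOR_PGDATA=/data/monitor   # example path

# Find the primary's (hypothetical) pgdata from the monitor's JSON state output.
# The 'current_state' and 'pgdata' keys are assumptions: pgdata is the proposed field,
# and key names may differ between pg_auto_failover versions.
PRIMARY_PGDATA=$(
  pg_autoctl show state --pgdata "$MONITOR_PGDATA" --json \
    | jq -r '.[] | select(.current_state == "primary") | .pgdata'
)
echo "primary pgdata: $PRIMARY_PGDATA"
```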
As for the monitor being down, IIRC, today, if the monitor is down and we execute pg_autoctl run on a data node, won't the data node complain that the monitor isn't up, and thus fail to start?
To be clear, capturing the pgdata at registration time and exposing it like other things such as hostname, port, etc. is the only change necessary to pg_auto_failover. I don't think there is any other management involved in @gfphoenix78's proposal. @gfphoenix78, please correct me if I'm wrong.
@soumyadeep2007
> Yes, for this scheme to be successful, all the operator (human/program) would need to know is the monitor's pgdata.
The pgdata is the data directory of the node that pg_autoctl operates on, not always the monitor's pgdata.
> As for the monitor being down, IIRC, today, if the monitor is down and we execute pg_autoctl run on a data node, won't the data node complain that the monitor isn't up, and thus fail to start?
Yes, when the monitor is down, the node can't be brought up via pg_autoctl run.
Exposing the pgdata, like hostname and port, just gives us more detailed info about the node, so we can have flexible ways to manage the cluster.
> Exposing the pgdata, like hostname and port, just gives us more detailed info about the node, so we can have flexible ways to manage the cluster.
We don't expose --pghost nor --listen either. I still fail to understand how you can use a remote PGDATA at a distance. Well, I see that it needs a full non-interactive ssh setup in place, and I don't want pg_auto_failover to depend on that. At the moment pg_auto_failover works with just the Postgres protocol; there is no need to set up ssh between pg_auto_failover nodes, and I want to continue this way.
If all you need to script around is exposing per-node PGDATA, we could of course accept your PR that implements that; I don't see any harm in doing so. Review the --pghost and --listen arguments too, and maybe even --username while at it.
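For reference, these per-node settings are all supplied at registration time already. A sketch, where the option names are believed to be real pg_autoctl create postgres flags but the values are example assumptions:

```bash
# Registration already carries the per-node details being discussed; values are examples.
pg_autoctl create postgres \
  --pgdata   /data/node1 \
  --pghost   /tmp \
  --listen   '*' \
  --username postgres \
  --hostname node1.example.com \
  --monitor  'postgres://autoctl_node@monitor.example.com/pg_auto_failover'
```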
@DimCitus Yes, all I need is to expose per-node PGDATA, to allow for flexible startup scripts.
One more question: how do you determine the startup sequence of the primary and secondary nodes? If a secondary starts up earlier, will one of the secondaries be promoted just because the primary is slower to start?
Context: The monitor has some basic metadata about the nodes registered to the monitor. However, we can't get the pgdata of a specific node from the monitor, so we have to explicitly pass the pgdata to start/stop/remove the node. When starting the cluster, we have to pass the pgdata to start the primary and the secondary, and we start the primary before the secondary. The problem is that the monitor has records of who is the primary, but we also have to pass the pgdata of the primary to start the cluster, which is weird.

Proposal: The relation pgautofailover.node adds a new field pgdata to record the pgdata of the PG instance. Now, the workflow to start the cluster becomes:

1. Start the monitor, providing the pgdata of the monitor.
2. Query the monitor for the host and pgdata of the primary.
3. Start the primary with that pgdata.

Does it make sense?
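To visualize the proposal, here is a hedged sketch of the monitor-side change only. This is not an actual migration shipped with pg_auto_failover, and the node registration path would also need to accept and store the value:

```bash
# Proposed change, sketched directly against the monitor database: add a pgdata column
# so that node registration (pg_autoctl create postgres --pgdata ...) could record it.
psql -d pg_auto_failover -c \
  "ALTER TABLE pgautofailover.node ADD COLUMN IF NOT EXISTS pgdata text;"
```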