Hi @gfphoenix78; thanks for the proposal. I can only imagine your proposal making sense right now when all the Postgres instances are running on the same host, where you can start all processes locally and can use the PGDATA of any node as a local directory.
I fail to see how that makes sense when each Postgres instance is running on a separate node. In particular your step 2: where should that be running from? Do you want to start the primary from the monitor node itself?
@DimCitus Thank you for your reply. Let me describe it in more detail.
In our cluster, the PG instances are deployed on separate hosts. Each host runs an SSH server and the hosts trust each other, so they can run commands remotely via ssh. The workflow becomes:

1. Start the monitor with the pgdata of the monitor.
2. Query the monitor for the host and pgdata of the primary.
3. ssh dba@primary_host pg_autoctl run --pgdata primary_pgdata

In short, we may start the cluster by only providing the host, port, and pgdata of the monitor. The monitor keeps track of which node is the real/active primary after a promotion.
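To make that orchestration concrete, here is a minimal sketch of such a startup script. It is not how pg_auto_failover works today: it assumes the proposed pgdata column exists in pgautofailover.node, and the host names, paths, role name, and the reportedstate column used below are assumptions for illustration.

```bash
#!/bin/bash
# Sketch only: assumes the PROPOSED pgdata column on pgautofailover.node, which does
# not exist in pg_auto_failover today. Host names and paths are example values.
set -euo pipefail

MONITOR_HOST=monitor.example.com   # example value
MONITOR_PGDATA=/data/monitor       # example value

# 1. Start the monitor on its own host (in production this would be a systemd service).
ssh dba@"$MONITOR_HOST" "nohup pg_autoctl run --pgdata $MONITOR_PGDATA > monitor.log 2>&1 &"

# 2. Ask the monitor which node is currently the primary and where its PGDATA lives.
#    The reportedstate column name and the autoctl_node role are assumptions to
#    double-check against your installation.
read -r PRIMARY_HOST PRIMARY_PGDATA < <(
  psql "host=$MONITOR_HOST dbname=pg_auto_failover user=autoctl_node" -At -F ' ' -c \
    "SELECT nodehost, pgdata FROM pgautofailover.node WHERE reportedstate = 'primary'"
)

# 3. Start the primary first, then the secondaries, each with its own PGDATA.
ssh dba@"$PRIMARY_HOST" "nohup pg_autoctl run --pgdata $PRIMARY_PGDATA > node.log 2>&1 &"
```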
Thanks for adding the details. I see, you're writing your own orchestration mechanism and you're using SSH to communicate between nodes and start remote commands. While that might be a good way to implement your infrastructure, this is out of scope for pg_auto_failover: we only use the Postgres protocol and the Postgres tools. No need to open ssh in between nodes for pg_auto_failover to work.
Hi @DimCitus,
I'm a little late to the party.
Capturing the pgdata directory for each node in any autofailover cluster does have the benefit of being able to bring up a data node from the monitor itself, or from any other node (after querying the monitor to get all the relevant information about the data node), without having to query the data node in a non-standard way.
Currently, there is no central repository for this information - a pg_autofailover operator would have to rely on some form of non-standard storage for that information on the data node itself and a facility to query it. That may or may not be easy in certain environments (e.g. this might be easy in a k8s like environment where we would have a facility to grab PGDATA's value from a pod's environment).
Since the pgdata directory is something that must be provided at node registration time, it would be easy to store this information. There is certainly no chance that the value can drift out of sync with reality (as a node's pgdata can't change).
I think having this info in the node table, and having it exposed via pg_autoctl state commands, would go one step towards simplifying operations. It would give an operator the complete coordinates of a data node, especially if it is not live at the moment: (nodeid, nodehost, nodeport, pgdata). The coordinates can be useful at any time for running other cluster management / observability operations on a remote node's PGDATA.
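For comparison, here is a hedged sketch of what an operator can already pull from the monitor today. The nodeid, nodehost, and nodeport columns are the ones named in this thread; the connection details are example values, and pgdata is exactly the piece that is missing.

```bash
# Coordinates available from the monitor today (connection string values are examples);
# the proposed pgdata column would complete the (nodeid, nodehost, nodeport, pgdata) tuple.
psql "host=monitor.example.com dbname=pg_auto_failover" -c \
  "SELECT nodeid, nodehost, nodeport FROM pgautofailover.node;"
```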
Hi @soumyadeep2007, exposing pgdata via pg_autoctl status commands seems strange, because pg_autoctl requires pgdata as its input argument, is it correct?
Hi, yes @gfphoenix78, this is correct. At the moment the whole design of pg_autoctl follows the design of Postgres itself in that respect: you must know PGDATA to use any Postgres server-side command, and the rest is found within PGDATA.
At the moment, for instance, pg_autoctl uses PGDATA to compute the path of the pg_autoctl configuration file and state file, reads the monitor URI from the configuration file, and reads the node id and group id from the state file. How are you proposing that we find/manage node/group ids to be able to talk to the monitor and retrieve PGDATA? What if the monitor is down, can we proceed and run Postgres locally anyway with pg_autoctl?
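As a quick illustration of that PGDATA-driven design; the subcommands below are believed to be current pg_autoctl commands, but verify them against your pg_auto_failover version, and the path is an example:

```bash
PGDATA=/data/node1   # example path

# The pg_autoctl configuration and state file locations are computed from PGDATA.
pg_autoctl show file --pgdata "$PGDATA"

# The monitor URI is then read from that configuration file.
pg_autoctl config get pg_autoctl.monitor --pgdata "$PGDATA"
```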
> How are you proposing that we find/manage node/group ids to be able to talk to the monitor and retrieve PGDATA? What if the monitor is down, can we proceed and run Postgres locally anyway with pg_autoctl?
Starting the cluster from the monitor is the preferred way. If the monitor is down, the workflow above raises an error and we need to explicitly tell the startup script not to query info from its monitor until the monitor is up or reset.
> Starting the cluster from the monitor is the preferred way. If the monitor is down, the workflow above raises an error and we need to explicitly tell the startup script not to query info from its monitor until the monitor is up or reset.
That proposal looks like a step backwards to me. If we add service dependencies, we end up with more cases where nothing works, compared to properly handling errors and having a retry loop and other mechanisms to make our approach robust, as we have now.
I'm sorry, but I just don't understand how handling nodes' PGDATA on the monitor would help improve anything we have. And I understand that you want a central way to manage PGDATA, but for this to be handled by pg_auto_failover means a complete re-design of an important part of the tool. I don't see the benefits of doing that as being worth the effort. If you think the effort is worthwhile, please be my guest and try it. Then we will see the result and might have a chat about whether this should be the new way for pg_auto_failover.
> Hi @soumyadeep2007, exposing pgdata via pg_autoctl status commands seems strange, because pg_autoctl requires pgdata as its input argument, is it correct?
Yes, for this scheme to be successful, all the operator (human/program) would need to know is the monitor's pgdata.
> At the moment, for instance, pg_autoctl uses PGDATA to compute the path of the pg_autoctl configuration file and state file, reads the monitor URI from the configuration file, and reads the node id and group id from the state file. How are you proposing that we find/manage node/group ids to be able to talk to the monitor and retrieve PGDATA? What if the monitor is down, can we proceed and run Postgres locally anyway with pg_autoctl?
With the knowledge of what the monitor's pgdata is, we can get the pgdata dirs of the entire cluster (with the state command). Then the operator would be able to parse the output (let's assume JSON for convenience); for example, by looking at the output the operator can determine which node was the primary and can start up that node first.
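A minimal sketch of that operator-side parsing, assuming the monitor's pgdata is known. Note that pg_autoctl show state --json exists today, but the pgdata field is the proposed addition, and the exact JSON key names used here are assumptions:

```bash
MONITOR_PGDATA=/data/monitor   # example path

# Find the primary's (hypothetical) pgdata from the monitor's JSON state output.
# The 'current_state' and 'pgdata' keys are assumptions: pgdata is the proposed field,
# and key names may differ between pg_auto_failover versions.
PRIMARY_PGDATA=$(
  pg_autoctl show state --pgdata "$MONITOR_PGDATA" --json \
    | jq -r '.[] | select(.current_state == "primary") | .pgdata'
)
echo "primary pgdata: $PRIMARY_PGDATA"
```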
As for the monitor being down, IIRC, today, if the monitor is down and we execute pg_autoctl run on a data node, won't the data node complain that the monitor isn't up, and thus fail to start?
To be clear, capturing the pgdata at registration time and exposing it like other things such as hostname, port, etc. is the only change necessary to pg_auto_failover. I don't think there is any other management involved in @gfphoenix78's proposal. @gfphoenix78, please correct me if I'm wrong.
@soumyadeep2007
> Yes, for this scheme to be successful, all the operator (human/program) would need to know is the monitor's pgdata.
The pgdata is the data directory of the node that pg_autoctl operates on, not always the monitor's pgdata.
> As for the monitor being down, IIRC, today, if the monitor is down and we execute pg_autoctl run on a data node, won't the data node complain that the monitor isn't up, and thus fail to start?
Yes, when the monitor is down, the node can't be brought up via pg_autoctl run.
Exposing the pgdata, like hostname and port, just gives us more detailed info about the node, so we can have flexible ways to manage the cluster.
> Exposing the pgdata, like hostname and port, just gives us more detailed info about the node, so we can have flexible ways to manage the cluster.
We don't expose --pghost nor --listen either. I still fail to understand how you can use a remote PGDATA at a distance. Well, I see that it needs a full non-interactive ssh setup in place, and I don't want pg_auto_failover to depend on that. At the moment pg_auto_failover works with just the Postgres protocol; there is no need to set up ssh between pg_auto_failover nodes, and I want to continue this way.
If all you need to script around is exposing per-node PGDATA, we could of course accept your PR that implements that; I don't see any harm in doing so. Review the --pghost and --listen arguments too, and maybe even --username while at it.
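For reference, these per-node settings are all supplied at registration time already. A sketch, where the option names are believed to be real pg_autoctl create postgres flags but the values are example assumptions:

```bash
# Registration already carries the per-node details being discussed; values are examples.
pg_autoctl create postgres \
  --pgdata   /data/node1 \
  --pghost   /tmp \
  --listen   '*' \
  --username postgres \
  --hostname node1.example.com \
  --monitor  'postgres://autoctl_node@monitor.example.com/pg_auto_failover'
```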
@DimCitus Yes, all I need is to expose per-node PGDATA, to allow for flexible startup scripts.
One more question: how do you determine the startup sequence of the primary and secondary nodes? If a secondary starts up earlier, will one of the secondaries be promoted just because the primary is slower to start?
Context: The monitor has some basic metadata about the nodes registered to the monitor. However, we can't get the pgdata of a specific node from the monitor, so we have to explicitly pass the pgdata to start/stop/remove the node. When starting the cluster, we have to pass the pgdata to start the primary and the secondary, and we start the primary before the secondary. The problem is that the monitor has records of who is the primary, but we also have to pass the pgdata of the primary to start the cluster, which is weird.

Proposal: The relation pgautofailover.node adds a new field pgdata to record the pgdata of the PG instance. Now, the workflow to start the cluster becomes:

1. Start the monitor, providing the pgdata of the monitor.
2. Query the monitor for the host and pgdata of the primary.
3. Start the primary with that pgdata.

Does it make sense?
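To visualize the proposal, here is a hedged sketch of the monitor-side change only. This is not an actual migration shipped with pg_auto_failover, and the node registration path would also need to accept and store the value:

```bash
# Proposed change, sketched directly against the monitor database: add a pgdata column
# so that node registration (pg_autoctl create postgres --pgdata ...) could record it.
psql -d pg_auto_failover -c \
  "ALTER TABLE pgautofailover.node ADD COLUMN IF NOT EXISTS pgdata text;"
```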