powa-collector failover to the primary repository DB server

hrawulwa commented 4 years ago

I have a remote setup, with powa-collector and powa-web running on the repository DB server, which monitors multiple remote DB servers. I'm planning to set up high availability for the repository DB server, so that if the primary goes down, it fails over to a secondary repository DB server. I can take care of the DB part. My question is: how do we get powa-collector to also fail over to the new primary repository DB server? Is there any service-style feature in powa-collector that recognizes which server is the primary and runs from there? Or do I need to come up with a custom script that checks for the current primary and launches the collector process from that server? Please advise what options we have for failing over the collector process.

Thanks Hari

rjuju commented 4 years ago

Hello,

Well, are you talking about having powa-collector in high availability, or having it handle / recover from a repository DB failover?

hrawulwa commented 4 years ago

The repository DB server will be in an active-passive setup, in read-write mode on only one server (the active site). When the active site goes down, the repository DB on the passive site will become active (read-write). In this scenario, I would like powa-collector to run from the new active site. I hope that answers your question.

Thanks Hari

rjuju commented 4 years ago

I'm sorry it's not entirely clear. Does it mean that you want to have powa-collector installed on the same server as the repository server (so twice), and make sure that it's only started on the active node, or should powa-collector be installed on another server?

hrawulwa commented 4 years ago

Yes, correct, powa-collector will be installed on both the primary and slave repository servers, but it should be started only on the active node. Hope this clears it up.

Thanks Hari

rjuju commented 4 years ago

Thanks for the confirmation! Well, I guess the only way to do that depends on the HA solution you will be using. I'm assuming it has some way to handle multiple services and "resource colocation" to make sure that 2 services are always started on the same host. Which solution are you using for HA?

hrawulwa commented 4 years ago

I'm using the repmgr utility to configure a master-slave setup. When the master goes down, the repmgr daemon will automatically promote the standby as the new primary. In that scenario, I would also like powa-collector to start up on the new primary.

Thanks Hari

rjuju commented 4 years ago

I don't know much about repmgr, but does it give an option to run some script on promotion, which you could use to also start powa-collector? But what you actually want is also to shut down the other one if it's still active.

Also, I don't know how repmgr takes care of preventing split brain and that kind of thing, but does it provide some kind of extensible fencing facility to also stop additional services?

hrawulwa commented 4 years ago

Yes, repmgr calls a promote.sh script, which promotes the standby as primary. So I guess I could include another command in the same script to start powa-collector, after verifying that the promotion was successful. I was curious to know if collector can be defined as a service, which can be configured to only run on the active node. It looks like there is no such option, and I need to rely on the script to start it on the active node. Maybe I could use the same script to stop the collector process on the passive node if it's still running. I could also just leave collector running on the standby, as it will not be able to insert any data, since it will be a read-only database on the passive node. However, we will be seeing a lot of error messages in the logs that it is unable to write. Let me know your thoughts.
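
The extra step in the promotion script could look roughly like the sketch below. It is only an illustration under several assumptions: powa-collector is managed as a systemd unit named `powa-collector` (a hypothetical name), `psql` is on the PATH, the old primary is reachable over ssh, and the function and parameter names are made up for the example. It provides no fencing.

```python
import subprocess

SERVICE = "powa-collector"  # hypothetical systemd unit name


def is_in_recovery(dsn):
    """Return True if the server behind `dsn` is still a standby."""
    result = subprocess.run(
        ["psql", "-XAtq", "-d", dsn, "-c", "SELECT pg_is_in_recovery()"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip() == "t"


def after_promotion(local_dsn, old_primary_host,
                    in_recovery=is_in_recovery, run=subprocess.run):
    """Start the collector locally once the promotion is confirmed."""
    if in_recovery(local_dsn):
        # Promotion did not succeed: leave the collector alone.
        return False
    # Best effort only: the old primary may be down or unreachable.
    run(["ssh", "-o", "ConnectTimeout=5", old_primary_host,
         "systemctl", "stop", SERVICE], check=False)
    run(["systemctl", "start", SERVICE], check=True)
    return True
```

The promotion script would call something like `after_promotion("host=localhost dbname=powa", "old-primary")` as its last step; both arguments are placeholders.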

Thanks Hari

rjuju commented 4 years ago

> I was curious to know if collector can be defined as a service, which can be configured to only run on the active node.

This isn't a realistic option. Correctly handling that would require a lot of work, and that's the job of an HA tool, not the service itself. Your problem here is that you have a postgres-centric solution, so it's probably easier to set up as it's only handling postgres, but it can't really handle more complex setups.

> I could also just leave collector running in the Standby, as it will not be able to insert any data, since it will be Read-only database on the passive node. However, we will be seeing lot of error messages in the logs that it is unable to write.

I could certainly add some option to test if the configured connection points to a server in recovery, and in this case do nothing and check again every X seconds. But this would only solve the easy part of the problem, as you'll soon hit other problems:

- this assumes that you have a different IP/DNS name for each pg server in HA, so you can't use a vIP for the repository server (or at least you need an additional IP for each node, with postgres listening on it)
- this only handles the "easy" case where you have the old primary that is shut down or demoted, and the new one is promoted. If you end up with both nodes being primary (for instance if you have a network partition and you don't have fencing), this won't work. If the service stops for some reason, you need something else to restart it.

So really, if you want to do high availability for more than the postgres server itself, you should use a solution that can handle all your needs, otherwise you'll have something half-baked which probably won't do what you want.
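
For illustration, the "do nothing and check again every X seconds" idea could be sketched as below. This is not powa-collector's actual code: the function names, the `interval` parameter and the use of psycopg2 for the check are assumptions made for the example, and the caveats listed above (per-node address instead of a vIP, no protection against two primaries, nothing to restart the process) still apply.

```python
import time


def server_in_recovery(dsn):
    """Return True if the server behind `dsn` is a standby."""
    import psycopg2  # assumption: psycopg2 is installed

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0]
    finally:
        conn.close()


def wait_until_primary(dsn, interval=10.0,
                       in_recovery=server_in_recovery, sleep=time.sleep):
    """Block while the server is in recovery; return the number of checks made."""
    checks = 1
    while in_recovery(dsn):
        # Standby: do nothing, then look again after `interval` seconds.
        sleep(interval)
        checks += 1
    return checks
```

The `dsn` has to point at the node's own address. A connection error is not handled here and would simply propagate to the caller.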

ioguix commented 4 years ago

Hi,

Out of curiosity, did you consider relying on PITR backup only, with manual failover? If your RTO/RPO does not require HA, this would be much easier, more maintainable and safer than a complex HA cluster. If you really need a second node for faster manual failover, you can set up replication, but without automatic failover.

Regards,

hrawulwa commented 4 years ago

<<this only handles the "easy" case where you have the old primary that is shut down or demoted, and the new one is promoted. If you end up with both nodes being primary (for instance if you have a network partition and you don't have fencing), this won't work. If the service stops for some reason, you need something else to restart it.>>

I understand that this scenario would require manual intervention to make sure only one of the DB servers is in primary mode. So I believe I can have the script check whether the DB server is in recovery mode and, if so, do nothing. Of course there will be scenarios like the one you outlined, with both nodes being primary, which will need to be fixed manually.

hrawulwa commented 4 years ago

<<Out of curiosity, did you consider relying on PITR backup only with manual failover? If your RTO/RPO does not requires HA, this would be much more easier, maintainable and safer than a complex HA cluster. If you really need a second node for faster manual failover you can setup replication, but without automatic failover.>>

It is easier for us to provision a master-slave setup as part of our automation infrastructure. If I went with manual failover, as you suggest, I might as well manually shut down the collector process on the old primary and start it on the new primary. My objective here is to relocate the collector process when an automatic failover happens on the repository server. So, assuming that manual intervention is required for the scenarios you have outlined, do you concur that a simple script can be set up to run every X seconds, check whether the server is in read-write mode, and only then start up the collector process?
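
As a rough sketch of what such a periodic script would have to decide on each run (assuming, hypothetically, a systemd unit named `powa-collector`, and with all function names invented for the example):

```python
import subprocess

SERVICE = "powa-collector"  # hypothetical systemd unit name


def decide(in_recovery, collector_running):
    """Return the action to take: 'start', 'stop' or 'none'."""
    if in_recovery and collector_running:
        return "stop"   # standby: the collector could not write anyway
    if not in_recovery and not collector_running:
        return "start"  # read-write node without a collector
    return "none"


def collector_running(run=subprocess.run):
    """True if systemd reports the unit as active."""
    return run(["systemctl", "is-active", "--quiet", SERVICE]).returncode == 0


def reconcile(in_recovery, run=subprocess.run):
    """Apply the decision once; meant to be called every X seconds."""
    action = decide(in_recovery, collector_running(run))
    if action != "none":
        run(["systemctl", action, SERVICE], check=True)
    return action
```

The `in_recovery` value would come from running `SELECT pg_is_in_recovery()` against the local server. If both nodes end up primary, this would start a collector on each of them, so that case still needs manual intervention.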

Thanks Hari

ioguix commented 4 years ago

> So, assuming that manual intervention is required for the scenarios you have outlined, do you concur that a simple script can be setup to run every X seconds, to check if the server is in Read-write mode, and then only startup collector process?

Assuming manual intervention is required to fail over, I would just add a step to the procedure to start the collector. Keep it simple.

It comes down to how critical the service is, its required RTO/RPO, and how much complexity, maintenance burden and risk you are willing to accept to achieve the goal. The hardest part is not building and setting it up, it's maintaining it over the years.
