signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

consul ha #253

Open tangweichun opened 6 years ago

tangweichun commented 6 years ago

Hi~ This is a simple topology:

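[Image: topology diagram] https://user-images.githubusercontent.com/29778335/45603053-9e45d600-ba5a-11e8-96cd-1d0bb6ebb235.png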

If replication-manager goes down, it's not a big deal. But if the Consul client crashes or is stopped, none of the applications will work, because they can't resolve the service names from the Consul server.

As far as I know, replication-manager still doesn't support a remote Consul cluster, and when the local Consul client goes down, replication-manager automatically unregisters its services.

So even if Consul and replication-manager are both HA, the Consul client that runs alongside replication-manager is still a single point of failure. If I use replication-manager to manage tens or hundreds of MySQL replication setups and one day the Consul client goes down, boom!

Is there anything that can help with this issue?

svaroqui commented 6 years ago

Yes,

Just set up a Corosync failover cluster with the Consul agent + replication-manager. Corosync handles multi-resource failover…

/svar


Stéphane Varoqui, VP of Products. Phone: +33 695-926-401, Skype: svaroqui. https://signal18.io/

tangweichun commented 6 years ago

Thanks! It's too complex!

svaroqui commented 6 years ago

Re,

"until now replication-manager still not support remote consul cluster": using the local agent versus the Consul API would not change anything about this issue. The local agent is a member of the whole Consul cluster, so talking to it is exactly like talking to the full cluster through an API. If the agent can drop out of the cluster, an equivalent Consul API connection can as well.

In this case, yes, the DNS content and the DB topology can diverge. That can be addressed by comparing the topology with the DNS content to check that they really match. Is that what you are worried about?
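The comparison described above could look roughly like the following. This is only a minimal sketch, not a replication-manager feature: the `expected` mapping (service name to the addresses the current DB topology should expose) is assumed to be supplied by the caller, and the function names are hypothetical. The service names and addresses in the usage note come from the nslookup output posted in this thread.

```python
import socket
from typing import Callable, Dict, Iterable, Set


def resolve_with_system_dns(name: str) -> Set[str]:
    """Resolve `name` to a set of IPv4 addresses using the host's resolver."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()  # NXDOMAIN or resolver unreachable: nothing resolved
    return {info[4][0] for info in infos}


def find_dns_divergence(
    expected: Dict[str, Iterable[str]],
    resolve: Callable[[str], Set[str]] = resolve_with_system_dns,
) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per service name, addresses missing from or unexpected in DNS.

    `expected` is an assumption of this sketch: the caller builds it from
    whatever source describes the current DB topology.
    """
    report: Dict[str, Dict[str, Set[str]]] = {}
    for name, addresses in expected.items():
        wanted = set(addresses)
        found = resolve(name)
        if wanted != found:
            report[name] = {
                "missing": wanted - found,      # in topology, not in DNS
                "unexpected": found - wanted,   # in DNS, not in topology
            }
    return report
```

For example, calling `find_dns_divergence({"write_mysql57.service.consul": ["172.17.5.101"]})` returns an empty dict when DNS matches, and an entry listing `172.17.5.101` as missing when the name does not resolve. The resolver is injectable so the check can be pointed at something other than the system resolver.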

In 2.1 we do have an active/passive solution called the arbitrator, which decides which replication-manager is active and which is passive. But it remains less advanced than long-established heartbeat solutions such as Corosync or OpenSVC, where the heartbeat can be set up with STONITH scripts or spread across multiple channels, from HTTP to TCP to UDP to shared disk (OpenSVC only).

/svar


tangweichun commented 6 years ago

Yes, I'm worried about the single point of failure: if the local Consul agent is down, everything is down. That's terrible in production.

tangweichun commented 6 years ago

I'm learning about corosync...

tangweichun commented 6 years ago

By single point of failure I mean the case where replication-manager is alive and only the local Consul agent is down.

svaroqui commented 6 years ago


Hmm, why would everything be down? All the other servers can still talk to the Consul cluster and will keep the same server resolution as before, which is itself an issue, since replication-manager may have changed the DB topology in the meantime. So what would be more relevant is to not fail over if Consul is down, correct?
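The idea of skipping failover while Consul is down could be sketched as a guard like the one below. This is an illustration under stated assumptions, not existing replication-manager behaviour: `do_failover` is a hypothetical callback, the agent address `http://127.0.0.1:8500` is the usual local default and may differ in a given setup, and health is judged only by whether the agent's `/v1/status/leader` endpoint answers with a non-empty leader.

```python
import json
import urllib.error
import urllib.request
from typing import Callable


def consul_has_leader(base_url: str = "http://127.0.0.1:8500", timeout: float = 2.0) -> bool:
    """Return True if the Consul agent answers and reports a cluster leader."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/status/leader", timeout=timeout) as resp:
            leader = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return False  # agent unreachable, timed out, or returned invalid JSON
    # The endpoint returns a JSON string; an empty string means no leader.
    return isinstance(leader, str) and leader != ""


def failover_if_consul_healthy(
    do_failover: Callable[[], None],
    consul_ok: Callable[[], bool] = consul_has_leader,
) -> bool:
    """Run `do_failover` (hypothetical callback) only if Consul looks healthy.

    Returns True if the failover ran, False if it was skipped.
    """
    if not consul_ok():
        return False
    do_failover()
    return True
```

The trade-off is that a dead local agent then blocks failover entirely, so a guard like this only makes sense where stale DNS is considered worse than a delayed failover.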


tangweichun commented 6 years ago

Normal status:

[root@orabackup /root]$ date&&nslookup write_mysql57.service.consul
Mon Sep 17 15:44:42 CST 2018
Server: 172.17.11.242
Address: 172.17.11.242#53

Name: write_mysql57.service.consul
Address: 172.17.5.101

[root@orabackup /root]$ date&&nslookup read_mysql57.service.consul
Mon Sep 17 15:44:56 CST 2018
Server: 172.17.11.242
Address: 172.17.11.242#53

Name: read_mysql57.service.consul
Address: 172.17.5.201
Name: read_mysql57.service.consul
Address: 172.17.11.242

After I stop the local Consul agent:

[root@orabackup /root]$ date&&nslookup write_mysql57.service.consul
Mon Sep 17 15:49:19 CST 2018
Server: 172.17.11.242
Address: 172.17.11.242#53

** server can't find write_mysql57.service.consul: NXDOMAIN

[root@orabackup /root]$ date&&nslookup read_mysql57.service.consul
Mon Sep 17 15:49:24 CST 2018
Server: 172.17.11.242
Address: 172.17.11.242#53

** server can't find read_mysql57.service.consul: NXDOMAIN

Can't get the services from Consul DNS.

tangweichun commented 6 years ago

"As far as I know, replication-manager still doesn't support a remote Consul cluster, and when the local Consul client goes down, replication-manager automatically unregisters its services."

It seems the services are unregistered when the local Consul agent is stopped.

svaroqui commented 6 years ago


It's the local DNS resolution that is blocked. Resolution should still work from other nodes whose DNS points to your Consul cluster.
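One way a client could avoid depending on a single agent is to ask several Consul agents in turn. The sketch below is an illustration only and is not something proposed in this thread: it uses Consul's HTTP health endpoint (`/v1/health/service/<name>?passing`) rather than DNS, and the agent URLs, the function names, and the registered service name are all assumptions.

```python
import json
import urllib.request
from typing import Any, Callable, Iterable, List


def fetch_json(url: str, timeout: float = 2.0) -> Any:
    """GET `url` and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def healthy_addresses(
    service: str,
    agents: Iterable[str],
    fetch: Callable[[str], Any] = fetch_json,
) -> List[str]:
    """Return addresses of passing instances of `service`.

    `agents` is a list of Consul HTTP base URLs (hypothetical in this sketch).
    They are tried in order; the first agent that answers wins.
    """
    last_error = None
    for base_url in agents:
        try:
            entries = fetch(f"{base_url}/v1/health/service/{service}?passing")
        except (OSError, ValueError) as exc:
            last_error = exc  # agent down or bad response: try the next one
            continue
        # Service.Address may be empty, in which case the node address applies.
        return [e["Service"]["Address"] or e["Node"]["Address"] for e in entries]
    raise RuntimeError(f"no Consul agent reachable for {service!r}") from last_error
```

A call such as `healthy_addresses("write_mysql57", ["http://172.17.11.242:8500", "http://<another-agent>:8500"])` would fall through to the second agent when the first is stopped. The placeholder second address and the service name `write_mysql57` are assumptions here.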
