What are the plans for HA failover moving forward?

lizardfs / lizardfs

LizardFS is an Open Source Distributed File System licensed under GPLv3.

http://lizardfs.com

GNU General Public License v3.0

953 stars 187 forks source link

What are the plans for HA failover moving forward? #299

Open 4Dolio opened 9 years ago

4Dolio commented 9 years ago

In reference to this commit: https://github.com/lizardfs/lizardfs/commit/02747c012d33ce8e79d9b3cb6e080af8b46e3b8e#diff-25d902c24283ab8cfbac54dfa101ad31 ha-cluster: Drop pacemaker/corosync stack from 2015 Jun 2nd.

I only just now noticed that you are dropping these corosync/pacemaker specific "ha-cluster" scripts. I am using corosync+pacemaker for my cluster and am using your ocf metadataserver script. But I did not use your lizardfs-cluster-manager "wizard" script, rather I manually configured the corosync/pacemaker services and configurations. With that said, I do agree that they are not very user friendly. I'm still at a lose for how it is supposed to coldboot/bootstrap/initialize the lizard cluster, I just haphazardly manage to bring it online. I would love an explanation about how this was supposed to work.

I also notice that there is an error in the ocf metadataserver script which would prevent it from working properly without patching that error. Perhaps this was intentional in order to prevent the less motivated from using this method and getting into trouble. I'm not sure you would want me to share the fix...

I also learned that the way the Debian7 distributes corosync and pacemaker is not compatible with the way your ha-cluster scripts expect those services to behave and interact. So I have a set of sed commands which fix the stock corosync and pacemaker services such that they work as expected. I had intended to share these after verifying that they allow for the lizardfs-cluster-manager script to work as intended.

In my experience my corosync/pacemaker driven ha-cluster had been very reliable (with the bootstrapping exception). I can confidently reboot any member of the cluster at will without worrying about the effects, which are imperceptible. I'm still working on migrating two moosefs systems to lizardfs, but I would really like to have automatic failover, so for the moment I intend to continue to use corosync+pacemaker.

I understand your dropping this stuff, but it appears that you kept the ha-cluster-managed personality for the master service intact? Which I expect needs to stay in place regardless of which type of ha cluster technology you end up using. I sort of expected LizardFS to remain semi failover technique agnostic such that one might use carp, or corosync/pacemaker, or any number of raft implementations. Can you speak to that point? I see mention of uraft, but would appreciate pointers to exactly what that is.

psarna commented 9 years ago

On 08.07.2015 08:06, 4Dolio wrote:

In reference to this commit: 02747c0#diff-25d902c24283ab8cfbac54dfa101ad31 https://github.com/lizardfs/lizardfs/commit/02747c012d33ce8e79d9b3cb6e080af8b46e3b8e#diff-25d902c24283ab8cfbac54dfa101ad31 ha-cluster: Drop pacemaker/corosync stack from 2015 Jun 2nd.

I only just now noticed that you are dropping these corosync/pacemaker specific "ha-cluster" scripts. I am using corosync+pacemaker for my cluster and am using your ocf metadataserver script. But I did not use your lizardfs-cluster-manager "wizard" script, rather I manually configured the corosync/pacemaker services and configurations. With that said, I do agree that they are not very user friendly. I'm still at a lose for how it is supposed to coldboot/bootstrap/initialize the lizard cluster, I just haphazardly manage to bring it online. I would love an explanation about how this was supposed to work.

The "wizard" script was in development stage and was configured to work fine in our testing environment. It wasn't ready for an official release yet. Since corosync/pacemaker solution we implemented continuously failed to pass our tests, the script was never improved.

I also notice that there is an error in the ocf metadataserver script which would prevent it from working properly without patching that error. Perhaps this was intentional in order to prevent the less motivated from using this method and getting into trouble. I'm not sure you would want me to share the fix...

No, it definitely wasn't intentional, our policy is to never publish code that contains bugs we know of. Please reveal the secret, it may help some administrators in their pain ;)

I also learned that the way the Debian7 distributes corosync and pacemaker is not compatible with the way your ha-cluster scripts expect those services to behave and interact. So I have a set of sed commands which fix the stock corosync and pacemaker services such that they work as expected. I had intended to share these after verifying that they allow for the lizardfs-cluster-manager script to work as intended.

We would appreciate if you share them anyway, corosync/pacemaker will not be an official HA manager for LizardFS, but it would still be a valuable piece of code for the community.

In my experience my corosync/pacemaker driven ha-cluster had been very reliable (with the bootstrapping exception). I can confidently reboot any member of the cluster at will without worrying about the effects, which are imperceptible. I'm still working on migrating two moosefs systems to lizardfs, but I would really like to have automatic failover, so for the moment I intend to continue to use corosync+pacemaker.

We will not include official scripts for corosync/pacemaker in our repo, but we certainly do not discourage using it, especially if you have a good experience with it.

I understand your dropping this stuff, but it appears that you kept the ha-cluster-managed personality for the master service intact? Which I expect needs to stay in place regardless of which type of ha cluster technology you end up using. I sort of expected LizardFS to remain semi failover technique agnostic such that one might use carp, or corosync/pacemaker, or any number of raft implementations. Can you speak to that point? I see mention of uraft, but would appreciate pointers to exactly what that is.

All functionalities that allow LizardFS to cooperate with corosync/pacemaker will remain intact. If anything, they will be enhanced and extended, so stay tuned for updates that have "ha" keyword, they may provide some useful tweaks for your scripts.

— Reply to this email directly or view it on GitHub https://github.com/lizardfs/lizardfs/issues/299.

4Dolio commented 9 years ago

All sounds great, thanks for the quick reply. Ill post some solutions tomorrow. Yes I feel silly not knowing how the cluster was intended to be cold started, how was it intended to work for your testing? Which parts of your tests were problematic?

psarna commented 9 years ago

We had problems with the stack even before we started wondering if cluster cold starts properly on supported systems. You could either specify your question or just ignore my answer :) (both are fine with me)
Major problems:
- the stack was not easily portable (e.g. as you already discovered Debian 7 didn't work properly, I'm not even mentioning CentOS)
- corosync tended to livelock, consuming 100%CPU and it was really unpleasant to debug (in fact, I would still love to know why it happened).
- Configuring the stack correctly is hard, finding out what was configured incorrectly even harder, so it took a lot of time to set up a working test environment (and keep it working all the time).

4Dolio commented 9 years ago

Sorry about the long delay.. I will also attempt to send merge requests, but I'm not sure how right now. Here are the promised solutions in the form of some commands:

First, corosync and pacemaker on Debian 7 are not in the correct operating mode. They are distributed to run in mode version 0 (which is a dependent based mode where by corosync manages(stops/starts) pacemaker). But they need to run in mode version 1 (which is where they are semi-independent of each other to allow for them to be stopped and started individually). root@node:~$sed -i 's/^.*ver:.*/ \tver: 1/' /etc/corosync/corosync.conf # set to version 1

Now they still need to be stopped and started in the proper order. Stop pacemaker then corosync. Start corosync then pacemaker. But corosync no longer manages pacemaker per the above version change. So pacemaker must be setup with proper dependencies and to start and stop at normal runlevels. root@node:~$sed -i 's/^# Required-Stop:.*/# Required-Stop:\t$network corosync/' /etc/init.d/pacemaker # pacemaker must be stopped before corosync root@node:~$sed -i 's/^# Default-Start:.*/# Default-Start:\t2 3 4 5/' /etc/init.d/pacemaker # let it be started independently at normal runlevels root@node:~$sed -i 's/^# Default-Stop:.*/# Default-Stop: \t0 1 6/' /etc/init.d/pacemaker # let it be stopped root@node:~$update-rc.d pacemaker defaults ; update-rc.d corosync defaults # apply the above runlevel changes root@node:~$sed -i 's/START=no/START=yes/' /etc/default/corosync # let corosync start on boot

Next, I did not use the lizardfs-cluster-manager script, so I manually setup the corosync cluster: root@node:~$corosync-keygen # and distrubute new key to other nodes as /etc/corosync/authkey root@node:~$sed -i 's/mcastaddr: 226.94.1.1/mcastaddr: 226.94.1.220/' /etc/corosync/corosync.conf # tune to my environment root@node:~$sed -i 's/bindnetaddr: 127.0.0.1/bindnetaddr: 10.80.8.0/' /etc/corosync/corosync.conf # tune to my environment root@node:~$ ..... # do the other corosync and pacemaker configuration as prescribed by the lizardfs-cluster-manager script and other Lizard documentation after starting corosync and then pacemaker (By the way it may help to have a 1 second delay after starting corosync before starting pacemaker).

And finally... There is a typo in the metadataserver script which should have prevented it from ever working at all. See https://github.com/lizardfs/lizardfs/blob/2.6.0-wip/src/ha-cluster/metadataserver.in#L85 exports_cfg=$(read_cfg_var ${OCF_RESKEY_master_cfg} WORKING_GROUP = @ETC_PATH@/mfs/mfsexports.cfg) # is setting the exports to the working group value?? root@node:~$sed -i 's/^exports_cfg.*/exports_cfg=\/etc\/mfs\/mfsexports.cfg/' /usr/lib/ocf/resource.d/lizardfs/metadataserver # After make install I just fix it by hard coding the correct value.

So, That is it... This should work for an empty lizardfs cluster on Debian 7... However, as the metadata set grows and it takes longer to start/stop the metamaster service the proposed corosync cluster config and the existing metadataserver script begin to fail. They fail because they either have too short a timeout, and/or because the metamaster server service stops responding to the network queries too early while stopping or when it is doing certain startup processes... I'm still working on tweaks to allow for the growth of the metadata size and the increasing time it takes to start and stop. I think that I just recently managed to address and fix these problems, but I need to test them more and clean them up as I'm still not entirely happy with how I "fixed" it...

Zorlin commented 9 years ago

Thanks for your contribution. Hopefully this will help some people and maybe speed up development of the mainline solutions. On Jul 18, 2015 12:04 PM, "4Dolio" notifications@github.com wrote:

Sorry about the long delay.. I will also attempt to send merge requests, but I'm not sure how right now. Here are the promised solutions in the form of some commands:

First, corosync and pacemaker on Debian 7 are not in the correct operating mode. They are distributed to run in mode version 0 (which is a dependent based mode where by corosync manages(stops/starts) pacemaker). But they need to run in mode version 1 (which is where they are semi-independent of each other to allow for them to be stopped and started individually). root@node:~$sed -i 's/^.ver:./ \tver: 1/' /etc/corosync/corosync.conf # set to version 1

Now they still need to be stopped and started in the proper order. Stop pacemaker then corosync. Start corosync then pacemaker. But corosync no longer manages pacemaker per the above version change. So pacemaker must be setup with proper dependencies and to start and stop at normal runlevels. root@node:~$sed -i 's/^# Required-Stop:./# Required-Stop:\t$network corosync/' /etc/init.d/pacemaker # pacemaker must be stopped before corosync root@node:~$sed -i 's/^# Default-Start:./# Default-Start:\t2 3 4 5/' /etc/init.d/pacemaker # let it be started independently at normal runlevels root@node:~$sed -i 's/^# Default-Stop:.*/# Default-Stop: \t0 1 6/' /etc/init.d/pacemaker # let it be stopped root@node:~$update-rc.d pacemaker defaults ; update-rc.d corosync defaults

apply the above runlevel changes

root@node:~$sed -i 's/START=no/START=yes/' /etc/default/corosync # let corosync start on boot

Next, I did not use the lizardfs-cluster-manager script, so I manually setup the corosync cluster: root@node:~$corosync-keygen # and distrubute new key to other nodes as /etc/corosync/authkey root@node:~$sed -i 's/mcastaddr: 226.94.1.1/mcastaddr: 226.94.1.220/' /etc/corosync/corosync.conf # tune to my environment root@node:~$sed -i 's/bindnetaddr: 127.0.0.1/bindnetaddr: 10.80.8.0/' /etc/corosync/corosync.conf # tune to my environment root@node:~$ ..... # do the other corosync and pacemaker configuration as prescribed by the lizardfs-cluster-manager script and other Lizard documentation after starting corosync and then pacemaker (By the way it may help to have a 1 second delay after starting corosync before starting pacemaker).

And finally... There is a typo in the metadataserver script which should have prevented it from ever working at all. See https://github.com/lizardfs/lizardfs/blob/2.6.0-wip/src/ha-cluster/metadataserver.in#L85 exports_cfg=$(read_cfg_var ${OCF_RESKEY_master_cfg} WORKING_GROUP = @ETC_PATH@/mfs/mfsexports.cfg) # is setting the exports to the working group value?? root@node:~$sed -i 's/^exports_cfg.*/exports_cfg=\/etc\/mfs\/mfsexports.cfg/' /usr/lib/ocf/resource.d/lizardfs/metadataserver # After make install I just fix it by hard coding the correct value.

So, That is it... This should work for an empty lizardfs cluster on Debian 7... However, as the metadata set grows and it takes longer to start/stop the metamaster service the proposed corosync cluster config and the existing metadataserver script begin to fail. They fail because they either have too short a timeout, and/or because the metamaster server service stops responding to the network queries too early while stopping or when it is doing certain startup processes... I'm still working on tweaks to allow for the growth of the metadata size and the increasing time it takes to start and stop. I think that I just recently managed to address and fix these problems, but I need to test them more and clean them up as I'm still not entirely happy with how I "fixed" it...

— Reply to this email directly or view it on GitHub https://github.com/lizardfs/lizardfs/issues/299#issuecomment-122480267.

4Dolio commented 9 years ago

Now for some more recent observations and tweaks for corosync:

I have added a stop operation, and also set the start and stop timeout values to 30 minutes. Our largest production metadata set size is 10GB on disk and ~40GiB in RAM, so it can take ~15 minutes to cold start and cold stop. op start interval="0" timeout="1800" \ op stop interval="0" timeout="1800" \

I added these changes to the ocf resource metadataserver script. The master process can stop responding to network based probes while it is stopping and while it is starting in shadow mode. So, if the process is running, But we do not get a network based probe response, then check for the lock file and assume that we might still be starting or stopping and return that we are still running but we have no metadata available. I also un-commented the notify option, shrugs:

root@node:~$diff -ruN ./metadataserver /usr/lib/ocf/resource.d/lizardfs/metadataserver --- ./metadataserver 2015-07-18 12:04:50.999944151 +0800 +++ /usr/lib/ocf/resource.d/lizardfs/metadataserver 2015-07-18 10:03:35.806791055 +0800 @@ -386,9 +386,18 @@ # master probe_result=$(lizardfs_probe) if [ $? -ne 0 ] ; then +# Failed to query via port. +# But we might still be starting or stopping... +# Check if we have a lock file before deciding our state + if [ -e "$master_lock" ] ; then + ocf_log debug "LizardFS metadata server might be starting or stopping." + update_master_score ${score_shadow_no_metadata} + return $OCF_RUNNING_MASTER + else ocf_log err "failed to query LizardFS master status" return $OCF_ERR_GENERIC fi + fi local personality=$(echo "$probe_result" | cut -f1) local connection=$(echo "$probe_result" | cut -f2) local metadata_version=$(echo "$probe_result" | cut -f3) @@ -549,7 +558,7 @@ reload) lizardfs_master_reload;; promote) lizardfs_master_promote;; demote) lizardfs_master_demote;; - # notify) lizardfs_master_notify;; + notify) lizardfs_master_notify;; # We have already validated by now validate-all) ;; *) usage; exit $OCF_ERR_UNIMPLEMENTED;;

The above struck out patch ended up not working as expected... Much of the rest of the observations do still apply however. Might have a "better" patch later...

With a somewhat smaller (1.5GB on disk and 5GiB in RAM) metadata set. I observed that while stopping the shadow master it takes ~33 seconds to stop, which is longer than the default timeout of 20 seconds. So corosync would believe the stop failed, report that node as having failed now unmanaged. You must then manually intervene and execute crm resource cleanup lizardfs-ms to clear this false error state before that node can be managed by corosync again.

Similarly I observed that when a shadow master is started it comes online and responds to the network probes from the ocf resource metadataserver script almost immediately. But it is actually downloading new metadata from the cluster master. Once it finishes downloading it then reloads that new metadata and stops responding while it does so. Even a 'small' cluster with 5GiB of metadata requires about a minute to re-process the new metadata, during which time it has stopped responding. So corosync believes it has failed and starts killing it or trying again, forever...

I believe that the new 30 minute start/stop timeouts and the patch I posted sort of resolve these start/stop problems. The patch allows me to keep the op monitor intervals nice and short because it now returns "we might be starting or stopping" I'M OK. The 30 minutes allows for plenty of time to cleanly startup or shutdown. But ultimately I think it would be best if the metamaster service itself would continue to respond to these network based queries during these potentially very long duration start and stop events. And you guys likely know better how to change your script than what I did with this lousy patch, which I am well aware could be flawed.

4Dolio commented 9 years ago

This patch appears to work. It can properly identify when the mfsmaster service is in the process of stopping, and will now wait. It can properly identify when the mfsmaster shadow service is not responding to the network probes, because it is re-processing new metadata received from the master node. It falsely believes that while the master node is starting and loading the metadata that the node is in a shadow/stopping state, but the result is acceptable in that it waits for the service to finish starting up.

I have also re-tuned corosync with a 1s interval for the monitor Master op. a 2s interval for the monitor Slave op. All of the start, stop, promote, demote ops now have a 1800 timeout.

corosync still sets an error upon the demotion of the master node, but this seems fine as such a node will still come up as a shadow and can still be successfully be promoted. When the master node is being stopped by corosync, the cluster appears to not promote a shadow until the master has completely stopped. This is a new behavior and I am unsure if it is caused by my patch, by being run in a completely different environment, or because my previous experience was with LizardFS 2.5.5, and the version of the metadataserver script from that same version (Will look into that more). I suspect it is still due to a problem with the ocf resource metadataserver script, which should promote a shadow as soon as the master is told to stop. On the bright side I can now cold start the cluster, however it may simply be bringing the first live node up as the master, rather than choosing one based on the best available metadata version given the past cluster size/member count.

One additional observation is that when a shadow master is promoted it appears to be losing the session information and all mounts become stale. Once again, I do not know if this is due to my environment being different, changes from version 2.5.5 to 2.6.0, or something else. The shadow appears to have a valid and up to date copy of the sessions.mfs file, though I have not examined it's contents. And the stats.mfs which is the source for the master charts appears to be properly shared between nodes as they transition so I'm not sure why the sessions are lost..

root@node:~$diff -ruN ./metadataserver.2.6.0 /usr/lib/ocf/resource.d/lizardfs/metadataserver

--- ./metadataserver.2.6.0  2015-07-18 12:04:50.999944151 +0800
+++ /usr/lib/ocf/resource.d/lizardfs/metadataserver 2015-07-21 13:50:21.249601178 +0800
@@ -352,7 +352,24 @@
    else
        host=$matocl_host
    fi
-   lizardfs-admin metadataserver-status --porcelain "${host}" "$matocl_port"
+# Send any errors to /tmp/lizardfs_probe for later examination
+   probe_results=$(lizardfs-admin metadataserver-status --porcelain "${host}" "$matocl_port" 2>|/tmp/lizardfs_probe)
+        ret=$?
+        if [ $ret -eq 0 ] ; then
+       echo "$probe_results"
+   else
+# Error output included ENOTCONN so the shadow service is stopping, everything is ok, just wait.
+       if [ `grep -c ENOTCONN /tmp/lizardfs_probe` -gt 0 ] ; then
+           echo -e "shadow\tstopping"
+# Check for recent file activity and return a unique response if there is file activity in the last minute.
+       elif [ ! -z 'find ${data_dir}/ -mmin -1' ] ; then
+           echo -e "shadow\tsyncing"
+       fi
+   fi
+# else we returned nothing? shrugs
+# Verified `killall mfsmaster` externally kills mfsmaster that corosync will wait for a clean exit and then start shadow master again.
+# Verified `killall -9 mfsmaster` (Crash) kills mfsmaster than `mfsmetarestore -a ; crm resource cleanup lizardfs-ms` is needed to restore node.
+# If mfsmaster is killed externally then pacemaker may "lose track" of the process? causing pacemaker to not attempt to stop it when pacemaker is stopped.
 }

 update_master_score() {
@@ -400,6 +417,18 @@
                set_metadata_version "${metadata_version}"
                return $OCF_RUNNING_MASTER
            ;;
+           shadow/stopping)
+               ocf_log debug "running in shadow mode, service is stopping."
+               # Do not promote shadow if it is shutting down.
+               update_master_score ${score_shadow_no_metadata}
+               return $OCF_SUCCESS
+           ;;
+           shadow/syncing)
+               ocf_log debug "running in shadow mode, syncing up with master."
+               # Do not promote shadow which is syncing up to the master during startup
+               update_master_score ${score_shadow_no_metadata}
+               return $OCF_SUCCESS
+           ;;
            shadow/connected|shadow/disconnected)
                local cluster_metadata_version=$(get_metadata_version)
                if [[ ( $? != 0 ) || ( ${cluster_metadata_version} == "" ) ]] ; then

4Dolio commented 9 years ago

Maybe in version 2.5.5 the metadataserver script used to stop --quick option when stopping the master? which would be almost instant... perhaps I'll play around with that tomorrow. I think ideally a clean stop is better, if it could promote a shadow without waiting for the clean stop....

4Dolio commented 9 years ago

Adding the old --quick function back and calling that from the demotion routine helped.

4Dolio commented 9 years ago

I believe I am stuck just shy of getting corosync to behave exactly as I would like which is completely without any intervention given there are no catastrophic cluster faults (power out).

I am using the quickstop for all demotion and stop state transitions, which works great except: 1- If a master is demoted it can only rejoin as a shadow (something in the metadata prevents it from starting cleanly, but I do not know what). This is perfect, I only want a demoted member to ever become a shadow without requiring some manual intervention. 2- Unfortunately when a shadow master is quick stopped it does not behave this way. It can still be promoted to the first master of the cluster. This could be unsafe as it's metadata might be old. I believe that ideally I would like for a quick stopped shadow to behave the same way as a quickly stopped master does. 3- Finally, the quick stop call will refuse to quick stop if there are no loggers/shadows online and connected. This happens outside of corosyncs control. This is great because when the quick stop fails we execute a normal clean stop.

The reasoning behind all of this is that in the case of a complete stop of the entire cluster only the last master to stop will be able to be started without any other cluster members. Any other cluster member would require manual intervention or would only be able to join as a shadow. This should allow for a completely hands off cluster full stop and full start. I expect that member servers will only be transitioned in the case of a system reboot, or a corosync/pacemaker stop/start, or of course in the case of a failure of the master service itself (which I do not believe I have ever seen in the wild). In all stop cases, including abrupt power failure, all nodes are writing changelog files, so metadatarestore could be used on any node to manually find the best one to recover with...

Can anyone shed some light on why the metadata of a master server which is quickly stopped prevents that service from starting as a master? Can that same behavior be applied to a quickly stopped shadow server? Am I completely off base in my reasoning?

psarna commented 9 years ago

The quick-stop feature was designed for stopping master server without forcing it to dump all metadata to disk first. It was done only to assure, that master -> shadow demotion is fast. Conclusions:

Shadow masters were never considered to be a target of 'quick-stop' - but, even after quick-stopping them, they would not be promoted until their metadata is as high as in other nodes in quorum (if the cluster configuration/implementation is correct).
Master server which have been quick-stopped is considered as 'stale', because it was denied the opportunity to write back its metadata. That's exactly why it is prevented from starting as master again.
If no shadow masters are connected, there are no fresh backup copies of filesystem's metadata. In that scenario, quick-stop is basically a suicide. That's why it is prevented from happening.

borkd commented 9 years ago

@4Dolio and development team - I am a big fan of automation, so this post is just some food for thought and a way to track the progress Lizardfs makes in this area. I'm assuming changes to the way HA is and will be handled are aimed for automation of production use.

What design steps do you intend to take to defend against two (or more) active masters during network partition(s)? Will there be a reconciliation procedure after the network heals the split brain? If the loss/corruption cannot be reconciled what is the backup plan? Full-on automation can do an amazing amount of potentially irreparable damage really fast on a busy system. Meatware remains relevant when it comes to critical bits.

Actions described below could help to reconcile a split brain mess, but RAM and disk usage should be watched carefully on all metadata servers. I am sure dev team will have some better ideas.

active master creates a rollback point upon quorum change or significant topology change (active master partitioned from the shadows would observe that) -- an automatic filesystem and metadata snapshot could be a great start here.
automatic filesystem and metadata snapshot happens on (possibly isolated) shadow after it won the election or got promoted by other means, including (re)start with a config file stating master role, prior to it serving any requests from the clients or directing chunkservers.
upon discovery of master IP conflict masters and shadows go into a freeze (via iolimits/iptables/other) leaving the decision of which branch of changes gets to live on to the admin.

Destroying an extra snapshot here or there, while time consuming, would be a manageable way to deal with typical day-to-day topology changes.

Why this is important?: https://aphyr.com/tags/jepsen and https://aphyr.com/posts/288-the-network-is-reliable

4Dolio commented 9 years ago

Thank you for bringing up this very good topic to consider. At this point I have not considered network faults resulting in split brain. I'm not sure how my current implementation might react, but I'll begin to do some thought exercises and perhaps test environment experiments. I'll try to look into this in more detail in the near future and reply at that time... Also hoping to do a pull request with my improved script soon.

richard-scott commented 8 years ago

I've got HA to work using UCARP and a couple of bash scripts to change PERSONALITY to "master" on the elected shadow node and run mfsmaster reload to promote it to "master" status.

One thing I do notice is that during the fail over I lose my entry:

mfs#mfsmaster:9421   17G     0   17G   0% /mnt/volume

for a second or so from the output of "df -h". I don't think it stops functionality, but I'm not sure. The entry in the output from "mount" is still there during the fail-over

mfs#mfsmaster:9421 on /mnt/volume type fuse (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other)

fail over takes less than 3 seconds :-)

onlyjob commented 8 years ago

@richard-scott, a while ago I've described UCARP-based HA configuration. However IMHO UCARP is not suitable for production. The problem with UCARP is that HA between only two hosts is intrinsically unreliable when communication fails. Suppose switch is down for ~10 seconds (e.g. power cycle) and there is no connection between master and shadow-master. Ucarp feels that other side disappears and promotes shadow master to master. Now connection restores and, clash(!), you have two masters and two machines claiming the same IP. I believe only consensus based HA solution of three nodes (PAXOS or Raft based) would be reliable for such scenario.

richard-scott commented 8 years ago

The problem with UCARP is that HA between only two hosts is intrinsically unreliable when communication fails.

That is a misleading comment... any HA solution has problems if it has no network connectivity ;-)

and yes you can't rely on on a HA solution consisting of only two nodes... In my setup there needs to be a number of different criteria met before a shadow can be promoted to be a master.

onlyjob commented 8 years ago

any HA solution has problems if it has no network connectivity ;

That sounds like generalisation. I was talking specifically about temporary loss of connectivity. Besides Raft (e.g. etcd, fleet) is doing fairly well in such situation.

4Dolio commented 8 years ago

I believe I have begun a pull request for corosync ocf metadataserver.in script which now works properly with 2.6.0: https://github.com/4Dolio/lizardfs-corosync_ocf_metadataserver/tree/4Dolio-corosync-ocf-metadataserver-patch-1 I will probably continue to work on this and also test compatibility with 3.x branch.

4Dolio commented 8 years ago

I have added a new SetupCorosyncServices.sh and InitializeCorosync.sh which help to fix the services to work properly and then initialize a clusters settings: https://github.com/4Dolio/lizardfs/tree/4Dolio-corosync-ocf-metadataserver-patch-1/src/ha-cluster

blink69 commented 8 years ago

We will put officialy uraft package into 3.12.0 package so this and #326 can be closed then. :)

hoonetorg commented 6 years ago

Hi @blink69 3.12.0 is out. Providing uraft soon will give you lots of attention. I began testing lizardfs, because of your comment and I'm impressed. I tested a lot of SDNs/HA storage (Ceph, DRBD, Gluster etc.) and LizardFS is a good mixture.