Blackpaw opened this issue 7 years ago
lizardfs-admin metadataserver-status --porcelain localhost 9421
would seem to do the trick? Would it be a good idea to do a "save-metadata" first, or does it automatically flush any data on restart?
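For what it's worth, a minimal sketch of that pre-restart check (assuming the master's admin port is 9421 on localhost, and that your lizardfs-admin build provides the save-metadata subcommand, which asks the running master to dump its metadata to disk and authenticates with the admin password from mfsmaster.cfg):
#!/bin/bash
# Query the local metadata server's state in machine-readable form
STATUS=$(lizardfs-admin metadataserver-status --porcelain localhost 9421)
echo "current state: ${STATUS}"
# Explicitly ask the running master to flush metadata to disk before restarting
lizardfs-admin save-metadata localhost 9421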
The following logic has ended up working very well:
case $STATE in
"MASTER") logger -t lizardha -s "MASTER state"
ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg
# Get local master status
STATUS=$(lizardfs-admin metadataserver-status --porcelain localhost 9421)
logger -t lizardha -s "local master status = ${STATUS}"
# check if already master
if [[ $STATUS == *"master"* ]]; then
logger -t lizardha -s "Already master, doing nothing"
exit 0
fi
# Check if connected to current master
if [[ $STATUS == *"disconnected"* ]]; then
logger -t lizardha -s "Disconnected from master, restarting"
restart_master_server
else
logger -t lizardha -s "Connected to master, reloading"
systemctl reload lizardfs-master.service || systemctl start lizardfs-master.service
fi
# Exit
exit 0
;;
Flipping between master/shadow is very smooth, with no chunks going under goal. I started up a number of VMs on the three nodes so the system was under heavy load, then hard reset all three simultaneously - the system came back up automatically with no data loss.
Interesting. Are you using keepalived? Could you please share your full keepalived configuration and up/down scripts?
Yes, Keepalived:
keepalived.conf
global_defs {
notification_email {
admin@*****.com.au
}
notification_email_from lb-alert@****.com.au
smtp_server smtp.emailsrvr.com
smtp_connect_timeout 30
}
vrrp_script check_lizardfs_master {
script "pidof mfsmaster > /dev/null"
interval 1 # check every second
fall 1 # 1 failure required
}
vrrp_instance VI_1 {
state EQUAL
interface ipprivate
virtual_router_id 51
priority 50
nopreempt
smtp_alert
advert_int 1
virtual_ipaddress {
10.10.10.249/24
192.168.5.249/24
}
notify "/etc/mfs/keepalived_notify.sh"
}
/etc/mfs/keepalived_notify.sh
#!/bin/bash
TYPE=$1
NAME=$2
STATE=$3
logger -t lizardha -s "Notify args = $*"
function restart_master_server() {
logger -t lizardha -s "Stopping lizardfs-master service"
systemctl stop lizardfs-master.service
if [ -f /var/lib/mfs/meta/metadata.mfs.lock ];
then
logger -t lizardha -s "Lock file found, assuming bad shutdown"
fi
logger -t lizardha -s "Starting lizardfs-master service"
systemctl restart lizardfs-master.service
systemctl restart lizardfs-cgiserv.service
logger -t lizardha -s "done."
}
case $STATE in
"MASTER") logger -t lizardha -s "MASTER state"
ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg
# Get local master status
STATUS=$(lizardfs-admin metadataserver-status --porcelain localhost 9421)
logger -t lizardha -s "local master status = ${STATUS}"
# check if already master
if [[ $STATUS == *"master"* ]]; then
logger -t lizardha -s "Already master, doing nothing"
exit 0
fi
# Check if connected to current master
if [[ $STATUS == *"disconnected"* ]]; then
logger -t lizardha -s "Disconnected from master, restarting"
restart_master_server
else
logger -t lizardha -s "Connected to master, reloading"
systemctl reload lizardfs-master.service || systemctl start lizardfs-master.service
fi
# Exit
exit 0
;;
"BACKUP") logger -t lizardha -s "BACKUP state"
ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
restart_master_server
exit 0
;;
"STOP") logger -t lizardha -s "STOP state"
# Do nothing for now
# ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
# systemctl stop lizardfs-master.service
exit 0
;;
"FAULT") logger -t lizardha -s "FAULT state"
ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
restart_master_server
exit 0
;;
*) logger -t lizardha -s "unknown state $STATE"
ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
restart_master_server
exit 1
;;
esac
However, I've since disabled the keepalived conf and will handle failover manually while I consider some sodding big holes in it. There are issues with keepalived grabbing master before the network is active, leading to the metadata being backdated, and I have concerns about what could happen if a node was down for a while, then came back up and grabbed master.
A floating IP with a quorum election would be better, which is not possible with keepalived. Maybe investigate Corosync, as it's already integrated with Proxmox.
Or pay SkyTechnology! If I can demonstrate the improved performance and flexibility to my boss, we could probably swing it.
Or put pressure on SkyTechnology to publish the native HA solution, announced so long ago and never released.
@Blackpaw I work with SkyTechnology on the LizardFS project and would be happy to work with you on showing improved performance with our uRaft solution. mark.mulrainey@lizardfs.com
@MarkyM as @guestisp mentioned, it was promised that uRaft would be released and/or open sourced during the next major LizardFS release, at the time believed to be 3.12.0.
Have plans changed? A free uRaft is essential to some of my upcoming deployments and is a huge win for open LizardFS over MooseFS.
Agree with @Zorlin. I think everyone understands that things can take longer than expected, but a confirmation that SkyTechnology will stand by that word, along with an updated time estimate, would be appreciated.
I think if the people of SkyTechnology release the HA as open source, then LizardFS will certainly gain popularity and respect in the SDS arena, and in the long run the company will benefit more commercially. To become a successful open source project, it should be fully open sourced, not shipped with a limited feature set or commercially locked-in features.
Yes, we mentioned that - it was me, to be precise :) Currently in 3.12 we will provide native NFS 3.0, 4.0 and pNFS 4.1 support based on Ganesha software - as someone already found in our sources. Also in 3.12 we will provide support for ACL with AD integration. Our roadmap is based on customer requests for a simple reason - we need funds to continue development. Most of it is released to open source straight away to share our ideas and work with everybody - open source is the way to go! Now, going back to the HA mechanism - we have to postpone that. There is still an ongoing debate about which one to release - lizardfs-uraft or the built-in HA in the master module that is still under construction. We hope that will happen in the next 6 months.
Oh great, thank you for the update. Looking forward to the next 6 months. I am actually delaying putting our LizardFS storage into production until there is a proper HA solution.
Thanks for the update blink, fascinating:
Currently in 3.12 we will provide native NFS 3.0, 4.0 and pNFS 4.1 support based on Ganesha software
VMware users will love that. And perhaps an iSCSI plugin in the future?
ACL with AD integration.
Now that is interesting
There is still an ongoing debate about which one to release - lizardfs-uraft or the built-in HA in the master module that is still under construction
I'm happy to wait - it's very important to get right. After some experience with floating-IP-style failover using external tools (keepalived, ucarp), something integrated into the master sounds much more reliable. Looking forward to hearing more detail on that.
If you need alpha/beta testers, I'm happy to fire up a few VMs on a VLAN to test.
We are already testing a LizardFS setup with around 120 TB of storage, but without HA. We are running around 25 VMs on the LizardFS storage backend but don't feel confident enough to try Keepalived for HA. I am also happy to test any HA solution on our cluster with the VM backend to check the performance.
+1 for a beta test. I'm testing LizardFS right now on a small cluster, and the only thing preventing us from putting it in production is the missing HA support.
I've seen that the official docs refer to uraft; maybe releasing that as-is for testing purposes would result in many more tests done by the community.
More tests mean more bugfixes and a more stable product.
Anyway, I'm with @blackpaw: an integrated solution could be better than external tools. Maybe you could integrate some failover mechanism into the master so that there is no need to configure anything - just add as many shadows as you want and LizardFS does the rest (like Ceph: just add MONs, no need to configure failover).
@psarna maybe this is a stupid proposal, but why not allow a shadow to also be a master without reconfiguration? Something like master-master replication in MySQL.
HAProxy could be used in place of keepalived/ucarp, monitoring 2 or more masters. As every shadow is always in sync, and with my proposal there would be no difference between shadow and master (every shadow can accept connections from clients), a failover is as simple as moving the IP or pointing the HAProxy backend to a different server.
HAProxy is able to run periodic checks against every backend, so any shadow not connected to the master would be removed, with a check something like this:
[[ "$(lizardfs-admin metadataserver-status --porcelain backend.server.tld 9421)" == *"disconnected"* ]]
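For illustration, a minimal sketch of such a check as an HAProxy external-check script (assuming HAProxy's external-check mechanism is enabled via "option external-check" and "external-check command ..." in the backend; the script path and logic here are only illustrative, not an official LizardFS tool):
#!/bin/bash
# HAProxy invokes an external check as: <proxy_addr> <proxy_port> <server_addr> <server_port>
SERVER_ADDR=$3
SERVER_PORT=$4
STATUS=$(lizardfs-admin metadataserver-status --porcelain "$SERVER_ADDR" "$SERVER_PORT" 2>/dev/null)
# Mark the backend down if the metadata server is unreachable or reports
# itself as disconnected from the current master
if [[ -z "$STATUS" || "$STATUS" == *"disconnected"* ]]; then
    exit 1
fi
exit 0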
If the current active master dies, HAProxy will move connections to one of the surviving hosts (it also knows which "shadows" are in sync, thanks to the monitoring check written above, so any additional shadows in the cluster that are out of sync won't be chosen by HAProxy).
You don't have to run any external script during failover; as all shadows are also masters, no failover procedure is needed.
Anyway, if the master is able to maintain a list of connected shadows (and with shadows able to accept connections, as in my proposal), mfsmount could be customized to automatically reconnect to a different shadow. On first connection, mfsmount would fetch a list of shadow nodes from the master node, then do the same every X seconds to stay updated in case of new (or removed) nodes. A failover, in this way, would be managed natively by mfsmount. mfsmount is already able to reconnect to a master, but this way it would be able to reconnect to a different one. Which one? Probably the one with the highest metadata version number.
I'm using corosync+pacemaker for failover and it has worked great for the last two years. I am all for better integration, but I also want to ask that backwards compatibility be maintained for as long as possible. Would love to see shadows be able to do something beyond being a hot standby before seeing built-in HA. I am not sure whether I prefer built-in HA or the ability to choose from many other HA options. It all depends on what the built-in HA ends up looking like and how well it works.
I'm using corosync+pacemaker for failover and it has worked great for the last two years.
Did you publish your corosync+pacemaker setup? Is it "open source"?
Would love to see shadows be able to do something beyond being a hot standby before seeing built-in HA
Me too.
See some old HA issues: #184, #299, #326, #369, #371, #389, #617, and ... Also see https://github.com/4Dolio/lizardfs/tree/4Dolio-corosync-ocf-metadataserver-patch-1/src/ha-cluster, which is built on the original LizardFS metadataserver corosync script that almost worked. Please keep in mind that my scripts are not pretty and can be a little insane at times.
See also https://github.com/lizardfs/lizardfs/issues/613#issuecomment-340290619
Thanks a lot for the update blink! No problem with delays, or that some features appear in closed-source first, but given the MooseFS background I think users feel a lot more comfortable knowing you're firmly behind the open source model for all components.
Thank you for the update. I'm a little disappointed but I agree with everyone else - it's incredibly important to do it right! I really appreciate the transparency.
Oh, I thought that HA LizardFS was Open Source already. :confused: I actually decided to go with LizardFS because MooseFS didn't have an Open Source HA solution. I really need a clustered storage solution that I can run on my Docker Swarm. MooseFS and LizardFS look like they are perfect for my use-case, but I need to be able to run it HA.
If anybody could explain how they use an external tool like Keepalived from a high level, to help me understand what is necessary to get an Open Source HA solution running, that would be appreciated.
I actually just found a LizardFS blog post that says that HA is now open sourced. Is it just not yet documented?
OK, I found the documentation, but the lizardfs-uraft package doesn't exist. Hopefully there is a way to download/build it. :crossed_fingers:
Could I contact sales for lizardfs-uraft or any new package?
Hi, drop me an email directly at mark.mulrainey@lizardfs.com
Is it possible to test if a shadow has an active connection to a master?
I've run into a situation where a shadow is starting up and there is no master for it to connect to (first startup of the cluster), then it gets a command to reload as master, which puts it in a failed state as it has no metadata.
I work around this by restarting the shadow so that it always loads the metadata from disk, but this often leads to missing chunks when promoting an active shadow to master.
Alternatively, is there a way to fail the startup of a shadow if it can't connect to a master?
NB: This situation happens with my keepalived HA, since keepalived starts in the BACKUP state, holds an election, and restarts the winner as MASTER.
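Something like the following might work as a pre-promotion guard, reusing the same porcelain check as the notify script above (just a sketch: it assumes the shadow's admin port is 9421 on localhost, and that an empty or "disconnected" status means the shadow has no live master connection):
#!/bin/bash
# Hypothetical guard to run before promoting a shadow to master
STATUS=$(lizardfs-admin metadataserver-status --porcelain localhost 9421)
# An empty result means the admin port did not answer; "disconnected" means
# the shadow has no live connection to the current master
if [[ -z "$STATUS" || "$STATUS" == *"disconnected"* ]]; then
    logger -t lizardha -s "Shadow has no live master connection, refusing promotion"
    exit 1
fi
exit 0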