lizardfs / lizardfs

LizardFS is an Open Source Distributed File System licensed under GPLv3.
http://lizardfs.com

Test if Shadow is connected to master #598

Open Blackpaw opened 7 years ago

Blackpaw commented 7 years ago

Is it possible to test whether a shadow has an active connection to a master?

I've run into a situation where a shadow starts up with no master to connect to (first startup of the cluster), then receives a command to reload as master, which puts it in a failed state because it has no metadata.

I work around this by restarting the shadow so that it always loads the metadata from disk, but this often leads to missing chunks when promoting an active shadow to master.

Alternatively, is there a way to fail the startup of a shadow if it can't connect to a master?

NB: This situation happens with my keepalived HA setup, since keepalived starts in BACKUP state, holds an election and restarts the winner as MASTER.

Blackpaw commented 7 years ago

lizardfs-admin metadataserver-status --porcelain localhost 9421

That would seem to do the trick. Would it be a good idea to do a "save-metadata" first, or does it automatically flush any data on restart?
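
For reference, a minimal sketch of what I have in mind (untested; I'm assuming save-metadata requires the master's admin password, set via ADMIN_PASSWORD in mfsmaster.cfg):

# Check whether the local metadata server reports itself as master or shadow
lizardfs-admin metadataserver-status --porcelain localhost 9421

# Explicitly flush metadata to disk before a restart (prompts for the admin password)
lizardfs-admin save-metadata localhost 9421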

Blackpaw commented 7 years ago

The following logic has ended up working very well:

case $STATE in
        "MASTER") logger -t lizardha  -s "MASTER state"
                  ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg

                  # Get local master status
                  STATUS=$(lizardfs-admin metadataserver-status --porcelain localhost 9421)
                  logger -t lizardha  -s "local master status = ${STATUS}"

                  # check if already master
                  if [[ $STATUS == *"master"* ]]; then
                    logger -t lizardha  -s "Already master, doing nothing"
                    exit 0
                  fi

                  # Check if connected to current master
                  if [[ $STATUS == *"disconnected"* ]]; then
                    logger -t lizardha  -s "Disconnected from master, restarting"
                    restart_master_server
                  else
                    logger -t lizardha  -s "Connected to master, reloading"
                    systemctl reload lizardfs-master.service || systemctl start lizardfs-master.service
                  fi

                  # Exit
                  exit 0
                  ;;

Flipping between master/shadow is very smooth, with no chunks going under goal. I started up a number of VMs on the three nodes so the system was under heavy load, then hard reset all three simultaneously; the system came back up automatically with no data loss.

guestisp commented 7 years ago

Interesting. Are you using keepalived? Could you please share your full keepalived configuration and up/down scripts?

Blackpaw commented 7 years ago

Yes, Keepalived:

keepalived.conf

global_defs {
  notification_email {
    admin@*****.com.au
  }
  notification_email_from lb-alert@****.com.au
  smtp_server smtp.emailsrvr.com
  smtp_connect_timeout 30
}

vrrp_script check_lizardfs_master {
        script   "pidof mfsmaster > /dev/null"
        interval 1                      # check every 1 second
        fall     1                      # requires 1 failure
}

vrrp_instance VI_1 {
    state EQUAL
    interface ipprivate
    virtual_router_id 51
    priority 50
    nopreempt
    smtp_alert
    advert_int 1
    virtual_ipaddress {
        10.10.10.249/24
        192.168.5.249/24
    }
    notify "/etc/mfs/keepalived_notify.sh"
}

/etc/mfs/keepalived_notify.sh


#!/bin/bash

TYPE=$1
NAME=$2
STATE=$3

logger -t lizardha  -s "Notify args = $*"

function restart_master_server() {
        logger -t lizardha  -s "Stopping lizardfs-master service"
        systemctl stop lizardfs-master.service
        if [ -f /var/lib/mfs/meta/metadata.mfs.lock ];
        then
                logger -t lizardha  -s "Lock file found, assuming bad shutdown"
        fi
        logger -t lizardha  -s "Starting lizardfs-master service"
        systemctl restart lizardfs-master.service
        systemctl restart lizardfs-cgiserv.service
        logger -t lizardha  -s "done."
}

case $STATE in
        "MASTER") logger -t lizardha  -s "MASTER state"
                  ln -sf /etc/mfs/mfsmaster.master.cfg /etc/mfs/mfsmaster.cfg

                  # Get local master status
                  STATUS=$(lizardfs-admin metadataserver-status --porcelain localhost 9421)
                  logger -t lizardha  -s "local master status = ${STATUS}"

                  # check if already master
                  if [[ $STATUS == *"master"* ]]; then
                    logger -t lizardha  -s "Already master, doing nothing"
                    exit 0
                  fi

                  # Check if connected to current master
                  if [[ $STATUS == *"disconnected"* ]]; then
                    logger -t lizardha  -s "Disconnected from master, restarting"
                    restart_master_server
                  else
                    logger -t lizardha  -s "Connected to master, reloading"
                    systemctl reload lizardfs-master.service || systemctl start lizardfs-master.service
                  fi

                  # Exit
                  exit 0
                  ;;
        "BACKUP") logger -t lizardha  -s "BACKUP state"
                  ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
                  restart_master_server
                  exit 0
                  ;;
        "STOP")  logger -t lizardha  -s "STOP state"
                  # Do nothing for now
                  # ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
                  # systemctl stop lizardfs-master.service
                  exit 0
                  ;;
        "FAULT")  logger -t lizardha  -s "FAULT state"
                  ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
                  restart_master_server
                  exit 0
                  ;;
        *)        logger -t lizardha  -s "unknown state $STATE"
                  ln -sf /etc/mfs/mfsmaster.shadow.cfg /etc/mfs/mfsmaster.cfg
                  restart_master_server
                  exit 1
                  ;;
esac

However, I've since disabled the keepalived config and will handle failover manually while I consider some sodding big holes in it. There are issues with keepalived grabbing master before the network is active, leading to the metadata being backdated, and I have concerns about what could happen if a node was down for a while, then came back up and grabbed master.

A floating IP with a quorum election would be better, which is not possible with keepalived. Maybe I'll investigate corosync, as it's already integrated with Proxmox.

Or pay SkyTechnology! If I can demonstrate the improved performance and flexibility to my boss, we could probably swing it.

guestisp commented 7 years ago

Or put pressure on SkyTechnology to publish the native HA solution, announced so long ago and never released.

MarkyM commented 7 years ago

@Blackpaw I work with SkyTechnology on the LizardFS project and would be happy to work with you on showing improved performance with our uRaft solution. mark.mulrainey@lizardfs.com

Zorlin commented 7 years ago

@MarkyM as @guestisp mentioned, it was promised that uRaft would be released and/or open sourced during the next major LizardFS release, at the time believed to be 3.12.0.

Have plans changed? A free uRaft is essential to some of my upcoming deployments and is a huge win for open LizardFS over MooseFS.

eriklindahl commented 7 years ago

Agreed with @Zorlin. I think everyone understands that things can take longer than expected, but a confirmation that SkyTechnology will stand by that word, and an updated time estimate, would be appreciated.

foysalkayum commented 7 years ago

I think that if the people at SkyTechnology release the HA as open source, LizardFS will certainly gain enormous popularity and respect in the SDS arena, and in the long run the company will benefit more commercially. To become a successful open source project, it should be fully open sourced, not released with a limited feature set or commercially locked-in features.

blink69 commented 7 years ago

Yes, we mentioned that - it was me, to be precise :) In 3.12 we will provide native NFS 3.0, 4.0 and pNFS 4.1 support based on the Ganesha software - as someone already spotted in our sources. Also in 3.12 we will provide support for ACLs with AD integration. Our roadmap is based on customer requests for a simple reason - we need funds to continue development. Most of it is released as open source straight away to share our ideas and work with everybody - open source is the way to go! Now, going back to the HA mechanism - we have to postpone that. There is still an ongoing debate about which one to release - lizardfs-uraft or the built-in HA in the master module that is still under construction. We hope that will happen in the next 6 months.

foysalkayum commented 7 years ago

Oh great, thank you for the update. Looking forward to the next 6 months. Actually, I am delaying putting our LizardFS storage into production until there is a proper HA solution.

Blackpaw commented 7 years ago

Thanks for the update blink, fascinating:

Currently in 3.12 we will provide native NFS 3.0, 4.0 and pNFS 4.1 support based on Ganesha software

VMware users will love that. And perhaps an iSCSI plugin in the future?

ACL with AD integration.

Now that is interesting

There is still an ongoing debate which one to release - lizardfs-uraft or build-in HA in master module that is still under construction

I'm happy to wait; it's very important to get this right. And after some experience with floating-IP-style failover using external tools (keepalived, ucarp), something integrated into the master sounds much more reliable. Looking forward to hearing more detail on that.

If you need alpha/beta testers, I'm happy to fire up a few VMs on a VLAN to test.

foysalkayum commented 7 years ago

We are already testing a LizardFS setup with around 120 TB of storage, but without HA. We are running around 25 VMs on the LizardFS storage backend but don't feel confident enough to try Keepalived for HA. I am also happy to test any HA solution on our cluster with the VM backend to check the performance.

guestisp commented 7 years ago

+1 for the beta test. I'm testing LizardFS right now on a small cluster, and the only thing preventing us from putting it into production is the missing HA support.

I've seen that the official docs refer to uraft; maybe releasing it as-is for testing purposes would result in many more tests by the community.

More tests mean more bugfixes and a more stable product.

Anyway, I'm with @Blackpaw: an integrated solution could be better than external tools. Maybe you could integrate some failover mechanism into the master so that there is no need to configure anything - just add as many shadows as you want and LizardFS does the rest (like Ceph: just add MONs, no need to configure failover).

guestisp commented 7 years ago

@psarna maybe this is a stupid proposal, but why not allow a shadow to also be a master without reconfiguration? Something like master-master replication in MySQL.

HAProxy could be used in place of keepalived/ucarp, monitoring 2 or more masters. As every shadow is always in sync, and with my proposal there is no difference between shadow and master (all shadows can accept connections from clients), a failover is as simple as moving the IP or pointing the HAProxy backend to a different server.

HAProxy is able to run periodic checks against every backend, so any shadow not connected to the master would be removed from rotation, with a check something like this:

[[ "$(lizardfs-admin metadataserver-status --porcelain backend.server.tld 9421)" == *"disconnected"* ]]
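
As a rough sketch (untested), HAProxy's external-check mechanism could call a small script like the one below; I'm assuming HAProxy passes the backend server address and port as the third and fourth arguments:

#!/bin/bash
# Hypothetical health check for HAProxy's "external-check command".
# Arguments: <proxy_addr> <proxy_port> <server_addr> <server_port>
SERVER_ADDR=$3
SERVER_PORT=$4

STATUS=$(lizardfs-admin metadataserver-status --porcelain "$SERVER_ADDR" "$SERVER_PORT")

# A shadow that has lost its connection to the master (or a node that does not
# answer at all) is unhealthy; an acting master or a connected shadow is healthy.
if [[ -z $STATUS || $STATUS == *"disconnected"* ]]; then
    exit 1
fi
exit 0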

If the current active master dies, HAProxy will move connections to one of the surviving hosts (it also knows which "shadows" are in sync, thanks to the monitoring check written above, so shadows that are out of sync won't be chosen by HAProxy).

You don't have to run any external script during failover: as all shadows are also masters, no failover step is needed.

Anyway, if the master is able to maintain a list of connected shadows (and shadows are able to accept connections, as in my proposal), mfsmount could be customized to automatically reconnect to a different shadow. On first connection, mfsmount fetches a list of shadow nodes from the master node, then does the same every X seconds to stay updated when nodes are added or removed. A failover, in this way, would be managed natively by mfsmount. mfsmount is already able to reconnect to a master, but this way it would be able to reconnect to a different one. Which one? Probably the one with the highest metadata version number.

4Dolio commented 7 years ago

I'm using corosync+pacemaker for failover and it has worked great for the last two years. I am all for better integration, but I also want to ask that backwards compatibility be maintained for as long as possible. Would love to see shadows be able to do something beyond being a hot standby before seeing built-in HA. I am not sure whether I like the idea of built-in HA more than the ability to choose from many other HA options; it all depends on what the built-in HA ends up looking like and how well it works.

guestisp commented 7 years ago

I'm using corosync+pacemaker for failover and it has worked great for the last two years.

Did you publish your corosync+pacemaker setup? Is it "open source"?

Would love to see shadows be able to do something beyond being a hot standby before seeing built-in HA

Me too.

4Dolio commented 7 years ago

See some old HA issues: #184, #299, #326, #369, #371, #389, #617 and ... And https://github.com/4Dolio/lizardfs/tree/4Dolio-corosync-ocf-metadataserver-patch-1/src/ha-cluster, which is built on the original LizardFS metadataserver corosync script that almost worked. Please keep in mind that my scripts are not pretty and can be a little insane at times.
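
To give an idea of the shape of it, the floating metadata IP in such a corosync/pacemaker setup is just a standard Pacemaker resource. A simplified sketch (not taken from my branch, reusing the example address from the keepalived config above) could look like this, with the LizardFS metadata-server agent grouped alongside it:

# Floating IP for the metadata master, using the standard IPaddr2 resource agent
pcs resource create lizardfs-vip ocf:heartbeat:IPaddr2 \
    ip=10.10.10.249 cidr_netmask=24 \
    op monitor interval=5s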

See also https://github.com/lizardfs/lizardfs/issues/613#issuecomment-340290619

eriklindahl commented 7 years ago

Thanks a lot for the update, blink! No problem with delays, or with some features appearing in closed source first, but given the MooseFS background I think users feel a lot more comfortable knowing you're firmly behind the open source model for all components.

Zorlin commented 7 years ago

Thank you for the update. I'm a little disappointed but I agree with everyone else - it's incredibly important to do it right! I really appreciate the transparency.

zicklag commented 6 years ago

Oh, I thought that HA LizardFS was Open Source already. :confused: I actually decided to go with LizardFS because MooseFS didn't have an Open Source HA solution. I really need a clustered storage solution that I can run on my Docker Swarm. MooseFS and LizardFS look like they are perfect for my use case, but I need to be able to run it with HA.

If anybody could explain, at a high level, how they use an external tool like Keepalived, to help me understand what is necessary to get an Open Source HA solution running, that would be appreciated.

zicklag commented 6 years ago

I actually just found a LizardFS blog post that says that HA is now open sourced. Is it just not yet documented?

zicklag commented 6 years ago

OK, I found the documentation, but the lizardfs-uraft package doesn't exist. Hopefully there is a way to download/build it. :crossed_fingers:

kaitoan2000 commented 5 years ago

Could I contact sales about lizardfs-uraft or any new package?

MarkyM commented 5 years ago

Hi, drop me an email directly at mark.mulrainey@lizardfs.com
