Possibility to restart sync after an OUT_OF_SYNC without having to shut down both servers

saveriocastellano commented 4 years ago

@ideawu I have two SSDB instances, MasterA and MasterB, with sync=mirror. When they go OUT_OF_SYNC, is there any possibility to avoid shutting down at least one of the nodes? I'm only actively using MasterA, so when MasterB goes out of sync with it, instead of having to shut down both instances (and removing 'data' and 'meta' in MasterB and 'meta' in MasterA), I wonder if it would be possible to do the following instead:

a) shut down ONLY MasterB and delete its "meta" and "data" directories
b) write specific keys in the "meta" of MasterA to reset the status of MasterB, so that MasterA will accept MasterB starting to sync from scratch

Regarding b), I had a look at slave.cpp and I see that the status is stored in "meta" under these keys:

name of hash key = "slave.status." + this->id_; values: "last_key", "last_seq"

So I was wondering whether writing the right key, with "last_seq" set to zero and "last_key" set to an empty string, would let MasterA accept MasterB starting to sync from scratch without having to restart MasterA.
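To make the key layout described above concrete, here is a minimal sketch of how the hash name is composed and what a "start from scratch" status would look like. MetaStore, slave_status_key and reset_slave_status are hypothetical names invented for this illustration, and the in-memory map only stands in for the real "meta" database. How SSDB actually encodes last_seq on disk is not shown in this thread, so the string "0" is an assumption.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the "meta" store:
// hash name -> (field -> value). Not the real SSDB storage API.
using MetaStore = std::map<std::string, std::map<std::string, std::string>>;

// Builds the hash name quoted from slave.cpp: "slave.status." + id.
std::string slave_status_key(const std::string &slave_id) {
    return "slave.status." + slave_id;
}

// Writes the "sync from scratch" state: last_seq = 0, last_key = "".
// Storing last_seq as a decimal string is an assumption of this sketch.
void reset_slave_status(MetaStore &meta, const std::string &slave_id) {
    std::map<std::string, std::string> &fields = meta[slave_status_key(slave_id)];
    fields["last_seq"] = std::to_string(static_cast<uint64_t>(0));
    fields["last_key"] = "";
}
```

Note that this only shows the shape of the stored status. Whether changing it while MasterA is running is picked up by the live Slave object is exactly the open question in this post.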

ideawu commented 4 years ago

Maybe we need add_slave and del_slave command.

ghen2 commented 4 years ago

Or just a reset_meta command so it will rediscover slave(s) status, equal to deleting meta/ directory?

saveriocastellano commented 4 years ago

I'm adding this method in "SSDBServer". I have been looking at the code and it seems to be the correct thing to do:

void SSDBServer::resetSync() {
    log_info("resetting sync state...");
    delete backend_sync;
    backend_sync = new BackendSync(this->ssdb, this->sync_speed);

    std::vector<Slave *>::iterator it;
    for(it = slaves.begin(); it != slaves.end(); it++){
        Slave *slave = *it;
        slave->stop();

        slave->last_seq = 0;
        slave->last_key = "";
        slave->save_status();

        slave->start();
    }

    log_info("sync state reset");
}

Then in "proc_sys.cpp" I added this handler, which is bound to a new "syncreset" command I defined:

int proc_syncreset(NetworkServer *net, Link *link, const Request &req, Response *resp) {
    SSDBServer *serv = (SSDBServer *)net->data;
    CHECK_NUM_PARAMS(0);
    serv->resetSync();
    return 0;
}

I'm now going to test it and I will write here how it goes.

@ideawu if you see anything wrong in my code or you think it's just not the right way then it'd help me to know it. Thanks

ideawu commented 4 years ago

Do not do anything to BackendSync; it will take effect when any slave is disconnected.
Slave is not designed to work after a call to stop(), so you should delete the old one and create a new one. Take a look at https://github.com/ideawu/ssdb/blob/0e93e2aaa018abd2332478002f0664098049b23a/src/proc_sys.cpp#L261
You must not reset all slaves; just delete the expected one, and then add a new one.
Execute del_slave and then add_slave operation on both A and B
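The advice above (remove only the targeted slave, never reuse a stopped object, then add a fresh one) can be sketched as follows. MockSlave, SlaveList, del_slave and add_slave are hypothetical stand-ins written for this illustration; they are not the real SSDB Slave class or existing SSDB commands.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a replication worker; not the real SSDB Slave.
struct MockSlave {
    std::string id;
    std::string host;
    int port;
    bool is_mirror;
    uint64_t last_seq = 0;   // a fresh object starts at sequence 0
    std::string last_key;    // and with an empty last key
    bool running = false;

    MockSlave(std::string id_, std::string host_, int port_, bool mirror)
        : id(std::move(id_)), host(std::move(host_)), port(port_), is_mirror(mirror) {}
    void start() { running = true; }
    void stop() { running = false; }
};

using SlaveList = std::vector<std::unique_ptr<MockSlave>>;

// Stops and destroys only the slave with the given id; other slaves are
// left untouched. Returns false if no slave has that id.
bool del_slave(SlaveList &slaves, const std::string &id) {
    for (auto it = slaves.begin(); it != slaves.end(); ++it) {
        if ((*it)->id == id) {
            (*it)->stop();
            slaves.erase(it);  // unique_ptr frees the stopped object here
            return true;
        }
    }
    return false;
}

// Creates a brand-new slave object instead of restarting a stopped one.
void add_slave(SlaveList &slaves, const std::string &id,
               const std::string &host, int port, bool is_mirror) {
    std::unique_ptr<MockSlave> slave(new MockSlave(id, host, port, is_mirror));
    slave->start();
    slaves.push_back(std::move(slave));
}
```

Calling del_slave(slaves, "b") followed by add_slave(slaves, "b", host, port, true) leaves every other entry in the list running and gives "b" a clean sync position.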

saveriocastellano commented 4 years ago

Alright, so as you said in your previous post, this is just a matter of adding a "del_slave" command. For "add_slave" there is no need to add a new command, right? I can just use the existing "slaveof" command, correct?

ideawu commented 4 years ago

No, slaveof command does not support mirror replication.

saveriocastellano commented 4 years ago

but then I don't understand why you say:

"Execute del_slave and then add_slave operation on both A and B"

If I'm going to shut down B and delete its "meta" and "data" directories, why do I need to "delete" and "add" the slave on B as well? Node B will already have Node A defined in its configuration file, so it will connect to it when it starts.

ideawu commented 4 years ago

If you shut down B and delete its meta and data folders, then you don't need to invoke the delete and add slave commands on it.

saveriocastellano commented 4 years ago

thanks! That's what I thought.

I have implemented the command and tested it. So far it seems to work well. Here is my code, please tell me what you think.

In Serv.cpp I added this:

int SSDBServer::resetslave(const std::string &id) {
    Slave *slave = NULL;
    std::vector<Slave *>::iterator it;
    for(it = slaves.begin(); it != slaves.end(); it++){     
        if ((*it)->get_id()==id) {
            slave = *it;
            slaves.erase(it);
            break;
        }       
    }   
    if (slave) {
        log_info("resetting slave...");
        delete slave;
        this->slaveof(slave->get_id(), slave->get_host(), slave->get_port(), std::string("")/*auth*/, 0/*last_seq*/, std::string("")/*last_key*/, slave->get_is_mirror(), 0);
        slave->start();
        slaves.push_back(slave);    
        log_info("slave reset.");
        return 0;
    } else {
        return -1;
    }
}

In proc_sys.cpp I added this:

int proc_resetslave(NetworkServer *net, Link *link, const Request &req, Response *resp) {
    SSDBServer *serv = (SSDBServer *)net->data;
    CHECK_NUM_PARAMS(1);
    std::string id = req[1].String();   
    int res = serv->resetslave(id);
    if (res<0) {
        resp->push_back("not_found");
    } else {
        resp->push_back("ok");  
    }
    return 0;
}
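One detail in the resetslave() snippet above is worth flagging: slave is deleted and then slave->get_id(), slave->get_host() and slave->start() are called on the same pointer, and reading an object after delete is undefined behavior in C++. A minimal sketch of one way to avoid this is to copy the settings out before freeing the object. OldSlave, SlaveConfig and take_config_and_delete are hypothetical names used only for this illustration, and the field names are assumptions.

```cpp
#include <string>

// Hypothetical, simplified stand-in for the object being replaced.
struct OldSlave {
    std::string id;
    std::string host;
    int port;
    bool is_mirror;
};

// Plain copy of the settings needed to recreate the slave later.
struct SlaveConfig {
    std::string id;
    std::string host;
    int port;
    bool is_mirror;
};

// Copies the settings first and only then frees the object, so nothing
// reads freed memory. The caller must not use the pointer afterwards.
SlaveConfig take_config_and_delete(OldSlave *slave) {
    SlaveConfig cfg{slave->id, slave->host, slave->port, slave->is_mirror};
    delete slave;
    return cfg;
}
```

The returned SlaveConfig can then be passed to whatever creates the replacement, with no further access to the deleted pointer.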

saveriocastellano commented 4 years ago

The above code seems to work, however in the logs of MasterA (the master on which I have executed the new "resetslave" command) I get this:

2020-02-19 14:37:06.323 [ERROR] slave.cpp(261): the master hasn't responsed for awhile, reconnect...
2020-02-19 14:37:06.336 [INFO ] slave.cpp(187): [localhost|8889][0] connecting to master at localhost:8889...
2020-02-19 14:37:06.337 [INFO ] slave.cpp(216): [localhost|8889] ready to receive binlogs

Strangely, after restarting MasterB I see that the two nodes are in sync, and data written to MasterA becomes available in MasterB.

Do you know what could be the reason for that error I get in the log?

saveriocastellano commented 4 years ago

Oh... I think it's because I forgot to call "slave->stop();" to stop the slave's thread!

Here is the updated code:


int SSDBServer::resetslave(const std::string &id) {
    Slave *slave = NULL;
    std::vector<Slave *>::iterator it;
    for(it = slaves.begin(); it != slaves.end(); it++){     
        if ((*it)->get_id()==id) {
            slave = *it;
            slaves.erase(it);
            break;
        }       
    }   
    if (slave) {
        log_info("resetting slave...");
        Slave *newSlave = new Slave(ssdb, meta, slave->get_host().c_str(), slave->get_port(), slave->get_is_mirror());
        slave->stop();
        delete slave;
        newSlave->save_status();
        newSlave->start();
        slaves.push_back(newSlave); 
        log_info("slave reset.");
        return 0;
    } else {
        return -1;
    }
}

saveriocastellano commented 4 years ago

hi all,

I managed to get this working by using my initial method, which consists of deleting and restarting "BackendSync".

I have been using it in production for the past month and it is working very well: no need to restart nodes when they go OUT_OF_SYNC now.

Here is my code: https://github.com/saveriocastellano/ssdb

jgod commented 4 years ago

Very useful idea 👍

rhessing commented 4 years ago

If someone is interested in this useful function, I've patched the code a bit so that it runs with the latest code changes in the master branch: https://github.com/rhessing/ssdb

Still testing it, but a standalone version does work without issues when running the ssdb benchmark.

@saveriocastellano please let me know if you would like to have a pull request :-)

saveriocastellano commented 3 years ago

@rhessing very good, thanks for letting me know

ideawu / ssdb

Possibility to restart sync after a OUT_OF_SYNC without having to shutdown both servers #1333
