spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Why 31-second sleep after failover in default shutdown script? #565

Closed hashemi-soroush closed 1 year ago

hashemi-soroush commented 1 year ago

Hi.

In the default shutdown script, there is a 31-second sleep after the failover command (if the instance is the master). Why 31 seconds? Isn't it too long? I did a quick skim of the documentation and used git blame to check the commit message, hoping it would explain a bit more, but no luck.
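
For reference, the flow being questioned looks roughly like this. This is a minimal sketch reconstructed from the description above, not the literal script shipped by the operator; the sentinel address, port, and master-set name "mymaster" are placeholders:

#!/bin/sh
# If this pod currently holds the master role, ask a sentinel to fail over,
# then wait a fixed 31 seconds before letting the pod terminate.
ROLE=$(redis-cli -p "${REDIS_PORT:-6379}" role | head -1)
if [ "$ROLE" = "master" ]; then
    redis-cli -h "$SENTINEL_HOST" -p 26379 SENTINEL failover mymaster
    sleep 31
fi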

miklezzzz commented 1 year ago

That is a good question https://github.com/spotahome/redis-operator/pull/434/files/fbab29deebace9782c98b21694d9c161311e4e76#r1058953059. I also wouldn't mind some details, even though executing SAVE within the termination grace period of 30 seconds is also rather debatable.
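
As a side note, the grace window the preStop hook has to finish in is the pod's terminationGracePeriodSeconds, which can be checked directly; this is a standard Kubernetes field, and the pod name below is just a placeholder:

kubectl get pod rfr-redisfailover-0 -o jsonpath='{.spec.terminationGracePeriodSeconds}'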

samof76 commented 1 year ago

@Sayed-Soroush-Hashemi I think this made it through somehow, maybe a testing regression. But in our setup we use a custom script like the following...

#!/bin/sh
set -x
run_redis_command() {
    redis-cli -h 127.0.0.1 -p ${REDIS_PORT} "$@"
}

is_master() {
    REDIS_ROLE=$(run_redis_command role | head -1)
    [ "$REDIS_ROLE" = "master" ]
}

# redis-cli automatically consumes credentials from the REDISCLI_AUTH variable
[ -n "$REDIS_PASSWORD" ] && export REDISCLI_AUTH="$REDIS_PASSWORD"

if is_master; then
    echo "Wait in case of rolling update for others come up 1m"
    sleep 60
    echo "I am the master pod and you are stopping me. Stopping client connections for next 10m"
    # Stopping client write connections to avoid data loss
    run_redis_command CLIENT PAUSE 6000000 WRITE
    # Issue the commands: (1) KILL all client connections, (2) SAVE RDB, (3) SHUTDOWN
    echo "Wait 10s before killing all connections and issuing SAVE"
    sleep 10
    run_redis_command CLIENT KILL TYPE normal
    # Sleep 1s and do a synchronous SAVE on Redis
    sleep 1
    run_redis_command SAVE
    # Notify the backup sidecar that the latest RDB is ready to be backed up
    touch /data/ok_take_backup
    # Sleep for 5s and shutdown redis
    sleep 5
    run_redis_command SHUTDOWN
else
    exit 0
fi

Couple of things to note here...

  1. We do not do an explicit failover; when writes are paused, the sentinels, depending on their configuration, automatically perform the failover.
  2. We issue a synchronous SAVE; this makes sure that reads also get blocked, even if some reads do land.
  3. ok_take_backup tells our backup sidecar to back up the latest RDB file.

samof76 commented 1 year ago

Also, the 31 seconds I guess was kept there because during restarts the replica takes a while to come online.

hashemi-soroush commented 1 year ago

Also, the 31 seconds I guess was kept there because during restarts the replica takes a while to come online.

The required time for a replica to come online depends on many parameters, including the load of the node it is scheduled on. So, 31 seconds doesn't fit everybody's infrastructure and requirements. Even though the entire shutdown script is customizable using the corresponding ConfigMap, I suggest changing the default script and taking into account the large variety of infrastructure and requirements as well as the PR comment @miklezzzz kindly mentioned.

In my case, changing 31 to 10 decreased the overall downtime by 20 seconds.
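
Since the script is pluggable, one low-friction way to experiment with different values is to mount your own script via a ConfigMap and point the RedisFailover at it. A rough sketch, assuming the spec field is named shutdownConfigMap (field name from memory; verify against the CRD of the operator version you run, and the resource name is a placeholder):

# Create a ConfigMap holding the custom shutdown.sh
kubectl create configmap redis-shutdown --from-file=shutdown.sh=./shutdown.sh
# Reference it from the RedisFailover resource (field name assumed)
kubectl patch redisfailover my-redisfailover --type merge \
  -p '{"spec":{"redis":{"shutdownConfigMap":"redis-shutdown"}}}'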

samof76 commented 1 year ago

You are right. Will you be able to send in a PR for this? @Sayed-Soroush-Hashemi

miklezzzz commented 1 year ago

It seems to me that there should be something similar to what @samof76 provided in place as the shutdown script. I mean the current shutdown logic (just a sleep and a SAVE) isn't safe. I was able to lose some writes with:

  1. having 1 master and 2 slave replicas
  2. killing both replicas at the same time
  3. writing a couple of changes to the master while the replicas go through their restart cycle
  4. killing the master while the replicas are still being recreated
  5. the sentinels choose a new master from the two replicas and the changes from (3) are lost.

It isn't a concern if the app logic takes into account possible data loss, though.
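
For anyone wanting to reproduce that sequence, a rough outline; the pod names follow the rfr-<name>-N convention but are placeholders here, as are the IP variables and the key used for the test write:

# 1. kill both replicas at once
kubectl delete pod rfr-redisfailover-1 rfr-redisfailover-2 --wait=false
# 2. write to the still-running master while the replicas restart
redis-cli -h "$MASTER_IP" SET canary written-during-replica-restart
# 3. kill the master before the replicas finish their initial sync
kubectl delete pod rfr-redisfailover-0
# 4. once sentinel promotes a new master, check whether the write survived
redis-cli -h "$NEW_MASTER_IP" GET canary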

samof76 commented 1 year ago

@miklezzzz I still prefer to go with a custom script, as there is no one way of doing it. But as you suggest, it is best to have a default shutdown script that comes closest to zero data loss, since that is key; most clients other than Java-based ones have a good reputation for retrying connections.

dbackeus commented 1 year ago

An observation is that some of these issues are inherent in the design of using a single StatefulSet per cluster. A more robust solution, used in many other database operators, would be for the operator to spawn one StatefulSet for each instance in a cluster, and then put more logic into the operator to properly handle graceful rollouts of configuration changes. E.g.

samof76 commented 1 year ago

@dbackeus I am sure I understand. The OnDelete upgrade strategy helps here in solving points 1 and 2, and the rest is anyway the same when it comes to triggering the failover, right?

samof76 commented 1 year ago

@dbackeus @miklezzzz do we know of anyone other than @ese on the spotahome core team?

miklezzzz commented 1 year ago

nope, I'm pretty new here but we use this operator heavily

dbackeus commented 1 year ago

@samof76 My point is that if the operator took more control of the rollout / deployment logic there wouldn't be any need for arbitrary sleep intervals and praying that a shutdown script does the right thing before the kubernetes gracetime forcefully terminates everything. You could implement a highly deterministic state machine to do the rollouts correctly in any cluster where the operator is deployed.

I don't know anyone on the core team but we use the operator in our Kubernetes environment since it's the only maintained open source one we found supporting the Sentinel approach for HA.

samof76 commented 1 year ago

@dbackeus we need to have all cases covered. That's why we should ensure we have the right shutdown script in place. This will definitely protect against accidental deletion of the master pod.

hashemi-soroush commented 1 year ago

In the absence of the core developers, my suggestion, in order to keep this a productive discussion, is that we decide on the target of this script and go on from there.

So far, only one target has been proposed: Minimum data loss. In my use case, we are after another target: maximum availability. I suggest we all look at our use cases and say what we need from this script and why we think most of this operator's users need that. When we have everybody's opinion, if the core developers haven't shown up yet, we will think of a way to proceed.

hashemi-soroush commented 1 year ago

I'll go first: In our use case, we need high availability. For us, availability means that the master accepts connections. In my opinion, since Redis is designed as an in-memory database, fewer people are after minimum data loss than high availability.

samof76 commented 1 year ago

I will explain two scenarios here.

  1. Ungraceful shutdown of the master, i.e., a SIGKILL on the master pod

    In this case the preStop hook is useless, and we should rely on either the operator or the sentinels to step in and decide which of the available replicas will become the master. Ideally I would prefer the sentinels to act in this case, and only if they are unable to for some reason should the operator step in. Here there is no need to think about the preStop hook.

  2. Graceful shutdown of the master, i.e., a SIGTERM or SIGINT on the master pod

    Redis 6 still has some gaps in handling SIGTERM, and Redis 7 handles it much better (see below for what happens). Since we are still on Redis 6, we need to do most of what Redis 7 does ourselves in the preStop hook, and the script provided in the comments above is a best effort towards that.

Some points on the above-given script:

Why do we have the initial sleep 60?

To ensure the replicas have come up and have replicated the data.
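
If the concern is that a fixed 60 seconds may be either too short or needlessly long, the same intent can be expressed as a bounded wait on the master's own replication state. A sketch using only standard INFO replication fields; the 60-iteration bound is arbitrary, and run_redis_command is the helper from the script above:

# Wait up to ~60s until at least one replica reports state=online,
# instead of sleeping unconditionally.
wait_for_replicas() {
    i=0
    while [ "$i" -lt 60 ]; do
        if run_redis_command INFO replication | grep -q 'state=online'; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}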

Why pause writes, and what happens?

We pause the writes so we are only reading, which does not lead to data loss. The behaviour on a write pause is interesting: the sentinels also respond with a failover when we do that, after waiting for down-after-milliseconds. And even if there is no failover at that point, the synchronous SAVE covers you. So typically the downtime is what you configure in down-after-milliseconds.

The shutdown Redis 7 performs in this condition includes, among other actions, pausing writes and waiting up to shutdown-timeout for lagging replicas to catch up before saving the RDB and exiting.
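
Rather than relying on a fixed sleep after the write pause, the hook could also poll a sentinel until it reports a different master address and only then proceed to SAVE and SHUTDOWN. A sketch, assuming the sentinel is reachable at $SENTINEL_HOST:26379 and the master set is named mymaster; the terminationGracePeriod still caps how long this loop can run:

# Block until the sentinel no longer reports this pod as the master.
MYIP=$(hostname -i)
while true; do
    MASTER=$(redis-cli -h "$SENTINEL_HOST" -p 26379 --csv SENTINEL get-master-addr-by-name mymaster | cut -d',' -f1 | tr -d '"')
    if [ -n "$MASTER" ] && [ "$MASTER" != "$MYIP" ]; then
        break
    fi
    sleep 1
done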

samof76 commented 1 year ago

@Sayed-Soroush-Hashemi so, to the point: availability will depend on these things...

  1. Replicas being available in time
  2. Sentinels being able to reach a consensus and elect one replica as the master within down-after-milliseconds
  3. The time taken for the operator to step in and take the decision, which depends entirely on the master reconciliation loop, sentinel evaluation, etc.
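
On point 2, the down-after-milliseconds the sentinels actually run with can be read back at runtime, which makes it easier to reason about the expected failover window instead of guessing; the sentinel address and master-set name below are placeholders:

# Print the configured detection timeout for the monitored master
redis-cli -h "$SENTINEL_HOST" -p 26379 SENTINEL master mymaster | grep -A1 down-after-milliseconds
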
samof76 commented 1 year ago

Here are some of my notes from reading redis code...

How does the SENTINEL FAILOVER work?

# sentinel.c

# sentinelCommand() 
# ri denotes redis instance
# > Search for "SENTINEL FAILOVER"
# > sentinelSelectSlave -> checks for eligible candidates for failover
# > sentinelStartFailover -> sets the appropriate state flags for failover
#   > failover_state -> SENTINEL_FAILOVER_STATE_WAIT_START
#   > flags -> SRI_FAILOVER_IN_PROGRESS

# > Then the following code acts as a state machine on those initial flags and states

void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
    serverAssert(ri->flags & SRI_MASTER);

    if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    switch(ri->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
        // check for the leader for the failover epoch
        // sets failover_state -> SENTINEL_FAILOVER_STATE_SELECT_SLAVE
            sentinelFailoverWaitStart(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
        // select a "slave" for the promotion
        // "slave" selection in sentinelSelectSlave function ...
                //
                // this function in turn call `compareSlavesForPromotion`
                // which orders the slaves in the right category of
                // choice and if priority is the same, select the slave
                // with greater replication offset (processed more data 
                // from the master) and If the replication offset is the 
                // same select the slave with that has the lexicographically
                // smaller runid.
                //
        // ... and now chooses the 0th index
        // slave->flags |= SRI_PROMOTED
        // ri->promoted_slave = slave
        // ri->failover_state = SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE
            sentinelFailoverSelectSlave(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
        // check if the selected slave is reachable otherwise abort
        // set the promoted_slave as the slave of no one
        // ri->failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION
            sentinelFailoverSendSlaveOfNoOne(ri);
            break;
        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
        /* We actually wait for promotion indirectly checking with INFO when the
         * slave turns into a master. */
            sentinelFailoverWaitPromotion(ri);
            break;
        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
        /* Send SLAVE OF <new master address> to all the remaining slaves that
         * still don't appear to have the configuration updated. */
            sentinelFailoverReconfNextSlave(ri);
            break;
    }
}
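
As a usage note, that state machine can be exercised by hand against a running sentinel (port 26379 and the master-set name mymaster are assumptions) and observed by re-querying the reported master address once promotion completes:

redis-cli -p 26379 SENTINEL FAILOVER mymaster
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster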

How does FAILOVER work?

/* replication.c
 * FAILOVER [TO <HOST> <PORT> [FORCE]] [ABORT] [TIMEOUT <timeout>]
 * 
 * This command will coordinate a failover between the master and one
 * of its replicas. The happy path contains the following steps:
 * 1) The master will initiate a client pause write, to stop replication
 * traffic.
 * 2) The master will periodically check if any of its replicas has
 * consumed the entire replication stream through acks. 
 * 3) Once any replica has caught up, the master will itself become a replica.
 * 4) The master will send a PSYNC FAILOVER request to the target replica, which
 * if accepted will cause the replica to become the new master and start a sync.
 * 
 * FAILOVER ABORT is the only way to abort a failover command, as replicaof
 * will be disabled. This may be needed if the failover is unable to progress. 
 * 
 * The optional arguments [TO <HOST> <IP>] allows designating a specific replica
 * to be failed over to.
 * 
 * FORCE flag indicates that even if the target replica is not caught up,
 * failover to it anyway. This must be specified with a timeout and a target
 * HOST and IP.
 * 
 * TIMEOUT <timeout> indicates how long should the primary wait for 
 * a replica to sync up before aborting. If not specified, the failover
 * will attempt forever and must be manually aborted.
 */
// failoverCommand() sets the server.failover_state = FAILOVER_WAIT_FOR_SYNC
// and it pauses all connections

// replicationCron(void), runs every second > updateFailoverStatus()
// updateFailoverStatus operates FAILOVER_WAIT_FOR_SYNC
//// checks for timeout if any
//// no target is given in the failover then computes the target with repl_offset
//// when replica has caught up, failover_state is set to FAILOVER_IN_PROGRESS
//// call replicationSetMaster(host, port)
void failoverCommand(client *c) {
    if (server.cluster_enabled) {
        addReplyError(c,"FAILOVER not allowed in cluster mode. "
                        "Use CLUSTER FAILOVER command instead.");
        return;
    }

    /* Handle special case for abort */
    if ((c->argc == 2) && !strcasecmp(c->argv[1]->ptr,"abort")) {
        if (server.failover_state == NO_FAILOVER) {
            addReplyError(c, "No failover in progress.");
            return;
        }

        abortFailover("Failover manually aborted");
        addReply(c,shared.ok);
        return;
    }

    long timeout_in_ms = 0;
    int force_flag = 0;
    long port = 0;
    char *host = NULL;

    /* Parse the command for syntax and arguments. */
    for (int j = 1; j < c->argc; j++) {
        if (!strcasecmp(c->argv[j]->ptr,"timeout") && (j + 1 < c->argc) &&
            timeout_in_ms == 0)
        {
            if (getLongFromObjectOrReply(c,c->argv[j + 1],
                        &timeout_in_ms,NULL) != C_OK) return;
            if (timeout_in_ms <= 0) {
                addReplyError(c,"FAILOVER timeout must be greater than 0");
                return;
            }
            j++;
        } else if (!strcasecmp(c->argv[j]->ptr,"to") && (j + 2 < c->argc) &&
            !host) 
        {
            if (getLongFromObjectOrReply(c,c->argv[j + 2],&port,NULL) != C_OK)
                return;
            host = c->argv[j + 1]->ptr;
            j += 2;
        } else if (!strcasecmp(c->argv[j]->ptr,"force") && !force_flag) {
            force_flag = 1;
        } else {
            addReplyErrorObject(c,shared.syntaxerr);
            return;
        }
    }

    if (server.failover_state != NO_FAILOVER) {
        addReplyError(c,"FAILOVER already in progress.");
        return;
    }

    if (server.masterhost) {
        addReplyError(c,"FAILOVER is not valid when server is a replica.");
        return;
    }

    if (listLength(server.slaves) == 0) {
        addReplyError(c,"FAILOVER requires connected replicas.");
        return; 
    }

    if (force_flag && (!timeout_in_ms || !host)) {
        addReplyError(c,"FAILOVER with force option requires both a timeout "
            "and target HOST and IP.");
        return;     
    }

    /* If a replica address was provided, validate that it is connected. */
    if (host) {
        client *replica = findReplica(host, port);

        if (replica == NULL) {
            addReplyError(c,"FAILOVER target HOST and PORT is not "
                            "a replica.");
            return;
        }

        /* Check if requested replica is online */
        if (replica->replstate != SLAVE_STATE_ONLINE) {
            addReplyError(c,"FAILOVER target replica is not online.");
            return;
        }

        server.target_replica_host = zstrdup(host);
        server.target_replica_port = port;
        serverLog(LL_NOTICE,"FAILOVER requested to %s:%ld.",host,port);
    } else {
        serverLog(LL_NOTICE,"FAILOVER requested to any replica.");
    }

    mstime_t now = mstime();
    if (timeout_in_ms) {
        server.failover_end_time = now + timeout_in_ms;
    }

    server.force_failover = force_flag;
    server.failover_state = FAILOVER_WAIT_FOR_SYNC;
    /* Cluster failover will unpause eventually */
    pauseClients(LLONG_MAX,CLIENT_PAUSE_WRITE);
    addReply(c,shared.ok);
}
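
For completeness, the command-level usage corresponding to the code above (available since Redis 6.2; host, port, and timeout are example values):

# Coordinate a failover to a specific replica, giving it 60s to catch up
redis-cli FAILOVER TO 10.0.0.12 6379 TIMEOUT 60000
# If the failover cannot make progress, the only way out is an explicit abort
redis-cli FAILOVER ABORT
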
hashemi-soroush commented 1 year ago

@samof76 thanks for the comments. My main issue with your script is the long delays with no guarantee that the thing you were waiting for has actually happened when the sleep duration is over. Others like @miklezzzz have raised the same issue in the comments, with both the current script and yours. I think there are ways to check whether, for example, a replica is synced with the master. These methods can be used to create a state machine instead of a string of sleeps, as @dbackeus suggested.

But, again, I suggest before we start discussing the implementation, we first decide on the target of this script. It'd be great if others could say what they need from this script.
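
One concrete building block for such a state machine: instead of sleeping, the hook (or the operator) can poll a replica's own view of the sync. A minimal sketch using standard INFO replication fields; the port default is an assumption:

# On a replica, block until its link to the master is reported as up.
until redis-cli -p "${REDIS_PORT:-6379}" INFO replication | grep -q 'master_link_status:up'; do
    sleep 1
done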

samof76 commented 1 year ago

@Sayed-Soroush-Hashemi thanks for clarifying. You are right that we should never sleep for arbitrary lengths; instead we should base the durations on something measurable. Let me come up with something.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.