spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0
1.5k stars 356 forks source link

Graceful shutdown #504

Closed samof76 closed 1 year ago

samof76 commented 1 year ago

Graceful shutdown of redis is different for different version of redis...

6.2.x expects you call the FAILOVER in the master redis, and it does the right thing, as documented here. Except that you would not call BGSAVE yourself. 7.x the SIGTERM is handled gracefully by redis and it does the right things for you as documented here

The current method of calling failover on sentinel might not be very graceful, as you can see from my notes on the redis code.

sentinel.c

sentinelCommand() 
ri denotes redis instance
> Search for "SENTINEL FAILOVER"
> sentinelSelectSlave -> checks for eligible candidates for failover
> sentinelStartFailover -> sets the appropriate state flags for failover
    > failover_state -> SENTINEL_FAILOVER_STATE_WAIT_START
    > flags -> SRI_FAILOVER_IN_PROGRESS

> Then following code acts a state machine on those initial flags and states

void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
    serverAssert(ri->flags & SRI_MASTER);

    if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    switch(ri->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
        // check for the leader for the failover epoch
        // sets failover_state -> SENTINEL_FAILOVER_STATE_SELECT_SLAVE
            sentinelFailoverWaitStart(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
        // select a "slave" for the promotion
        // "slave" selection in sentinelSelectSlave function ...
                //
                // this function in turn call `compareSlavesForPromotion`
                // which orders the slaves in the right category of
                // choice and if priority is the same, select the slave
                // with greater replication offset (processed more data 
                // from the master) and If the replication offset is the 
                // same select the slave with that has the lexicographically
                // smaller runid.
                //
        // ... and now chooses the 0th index
        // slave->flags |= SRI_PROMOTED
        // ri->promoted_slave = slave
        // ri->failover_state = SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE
            sentinelFailoverSelectSlave(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
        // check if the selected slave is reachable otherwise abort
        // set the promoted_slave as the slave of no one
        // ri->failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION
            sentinelFailoverSendSlaveOfNoOne(ri);
            break;
        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
        /* We actually wait for promotion indirectly checking with INFO when the
     * slave turns into a master. */
            sentinelFailoverWaitPromotion(ri);
            break;
        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
     /* Send SLAVE OF <new master address> to all the remaining slaves that
     * still don't appear to have the configuration updated. */
        sentinelFailoverReconfNextSlave(ri);
            break;
    }
}

One of the key issue with this model is that it does not wait for the replicas to sync the data, before failover and might lead to dataloss. Here is something i found on the bitnami charts repo.

. /opt/bitnami/scripts/libvalidations.sh
. /opt/bitnami/scripts/libos.sh

run_redis_command() {
    if is_boolean_yes "$REDIS_TLS_ENABLED"; then
        redis-cli -h 127.0.0.1 -p "$REDIS_TLS_PORT" --tls --cert "$REDIS_TLS_CERT_FILE" --key "$REDIS_TLS_KEY_FILE" --cacert "$REDIS_TLS_CA_FILE" "$@"
    else
        redis-cli -h 127.0.0.1 -p ${REDIS_PORT} "$@"
    fi
}
failover_finished() {
    REDIS_ROLE=$(run_redis_command role | head -1)
    [[ "$REDIS_ROLE" != "master" ]]
}

# redis-cli automatically consumes credentials from the REDISCLI_AUTH variable
[[ -n "$REDIS_PASSWORD" ]] && export REDISCLI_AUTH="$REDIS_PASSWORD"
[[ -f "$REDIS_PASSWORD_FILE" ]] && export REDISCLI_AUTH="$(< "${REDIS_PASSWORD_FILE}")"

if ! failover_finished; then
    echo "I am the master pod and you are stopping me. Stopping client connections"
    # Stopping client write connections to avoid data loss
    run_redis_command CLIENT PAUSE "{{ mul (sub .Values.sentinel.terminationGracePeriodSeconds 10) 1000 }}" WRITE
    # if I am the master, issue a command to failover once and then wait for the failover to finish
    echo "Waiting for sentinel to run failover for up to {{ sub .Values.sentinel.terminationGracePeriodSeconds 10 }}s"
    retry_while "failover_finished" "{{ sub .Values.sentinel.terminationGracePeriodSeconds 10 }}" 1
else
    exit 0
fi

I was thinking maybe the prestop script could be an option to both redis and sentinel, in the CRD. Any thoughts?

samof76 commented 1 year ago

@ese i am implementing starting implement, customPrestop for both redis and sentinel specs, unless you have some other opinion.

Also cc: @Wouter0100 your thoughts on this?

Wouter0100 commented 1 year ago

Personally I'd say we should ship the correct graceful shutdown depending on the Redis version and that the operator should automatically select the correct one.

The operator is designed to remove the management of Redis, and if we still need to make changes to the spec depending on the version - it feels weird.

samof76 commented 1 year ago

Partially agree that the operator would need to be opinionated about certain things. But there might be sometimes, like in this case... where in the use might want to do multiple things like, ensure the replica sync is done, rdb snapshot it take, possibly notify another service, etc. that way it's better to allow the user define a script.

Wouter0100 commented 1 year ago

I def also agree with the option to add custom things, yeah :)

samof76 commented 1 year ago

Oops! We already have a reference to config map containing shutdown script. Closing this.