redis / rueidis

A fast Golang Redis client that supports Client Side Caching, Auto Pipelining, Generics OM, RedisJSON, RedisBloom, RediSearch, etc.
Apache License 2.0

Error during a failover #384

Closed arielzach closed 1 year ago

arielzach commented 1 year ago

Hello,

I'm using ElastiCache with the cluster-require-full-coverage config set to 'no'. When I perform a failover on a 3-node cluster, I notice that the client stops reading. It should continue reading from the other two nodes until the failover completes.

rueian commented 1 year ago

Hi @arielzach,

Could you provide more details about your setup? Such as your rueidis version and the roles of your 3 redis nodes. Are the two nodes replicas of the failed node?

arielzach commented 1 year ago

We are using version 1.0.19, and we have a cluster with 3 masters and 3 replicas (Redis 7). When I execute failover on one of the nodes, the others remain available and should continue to receive traffic.

rueian commented 1 year ago

Hi @arielzach. I do find that ElastiCache's configuration endpoint may keep responding with stale CLUSTER SHARDS information until reconnection.

However, I was hit by the failover rate limit, which only allows 5 failovers per day, so this may take me more time to fix.

arielzach commented 1 year ago

If you need me to test any version, let me know; in my case, I haven't seen the 5 failover warning again.

rueian commented 1 year ago

Hi @arielzach, I have pushed the v1.0.20-pre.1 tag which is based on https://github.com/redis/rueidis/pull/387. It will periodically and proactively refresh shards from the configuration endpoint and should fix your issue. Could you have a try?

arielzach commented 1 year ago

Hello @rueian, I just tried it, it worked perfectly. Failover is now almost instantaneous, and it continues writing to the other nodes. The issue I'm noticing is that it's constantly executing the CLUSTER SLOTS/SHARDS command, at a rate of 60 RPM per node, which is quite high considering that, for example, we have applications with 200 nodes, resulting in 12k RPM.

rueian commented 1 year ago

The issue I'm noticing is that it's constantly executing the CLUSTER SLOTS/SHARDS command, at a rate of 60 RPM per node

Yes, you are right. The v1.0.20-pre.1 refreshes shards periodically. I will see if there is a better way to do that.

rueian commented 1 year ago

Hi @arielzach, the v1.0.20-pre.2 will not refresh shards periodically but will refresh when the underlying connections are closed unexpectedly, such as on timeouts and connection errors. This reduces unnecessary CLUSTER SLOTS/SHARDS commands but may take about 5~10 seconds to react to topology changes. Would you like to have a try?

arielzach commented 1 year ago

Hello, I've been running tests, and it's not working correctly. I tried it several times, and sometimes it reconnects without any issues, while other times it simply loses the connection, and no more traffic is visible in the cluster.

arielzach commented 1 year ago

The tests I mentioned earlier were done with DisableRetry=true. Tomorrow, I will conduct tests with DisableRetry=false on AWS. I conducted local tests with Redis in cluster mode with 3 nodes and "cluster-require-full-coverage no." When I killed a node, with DisableRetry=true, the client didn't reestablish the connection. On the other hand, with DisableRetry=false, it worked perfectly, even when restarting the entire cluster.

rueian commented 1 year ago

Hi @arielzach,

Thank you very much for trying v1.0.20-pre.2 and your result helps a lot!

I tried reproducing your local setup with v1.0.20-pre.2 and found it not only sometimes got stuck but also leaked goroutines.

Please try v1.0.20-pre.3 instead. Those issues should be fixed.

arielzach commented 1 year ago

I've been running local tests, with DisableRetry=true, I still encounter the same issue. When I kill a node, the connection is lost. With DisableRetry=false, it works, but not perfectly. In a 4-node cluster, the traffic is 14k RPM on each node. When I kill a node, the client reconnects, but the traffic drops to 8k RPM per node. It should remain at 14k RPM.

rueian commented 1 year ago

Hi @arielzach,

I couldn't reproduce the traffic drop you mentioned with my local Redis cluster. Did all your commands map to the killed node? If all your goroutines are stuck sending commands to the same dead node, you could observe an RPM drop.

I have tested the v1.0.20-pre.3 many times with my local Redis cluster which had 3 masters and 3 slaves with the following program:

package main

import (
    "context"
    "fmt"
    "net"
    "sync/atomic"
    "time"

    "github.com/redis/rueidis"
)

func main() {
    client, err := rueidis.NewClient(rueidis.ClientOption{
        InitAddress:  []string{"127.0.0.1:7001"},
        DisableRetry: true,
        Dialer:       net.Dialer{Timeout: 5 * time.Second},
    })
    if err != nil {
        panic(err)
    }
    defer client.Close()

    resps := uint64(0)
    fails := uint64(0)

    for _, k := range []string{"1232", "5678", "7777"} { // 3 keys located on different nodes.
        go func(s string) {
            for {
                if err := client.Do(context.Background(), client.B().Get().Key(s).Build()).Error(); err == nil || rueidis.IsRedisNil(err) {
                    atomic.AddUint64(&resps, 1)
                } else {
                    atomic.AddUint64(&fails, 1)
                }
            }
        }(k)
    }

    for {
        time.Sleep(time.Second)
        fmt.Println(time.Now())
        fmt.Println("  resps", atomic.LoadUint64(&resps))
        fmt.Println("  fails", atomic.LoadUint64(&fails))
        atomic.StoreUint64(&resps, 0)
        atomic.StoreUint64(&fails, 0)
    }
}

My result showed that the program could maintain 40,000 RPS for each living node during failover.

Could you also provide a sample program for me to test with? In addition to that, the new v1.0.20-pre.4 is out, containing a fix to a parsing bug I found during the above experiment.

rueian commented 1 year ago

Update: The new v1.0.20-pre.5 further reduces temporary goroutines during nodes down.

arielzach commented 1 year ago

With the latest version (commit fa314529ddf840763da5231b882daf45e3c3104a, go.mod github.com/redis/rueidis v1.0.20-pre.4.0.20231014100953-fa314529ddf8), I couldn't get it to work correctly using my code. This is the code I'm using for testing:


package main

import (
    "context"
    "math/rand"
    "strings"
    "time"

    "github.com/redis/rueidis"
)

func main() {
    ctx := context.Background()
    client := getRueidisClientLocal()

    for {
        key := randString(10)
        _, err := client.Do(ctx, client.B().Get().Key(key).Build()).ToString()
        if err != nil {
            if err.Error() != "redis nil message" {
                println("ERROR:", err.Error())
                time.Sleep(time.Second)
            }
        }
    }
}

func getRueidisClientLocal() rueidis.Client {
    client, err := rueidis.NewClient(rueidis.ClientOption{
        InitAddress:      []string{"localhost:42000", "localhost:42001", "localhost:42002"},
        MaxFlushDelay:    0 * time.Microsecond,
        DisableCache:     true,
        ConnWriteTimeout: 2 * time.Second,
        DisableRetry:     true,
    })

    if err != nil {
        panic(err)
    }

    return client
}

func randString(n int) string {
    var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

    b := make([]rune, n)
    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }

    return strings.TrimSpace(string(b))
}

I am monitoring the traffic with RedisInsight.

I tested your code, and it works correctly. What would be the difference with mine?

rueian commented 1 year ago

Hi @arielzach,

As far as I can tell, your test program's single loop gets constantly stuck on keys belonging to the dead node. That's why you observe a total traffic drop when a node is down.

To avoid this, you can fire each request from separate goroutines.

arielzach commented 1 year ago

You are absolutely right, I changed my code and conducted local and AWS tests, everything worked perfectly, thank you.

nikhillimaje commented 3 weeks ago

Hey! I was testing the same on my local Redis cluster with 3 masters and 3 replicas, and the following configuration set in the cluster:

cluster-allow-reads-when-down: yes
cluster-require-full-coverage: no

In my local testing (same as above performing reads), I noticed that the rueidis client doesn't perform automatic failover when either a master or replica goes down. Following are the observations in different scenarios:

  1. SendToReplicas=false: Rueidis connects to the master initially and serves reads from the master. When the master goes down, it still attempts to connect to the same master node for subsequent requests and results in a connection error (even though the slave has become a master). Here, I was expecting it to connect to the replica (new master) instead. When I bring the old master node back up, rueidis gets the is_moved error and then connects to the latest master.

  2. SendToReplicas=true: Rueidis connects to the replica initially and serves reads from the replica. When the replica goes down, it still attempts to connect to the same replica and results in a connection error. It doesn't attempt to connect to the master node.

In either case, the reads never recover. Following is the code for testing:

package main

import (
    "context"
    "fmt"
    "net"
    "sync/atomic"
    "time"

    "github.com/redis/rueidis"
)

func main() {
    client, err := rueidis.NewClient(rueidis.ClientOption{
        InitAddress:      []string{"127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002", "127.0.0.1:7003", "127.0.0.1:7004", "127.0.0.1:7005"},
        DisableRetry:     true,
        ConnWriteTimeout: 200 * time.Millisecond,
        //SendToReplicas: func(cmd rueidis.Completed) bool {
        //  return cmd.IsReadOnly()
        //},
        //ReplicaOnly: true,
        Dialer: net.Dialer{
            Timeout: 1 * time.Millisecond,
        },
        ClusterOption: rueidis.ClusterOption{
            ShardsRefreshInterval: 10 * time.Millisecond,
        },
    })
    if err != nil {
        panic(err)
    }
    defer client.Close()

    qps := uint64(0)
    resps := uint64(0)
    fails := uint64(0)
    keys := [1]int{3}

    for _, i := range keys {
        key := fmt.Sprint("key", i)
        fmt.Println("key is ", key)
        for j := 0; j < 10; j++ {
            go func(s string) {
                for {
                    atomic.AddUint64(&qps, 1)
                    _, err := client.Do(context.Background(), client.B().Get().Key(s).Build()).ToString()
                    if err == nil {
                        atomic.AddUint64(&resps, 1)
                    } else {
                        atomic.AddUint64(&fails, 1)
                    }
                }
            }(key)
        }
    }

    for {
        time.Sleep(time.Second)
        fmt.Println(time.Now())
        fmt.Println("  qps", atomic.LoadUint64(&qps))
        fmt.Println("  resps", atomic.LoadUint64(&resps))
        fmt.Println("  fails", atomic.LoadUint64(&fails))
        fmt.Println()
        fmt.Println()
        atomic.StoreUint64(&qps, 0)
        atomic.StoreUint64(&resps, 0)
        atomic.StoreUint64(&fails, 0)
    }
}

rueian commented 3 weeks ago

Hi @nikhillimaje, thank you for the report. We have released v1.0.49-alpha.1, containing a fix of failover https://github.com/redis/rueidis/pull/660. Please give it a try.

Also, please be aware that this could be too short:

Dialer: net.Dialer{
    Timeout: 1 * time.Millisecond,
}

and you should also check whether the err is a redis nil response:

if err == nil || rueidis.IsRedisNil(err) {
    atomic.AddUint64(&resps, 1)
} else {
    atomic.AddUint64(&fails, 1)
}

nikhillimaje commented 3 weeks ago

Thanks for the quick fix @rueian. I tested it out locally - with SendToReplicas=false, the failover is automatic when the master goes down. With SendToReplicas=true (in a 1-replica cluster setup), if the replica goes down, the reads are not directed to the master. But I guess that is expected, as the callback explicitly says to send reads to the replica?

rueian commented 3 weeks ago

Thanks for the quick fix @rueian. I tested it out locally - with config SendToReplica=false, the failover is automatic when the master goes down. With SendToReplica=true (in 1 replica cluster setup), if the replica goes down, the reads are not directed to the master. But I guess that is expected as the commands explicitly says to send it to the replica?

Actually, no. It should fall back to using masters. Thank you for testing, please try v1.0.49-alpha.2 again.

nikhillimaje commented 3 weeks ago

Thank you for the new version, but it didn't work as expected i.e. it is not falling back to using masters. Also, with SendToReplica=false, the automatic failover doesn't work all the time when the master goes down.

rueian commented 3 weeks ago

Thank you for the new version, but it didn't work as expected i.e. it is not falling back to using masters. Also, with SendToReplica=false, the automatic failover doesn't work all the time when the master goes down.

Hi @nikhillimaje, it may take ~20 seconds for redis servers to update the node status. For masters, it may take even longer. Did you wait a bit for the failover when you tested?

Here is my test: it took 20 seconds to mark the replica as "fail" (00:30~00:50). https://github.com/user-attachments/assets/413b2e89-95a6-4e00-8068-4f3052c4f6fc

nikhillimaje commented 3 weeks ago

Hi @rueian, I'll try it today again and will update it here. Maybe it was a local problem with my machine. I also wanted to check if it is possible to route the reads to both master & replica unless SendToReplica is set explicitly to true.

rueian commented 3 weeks ago

You can make SendToReplicas return false randomly for read-only commands.

nikhillimaje commented 3 weeks ago

Hi @rueian , I just tried the above version and it is working as expected when the master or slave goes down (with SendToReplicas set either to true or false).

There is only one scenario where the automatic failover is not working - when I create a fresh cluster (3 masters & 3 replicas) and take down the master(s) in the first run (with SendToReplicas=false). In the same run, if I bring back the failed masters (which are now slaves) and then take down the other masters, it works. It's weird, but might be machine related?

rueian commented 3 weeks ago

take down the master(s)

Did you take down all the masters at the same time? You need at least two masters alive to failover a failed master to a replica. Replicas will not do failover by themselves.

nikhillimaje commented 3 weeks ago

Did you take down all the masters at the same time?

Nope. There was a gap of a few mins before taking down the other one.

rueian commented 3 weeks ago

then it works. It's weird

Could you explain what you expected?

nikhillimaje commented 3 weeks ago

Could you explain what you expected?

The automatic failover to the replica. Isn't that expected?

Again, this is only observed during the 1st master failure after creating a cluster.

rueian commented 3 weeks ago

Could you explain what you expected?

The automatic failover to the replica. Isn't that expected?

Again, this is only observed during the 1st master failure after creating a cluster.

I see. I have reproduced the issue with redis 7.2.4: redis responds with an empty slots array in the event of the first master failure.

(screenshot: CLUSTER SHARDS responding with an empty slots array)

Normally it should contain slot ranges like this:

(screenshot: CLUSTER SHARDS responding with normal slot ranges)

I think this looks more like a server bug. What version of redis did you use?

nikhillimaje commented 3 weeks ago

I'm using Redis 7.2.6.

rueian commented 3 weeks ago

I'm using the Redis 7.2.6.

Got it. Unfortunately, it was a server bug that was fixed recently in the latest versions of redis and valkey. See https://github.com/redis/rueidis/pull/664.

I made rueidis (v1.0.49-alpha.3) use the old CLUSTER SLOTS command for cluster topology if the cluster version is < 8.

nikhillimaje commented 3 weeks ago

Thanks a lot for fixing! I will test out the new version. Once verified, would it be possible to release the fixes under version v1.0.49?

nikhillimaje commented 3 weeks ago

Hi @rueian, I just tested the alpha.3 version. The failover is working as expected in the event of first master failure.

This also fixed the following issue that I was struggling to solve with alpha.2. We mostly use the DoMulti() command in our prod code, so I was testing with keys that belonged to each node in the local cluster (3 nodes & 0 replicas). When I took down a node, the entire DoMulti() call would fail with the error "the slot has no redis node". This is because the conn returned is nil for the key that belonged to the failed node - the following code in _pickMulti() specifically:

    for _, cmd := range multi {
        if cmd.Slot() == cmds.InitSlot {
            continue
        }
        p := c.pslots[cmd.Slot()]
        if p == nil {
            return nil, 0, false
        }
        count.m[p]++
    }

This issue is fixed with alpha.3 version. Two kills in one shot ;) Thanks a lot!

rueian commented 3 weeks ago

Hi @nikhillimaje, thank you for your help on testing. v1.0.49 is released.

nikhillimaje commented 2 weeks ago

Thanks @rueian!