redis / go-redis

Redis Go client
https://redis.uptrace.dev

Seeing lot of moved errors when Routebylatency is enabled in ClusterClient #3023

Open srikar-jilugu opened 3 weeks ago

srikar-jilugu commented 3 weeks ago

When we benchmarked our ElastiCache cluster (cluster mode enabled) with the RouteByLatency option enabled in go-redis v9.5.1, we saw an increase in the average response time of our Redis operations (Get and pipelined commands). When we tried to debug this issue and added some logging in the process, we saw a lot of MOVED errors that caused retries, which in turn increased overall latency.

https://github.com/redis/go-redis/blob/d43a9fa887d9284ba42fcd46d46e97c56b34e132/osscluster.go#L966

On further debugging, we observed that the slotClosestNode func returns a random node across all the shards when all the nodes are marked as failing.

https://github.com/redis/go-redis/blob/d43a9fa887d9284ba42fcd46d46e97c56b34e132/osscluster.go#L750

In our case, this situation (all nodes failing) happens frequently, which causes frequent MOVED errors.

Expected Behavior

Response time should not increase when RouteByLatency is enabled; if anything, it should decrease, and MOVED errors should be rare once the client's cluster state is up to date.

Current Behavior

Increase in MOVED errors, and hence an increase in the number of Get commands hitting the cluster (with the same traffic), in engine CPU utilisation on all the nodes, and in overall latency.

Possible Solution

When all the nodes are marked as failed, choosing a random node within the shard associated with the slot (even though those nodes are marked as failed) might work for this problem; this is what is done when RouteRandomly is enabled.

Steps to Reproduce

  1. ElastiCache cluster (we are using engine 7.1.0) with multiple shards and replicas for each (we used 2 shards with 3 nodes each)
  2. go-redis v9.5.1 with RouteByLatency enabled, throughput around 10-20k rpm with Get and pipelined Get (a minimal client setup sketch follows below)
  3. Multiple ECS tasks (we are using 10) spread across multiple availability zones
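A minimal sketch of the client setup used in the reproduction; the endpoint below is a placeholder, not our actual cluster:

package main

import (
    "context"

    "github.com/redis/go-redis/v9"
)

func main() {
    // go-redis v9.5.1 ClusterClient with latency-based routing, as in step 2.
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:          []string{"clustercfg.example.cache.amazonaws.com:6379"}, // placeholder endpoint
        RouteByLatency: true, // routes read-only commands to the closest node (implies ReadOnly)
    })
    defer rdb.Close()

    ctx := context.Background()

    // Single Get plus pipelined Gets, matching the benchmark traffic.
    _ = rdb.Get(ctx, "some-key").Err()

    pipe := rdb.Pipeline()
    pipe.Get(ctx, "key1")
    pipe.Get(ctx, "key2")
    _, _ = pipe.Exec(ctx)
}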

Possible Implementation

We made changes in the slotClosestNode func implementing the fix we described above, and it actually reduced the MOVED errors (and hence the response time) when we benchmarked again.

This is the fix we made in our fork.

    for _, n := range nodes {
        if n.Failing() {
            continue
        }
        if node == nil || n.Latency() < node.Latency() {
            node = n
        }
    }

    if node != nil {
        return node, nil
    }

    // If all nodes are failing - return a random node from the nodes corresponding to the slot
    randomNodes := rand.Perm(len(nodes))
    return nodes[randomNodes[0]], nil
srikar-jilugu commented 3 weeks ago

@vmihailenco, @monkey92t Can you please look into this issue?

monkey92t commented 3 weeks ago

I believe using nodes that are marked as failed is not a good idea. In most cases, such nodes are unusable, and attempting to use them will result in a predictable error.

monkey92t commented 3 weeks ago

In cluster mode, during a node failover, such situations may occur: the hash slots originally handled by Node A are transferred to Node B. However, the latest cluster state is not retrieved in real time and has some delay.

While redirecting commands to the new node might bring certain issues, such as network connections, it still ensures the execution of commands. If we use a crashed node, it might not yield any effective results, and returning an error directly might be better.

monkey92t commented 3 weeks ago

From the perspective of the Redis server, it expects many clients to perform similar operations; otherwise, the MOVE command would be meaningless. Redis does not guarantee a permanent relationship between nodes and hash slots. When adding or removing nodes, the relationship between hash slots and nodes may change. Redis uses the MOVE command to inform the client that there has been a change in the hash slots within the Redis server. In normal cluster mode, such situations rarely occur because, in the vast majority of cases, the nodes in a Redis cluster are stable, or replicas are used to avoid single-node failures.

srikar-jilugu commented 3 weeks ago

@monkey92t thanks for replying. We were seeing a lot of "all nodes failing" cases when we were benchmarking; this was due to a bug in v9.5.1 where nodes get marked as failing even on redis: nil errors (which is fixed in the latest release).

But in the situation where we were getting a lot of MOVED errors, we were not having any shard rebalancing at all (or node failovers); the only root cause was slotClosestNode picking up nodes from other shards when all nodes are marked as failing. This leads the client to reload its state (even though it hasn't changed) and then come back to a node in the current shard itself (an extra hop) and execute queries on it, the node that was not picked in the first go because it was failing.

State reloading can be avoided if we choose a random node in the current shard itself, and the process function would still avoid the node on the first retry if it is failing (in our case the node serves successfully on the retry, as the failing mark is temporary).

On the other hand, slotRandomNode (when RouteRandomly is enabled) already chooses a random node from the same slot even when all nodes are failing.

    randomNodes := rand.Perm(len(nodes))
    for _, idx := range randomNodes {
        if node := nodes[idx]; !node.Failing() {
            return node, nil
        }
    }
    return nodes[randomNodes[0]], nil
srikar-jilugu commented 3 weeks ago

Also, is it possible to add a config to ignore the failure marking of nodes (i.e. consider them while choosing the lowest-latency node)?

monkey92t commented 3 weeks ago

OK, I am trying to understand your point. During your testing, there were no changes to the Redis cluster structure, but the node responsible for a certain range of hash slots experienced a failure (or network issue). As a result, the node corresponding to those hash slots became inaccessible. When go-redis encounters an error while accessing that node, it marks the node as faulty and stops accessing it. However, since all nodes responsible for the hash slots in the cluster have failed, the commands are randomly sent to any available node. Because the cluster structure hasn't changed, the commands are then redirected (MOVE) back to the original faulty node, and go-redis attempts to send the commands to the faulty node again.

monkey92t commented 3 weeks ago

Your solution might be effective, but it will bring about greater side effects. If we force the use of a node that has already failed, it becomes pointless. The reason we mark a node as faulty is to avoid using it until it recovers. If the Redis cluster is performing a normal failover, with a new node taking over the failed node, it will be difficult to discover the new node information without using MOVE, which is the purpose of Redis-server's MOVE response to the client. Additionally, when the node experiences a network failure, accessing it again will worsen an already unhealthy network state.  Go-redis does not know what is happening with Redis-server, whether it is a simple failure or an ongoing failover, so it is difficult to make certain changes. Perhaps we should carefully consider a better solution.

monkey92t commented 3 weeks ago

@srikar-jilugu You can try setting only the ReadOnly parameter; it will select nodes only within the nodes responsible for the hash slot and will not randomly select other nodes.

func (c *clusterState) slotSlaveNode(slot int) (*clusterNode, error) {
    nodes := c.slotNodes(slot)
    switch len(nodes) {
    case 0:
        return c.nodes.Random()
    case 1:
        return nodes[0], nil
    case 2:
        if slave := nodes[1]; !slave.Failing() {
            return slave, nil
        }
        return nodes[0], nil
    default:
        var slave *clusterNode
        for i := 0; i < 10; i++ {
            n := rand.Intn(len(nodes)-1) + 1
            slave = nodes[n]
            if !slave.Failing() {
                return slave, nil
            }
        }

        // All slaves are loading - use master.
        return nodes[0], nil
    }
}

In the above code, only when no corresponding node is found for the hash slot (regardless of whether the nodes are healthy), will a random node be selected. In all other cases, the node corresponding to the hash slot will be returned.
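
For reference, a minimal client setup along those lines (ReadOnly only, with neither RouteByLatency nor RouteRandomly set); the endpoint below is only a placeholder:

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func main() {
    // With only ReadOnly set, replica reads go through slotSlaveNode, which
    // stays within the nodes that own the hash slot instead of picking a
    // random node cluster-wide.
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:    []string{"clustercfg.example.cache.amazonaws.com:6379"}, // placeholder endpoint
        ReadOnly: true,
    })
    defer rdb.Close()

    val, err := rdb.Get(context.Background(), "some-key").Result()
    fmt.Println(val, err)
}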

monkey92t commented 3 weeks ago

But doing so deviates from your expected setup; it won't look for nodes that are closer to itself, it will only randomly select a node.

srikar-jilugu commented 3 weeks ago

@monkey92t we are opting in for RouteByLatency so that requests are served from the closest nodes, which usually sit in the same AZ as our containers, reducing latency and network cost as much as possible. If we look at the code for slotRandomNode (i.e. when RouteRandomly is opted in), when all the nodes are failing it selects a random node from the corresponding hash slot itself; we want the same behaviour for slotClosestNode too.

srikar-jilugu commented 3 weeks ago

OK, I am trying to understand your point. During your testing, there were no changes to the Redis cluster structure, but the node responsible for a certain range of hash slots experienced a failure (or network issue)

We weren't experiencing any node failure; we had a lot of cache misses when we benchmarked using v9.5.1, which resulted in nodes getting marked as failing.

func (c *ClusterClient) pipelineReadCmds(
    ctx context.Context,
    node *clusterNode,
    rd *proto.Reader,
    cmds []Cmder,
    failedCmds *cmdsMap,
) error {
    for i, cmd := range cmds {
        err := cmd.readReply(rd)
        cmd.SetErr(err)

        if err == nil {
            continue
        }

        if c.checkMovedErr(ctx, cmd, err, failedCmds) {
            continue
        }
        // here we are getting redis.Nil errors because of cache misses (but this is fixed in v9.5.3)
        if c.opt.ReadOnly {
            node.MarkAsFailing()
        }
        // ... (rest of the function omitted)
srikar-jilugu commented 2 weeks ago

The reason we mark a node as faulty is to avoid using it until it recovers.

@monkey92t I agree that using a bad node can cause an issue, but the library marks nodes as failing whenever there is a bad connection issue, which can be caused by even a single intermittent context timeout (and deadline exceeded error), leading to frequent failure markings. (Possibly the logic for when a node should be marked as failing needs to change?)

Also, is it possible to add a config to ignore the failure marking of nodes (i.e. consider them while choosing the lowest-latency node)?

Can you let me know if this can be done, e.g. a configurable option for RouteByLatency (or ReadOnly) so that we can ignore temporary failure markings like these?

srikar-jilugu commented 2 weeks ago

However, since all nodes responsible for the hash slots in the cluster have failed, the commands are randomly sent to any available node.

We aren't doing this in any of the other node selection functions: slotRandomNode, slotSlaveNode, slotMasterNode. In all these methods we choose either the master or a random node within the same slot when all nodes are failing.

monkey92t commented 2 weeks ago

I've recently been reflecting on the request flow in redis-cluster. It seems that we shouldn't randomly select another node to execute commands when all the nodes corresponding to a hash slot are down.

Similar to the ReadOnly parameter, when the nodes are offline, we still choose the node corresponding to the hash slot, rather than randomly selecting another node.

monkey92t commented 2 weeks ago

@vmihailenco Do you understand why we choose a random node?

srikar-jilugu commented 2 weeks ago

@monkey92t @vmihailenco Is there a reason why RouteByLatency and RouteRandomly include the master node during selection? Would it be possible to have node selection among the replicas only? (I think slotSlaveNode already covers this for the random selection case, but would the same be possible for latency-based selection?)

Use case: we want to use RouteByLatency for our write-heavy clients too, but the read requests should only be served by slaves.

srikar-jilugu commented 1 week ago

@monkey92t Any update on this issue? If we are on board with the decision that random selection should not be done outside the hash slot, I will raise a PR for the same.

monkey92t commented 1 week ago

@vmihailenco @ofekshenawa view?

srikar-jilugu commented 1 week ago

@monkey92t @vmihailenco @ofekshenawa Are there any hindrances with this approach? Would love to get your feedback.

monkey92t commented 6 days ago

I am in favor of canceling the random selection of nodes because there seems to be no clear evidence that a migration has occurred in the Redis cluster; it is merely that the nodes are unavailable or there is a network failure. However, I don't understand why the initial implementation required randomly selecting from all nodes. We should use a default node, such as nodes[0], when all nodes responsible for the hash slot are down. 

srikar-jilugu commented 5 days ago

@monkey92t choosing a default node could overwhelm the master node (nodes[0]) in case of temporary failures; instead, could we choose the node with the lowest latency when all nodes are failing? wdyt?
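
A rough sketch of that fallback, using stand-in types rather than go-redis's internal clusterNode (the real change would live inside slotClosestNode):

package main

import (
    "fmt"
    "time"
)

// slotNode is a stand-in for go-redis's internal clusterNode.
type slotNode struct {
    addr    string
    failing bool
    latency time.Duration
}

// closestInSlot prefers healthy nodes, but when every node owning the slot is
// marked as failing it still returns the lowest-latency one within the slot
// instead of a cluster-wide random node (which triggers MOVED).
func closestInSlot(nodes []*slotNode) *slotNode {
    var best *slotNode
    for _, n := range nodes {
        if n.failing {
            continue
        }
        if best == nil || n.latency < best.latency {
            best = n
        }
    }
    if best != nil {
        return best
    }
    // All nodes failing: fall back to the lowest-latency node in the slot.
    for _, n := range nodes {
        if best == nil || n.latency < best.latency {
            best = n
        }
    }
    return best
}

func main() {
    nodes := []*slotNode{
        {addr: "replica-a:6379", failing: true, latency: 800 * time.Microsecond},
        {addr: "replica-b:6379", failing: true, latency: 300 * time.Microsecond},
    }
    fmt.Println(closestInSlot(nodes).addr) // replica-b:6379
}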

srikar-jilugu commented 20 hours ago

@monkey92t It looks like the other authors haven't responded yet. It would be greatly appreciated if you could take ownership of this issue and review the PR that I can raise. We currently depend on this fix for new improvements in our Redis ecosystem.

monkey92t commented 20 hours ago

@monkey92t It looks like the other authors haven't responded yet. It would be greatly appreciated if you could take ownership of this issue and review the PR that I can raise. We currently depend on this fix for new improvements in our Redis ecosystem.

OK, your PR is welcome. Although the other maintainers haven't responded, we can proceed with the changes.

monkey92t commented 20 hours ago

If no usable nodes are available, we can randomly choose a node responsible for that hash slot. If a fixed node is selected, go-redis might attempt TCP connections, which could lead to larger failures if there are too many attempts.

monkey92t commented 19 hours ago

A better solution might be to use goroutines to continuously ping the lost nodes and mark them as RUNNING when access is restored. If there are any access requests during this period, directly return an error such as 'ERR: no available nodes'.
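
Roughly sketched, with a hypothetical wrapper type; nothing like this exists in go-redis today:

package main

import (
    "context"
    "sync/atomic"
    "time"

    "github.com/redis/go-redis/v9"
)

// probedNode is a hypothetical wrapper: while the node is marked as failing,
// a single background goroutine keeps pinging it instead of letting every
// request retry the broken connection.
type probedNode struct {
    client  *redis.Client
    failing atomic.Bool
}

// probeUntilHealthy pings the node once per second and clears the failing
// mark as soon as a ping succeeds. Requests arriving in the meantime could be
// rejected with an "ERR: no available nodes"-style error.
func (n *probedNode) probeUntilHealthy(ctx context.Context) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if n.client.Ping(ctx).Err() == nil {
                n.failing.Store(false) // back to RUNNING
                return
            }
        }
    }
}

func main() {
    node := &probedNode{client: redis.NewClient(&redis.Options{Addr: "localhost:6379"})}
    node.failing.Store(true)
    go node.probeUntilHealthy(context.Background())
    time.Sleep(3 * time.Second) // give the probe a chance to run in this demo
}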

srikar-jilugu commented 37 minutes ago

A better solution might be to use goroutines to continuously ping the lost nodes and mark them as RUNNING when access is restored. If there are any access requests during this period, directly return an error such as 'ERR: no available nodes'.

I think we mark the nodes as failing only for a temporary duration; if the node is able to serve requests again, it will not be marked as failing further. Correct me if I am wrong.

monkey92t commented 20 minutes ago

Yes, your idea is not wrong, but it carries a significant risk. If node A goes down and many subsequent requests still use node A, it will send many TCP handshakes or Redis requests, further worsening the situation. If we use a separate goroutine to continuously ping, it will have only one request working instead of a large number of requests.

srikar-jilugu commented 3 minutes ago

Yes, your idea is not wrong, but it carries a significant risk. If node A goes down and many subsequent requests still use node A, it will send many TCP handshakes or Redis requests, further worsening the situation. If we use a separate goroutine to continuously ping, it will have only one request working instead of a large number of requests.

Makes sense. We can take aid from the updateLatency() function, which runs periodically (I think every 10 seconds):


func (n *clusterNode) updateLatency() {
    const numProbe = 10
    var dur uint64

    successes := 0
    for i := 0; i < numProbe; i++ {
        time.Sleep(time.Duration(10+rand.Intn(10)) * time.Millisecond)

        start := time.Now()
        err := n.Client.Ping(context.TODO()).Err()
        if err == nil {
            dur += uint64(time.Since(start) / time.Microsecond)
            successes++
        }
    }

    var latency float64
    if successes == 0 {
        // If none of the pings worked, set latency to some arbitrarily high value so this node gets
        // least priority.
        latency = float64((1 * time.Minute) / time.Microsecond)
    } else {
        latency = float64(dur) / float64(successes)
    }
    atomic.StoreUint32(&n.latency, uint32(latency+0.5))
}

Let's say a node is genuinely down: this func will constantly report a high latency, which can be used to ignore such nodes when choosing a node while all the nodes are marked as failing. A node that is only temporarily down will eventually respond to a ping and update its latency info with a lower value.
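
A sketch of how that could be used, relying on the fact that updateLatency stores roughly one minute as the latency when every probe ping fails; the types here are stand-ins, not go-redis internals:

package main

import (
    "fmt"
    "time"
)

// downLatency mirrors the sentinel that updateLatency assigns when none of
// the probe pings succeed (1 minute, stored in microseconds in the real code).
const downLatency = time.Minute

// candidate is a stand-in for go-redis's internal clusterNode.
type candidate struct {
    addr    string
    latency time.Duration
}

// pickDespiteFailing chooses among slot nodes that are all marked as failing:
// prefer a node whose measured latency is below the "all pings failed"
// sentinel, i.e. one that is probably only temporarily marked.
func pickDespiteFailing(nodes []*candidate) *candidate {
    var best *candidate
    for _, n := range nodes {
        if n.latency >= downLatency {
            continue // the last probe round could not reach this node at all
        }
        if best == nil || n.latency < best.latency {
            best = n
        }
    }
    if best == nil && len(nodes) > 0 {
        best = nodes[0] // everything looks genuinely down; any choice is equally bad
    }
    return best
}

func main() {
    nodes := []*candidate{
        {addr: "node-a:6379", latency: downLatency},
        {addr: "node-b:6379", latency: 450 * time.Microsecond},
    }
    fmt.Println(pickDespiteFailing(nodes).addr) // node-b:6379
}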