Cluster: Fail to reconnect to node

jcstanaway commented 6 years ago

I'm running a simple test with 3 masters + 1 slave per master. If I take down one of the slaves and restart it, ioredis fails to reconnect to that node.

  ioredis:redis status[127.0.0.1:6421]: ready -> close +7s
  ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms
  ioredis:redis status[127.0.0.1:6421]: close -> end +0ms
Received -node: "127.0.0.1:6421"

Note, however, that retryStrategy is a function. I've repeated this failure with 1) no options supplied new Cluster, 2) with options, but not setting options.redisOptions.retryFunction and 3) with option, but setting options.redisOptions.retryFunction to function() { return 3000; }.

Even long after the node has been restarted and rejoined the cluster (verified with redis-cli cluster slots and cluster nodes), ioredis continues to report the cluster.nodes().length = 5.

So, the only way to recover from this very basic failure mode is for the application to monitor the '-node' event and destroy and re-create the client. This is a bad defect.

I would have hoped that ioredis was periodically monitoring the cluster configuration by periodically performing a cluster slots command, it is clearly impossible for ioredis to detect if new slaves are ever added to the cluster in an expansion scenario.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

jcstanaway commented 6 years ago

Keep alive. This issue needs to be resolved.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

jcstanaway commented 6 years ago

Keep alive. This issues needs to be resolved.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

jcstanaway commented 6 years ago

Keep alive. This issue still needs to be resolved.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

jcstanaway commented 6 years ago

freshbot

shaharmor commented 6 years ago

Hey @ccs018, as far as I know, ioredis only refreshes the cluster nodes list when there is a slot change.

Can you try to change one of the slots and see if ioredis reconnects to that slave?

You are correct that we should add a refresh interval for the cluster nodes list, I'll see if I can hook something up

jcstanaway commented 6 years ago

@shaharmor , thanks for the response. I could try a slot change, but I believe this also is not immediately detected, but only if an operation results in a MOVED response.

Note that I've raised a couple other related issues regarding how ioredis manages cluster connections. In this particular issue, there is no real change in the cluster topology. Rather one of the nodes failed and ioredis decides to never attempt to reconnect to that node. If you query any node in cluster, that temporarily failed node was never removed from the cluster, yet ioredis decides that it is gone forever. In this scenario, ioredis should not unilaterally remove the node from its view of the cluster, but should continue to retry re-connecting to that node. As I noted above error may be related to the following debug output: ioredis:connection skip reconnecting because `retryStrategy` is not a function +0m yet in my configuration of ioredis, retryStrategy IS a function. I suspect that there is an error in handling the options passed to Cluster().

shaharmor commented 6 years ago

Can you share how you are calling the Redis.Cluster constructor?

jcstanaway commented 6 years ago

I could, but it's gotten rather complex as I've had to work around the various issues I've logged and no longer have the original code. Between this issue and not handling the exception case of trying to connect to the cluster before the cluster is actually formed (startup race condition and the ioredis client hangs) and not proactively handling actual changes to the cluster topology, I've written a lot of logic to detect and recover from these conditions.

I'll try to find some time to write a sample client, but it should be rather easy to reproduce. Just set up a cluster, connect a client and then stop one of the nodes. Wait a minute and then restart the node. You'll see that ioredis does not automatically reconnect to that node.

startupNodes = [{host:127.0.0.1, port:6400},{host:127.0.0.1, port:6410},{host:127.0.0.1, port:6420}] Other nodes in the cluster are 127.0.0.1 ports 6401, 6411 and 6421 and discovered dynamically. Non-default options:

enableOfflineQueue: false
scaleReads: true
redisOptions.connectTimeout: 2000
redisOptions.autoResendUnfulfilledCommands: false
redisOptions.reconnectOnError = function(err) {
    let targetError = 'READONLY';
    if (err.message.slice(0, targetError.length) === targetError) {
        // Only reconnect when the error starts with "READONLY".
        return 2;
    }
};
redisOptions.readOnly: true

I don't recall if I was initially using the builtin for clusterRetryStrategy and/or redisOptions.retryStrategy but you can look at #586 for a related issue.

luin commented 6 years ago

@ccs018 Hi, sorry for the really late response.

According to the official documentation https://redis.io/topics/cluster-spec#clients-first-connection-and-handling-of-redirections, ioredis will refresh the slots only when a MOVED error is received. That means even when a node is down temporarily and it's not removed from the cluster, ioredis will disconnect it and won't reconnect.

I'm wondering whether this should be a problem since once sending a command that belong to this node, ioredis will receive a MOVED error, at which time ioredis refresh the slots information and connect to the lost node.

jcstanaway commented 6 years ago

MOVED is only applicable if slots were moved from one shard to another. There are many scenarios where there's an issue with the current implementation where there is never a MOVED error.

Several scenarios come to mind:

1) read-only slaves. If I want to scale my read performance and enable read operations on slave nodes, then if a slave node restarts (or more like the machine the slave node is on restarts), the slave node is removed by ioredis and never reconnects thus impacting my read performance.

2) high availability: A slave node restarts and ioredis removes it from the cluster list. At some point later, the master goes down. The slave node which previously restart is elected master. The cluster slot configuration hasn't changed, so no MOVED error will be received. The client will never be able to access that shard.

3) rolling upgrades. A slave is taken down to be upgraded. ioredis removes that node from the list. That slave is restarted. Next the master is taken down to be upgraded. The previously upgraded slave is elected as master. This is basically the same as item 2 above.

jcstanaway commented 6 years ago

So I think the discussion has gotten off topic of the original issue. I did a bit more digging and found that in cluster mode, if a node disconnects (e.g., the redis server restarts), that there is explicitly no attempt to reconnect to that node - even if the client specified a retryStrategy() in options.redisOptions. Specifically in connection_pool.js:

    redis = new Redis(_.defaults({
      // Never try to reconnect when a node is lose,
      // instead, waiting for a `MOVED` error and
      // fetch the slots again.
      retryStrategy: null,

I believe this is wrong. The connection could be lost due do a simple node restart (machine restart, redis server upgrade, etc). In these scenarios, there would never be a MOVED error and so we never reconnect to this node.

Further, the code overrides the client specified options and forces offline queuing to be enabled.

      // Offline queue should be enabled so that
      // we don't need to wait for the `ready` event
      // before sending commands to the node.
      enableOfflineQueue: true,

It's not clear why this needs to overridden. In my use cases, because I am using redis as a true cache, I have special handling when operations to update the cache can't complete. I may have other clients also writing to the cache and if those commands start to get queued up and then later dequeued and executed, those commands could get executed out-of-order and the cache has the wrong data.

The fact that these two options are being overridden is not documented. I'd like to see these overrides removed.

luin commented 6 years ago

@ccs018 The use cases you described won't be affected by setting enableOfflineQueue to true since offline queue only works when the connection is live or a retry attempt is happening. When a node is lost, ioredis will close the connection immediately because retryStrategy is null, so no more command will be added into the offline queue.

jcstanaway commented 6 years ago

The comment about enableOfflineQueue is more of a secondary issue. The primary problem is that retryStrategy is being set to null so that ioredis will not attempt to reconnect to that node when that node simply restarts. That needs to change. Assuming that there will eventually be a MOVED error is an invalid assumption.

ischer commented 4 years ago

Has there been a suitable workaround to this issue, or is a solution in progress? I am currently running into this issue, which has caused our custom scaleReads handler to drop nodes and only return the master. Since all of our reads must go to read replicas, this is not workable.

ajinkyarajput commented 4 years ago

Is there any solution we found on unreachable connected server? I am facing same issue with Redis Sentinels. When connected master become unreachable, it never retry to reconnects. Is this PR https://github.com/luin/ioredis/pull/658 is solution of this scenario?

headlessme commented 2 years ago

We're seeing this issue when a node in the cluster is restarted without any changes to the slot distribution, does anyone have a working solution?

menocomp commented 2 years ago

Same issue here when a replica node restart. redis-cli shows that the nodes is diconnected then after seconds as connected, however ioredis is not trying to get the nodes again, only the slots. That looks like a bug for me.

casret commented 2 years ago

We have major problems with redis not reconnecting when redis labs does maintanence on our cluster, I believe this is the underlying issue. Why hasn't any progress been made in 4 years?

xtianus79 commented 2 years ago

You may have to start a new issue @casret

roma-glushko commented 1 year ago

This issue seems to be directly related to one I have recently filled: https://github.com/luin/ioredis/issues/1732

roma-glushko commented 1 year ago

I have discovered a hack that could mitigate issues related to the current flaws in redis cluster topology updates: https://github.com/luin/ioredis/issues/1732#issuecomment-1476297274 Hope it will save your some time.

redis / ioredis

Cluster: Fail to reconnect to node #587