jcstanaway opened this issue 6 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Keep alive. This issue needs to be resolved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Keep alive. This issue needs to be resolved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Keep alive. This issue still needs to be resolved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.
freshbot
Hey @ccs018, as far as I know, ioredis only refreshes the cluster nodes list when there is a slot change.
Can you try to change one of the slots and see if ioredis reconnects to that slave?
You are correct that we should add a refresh interval for the cluster nodes list; I'll see if I can hook something up.
@shaharmor, thanks for the response. I could try a slot change, but I believe that is also not detected immediately; it is only detected when an operation results in a MOVED response.
Note that I've raised a couple of other related issues regarding how ioredis manages cluster connections. In this particular issue, there is no real change in the cluster topology. Rather, one of the nodes failed and ioredis decided never to attempt to reconnect to that node. If you query any node in the cluster, the temporarily failed node was never removed from the cluster, yet ioredis decides that it is gone forever. In this scenario, ioredis should not unilaterally remove the node from its view of the cluster, but should continue retrying the connection to that node. As I noted above, the error may be related to the following debug output:

```
ioredis:connection skip reconnecting because `retryStrategy` is not a function +0m
```

yet in my configuration of ioredis, `retryStrategy` IS a function. I suspect that there is an error in handling the options passed to `Cluster()`.
Can you share how you are calling the `Redis.Cluster` constructor?
I could, but it's gotten rather complex as I've had to work around the various issues I've logged, and I no longer have the original code. Between this issue, the unhandled case of trying to connect to the cluster before the cluster has actually formed (a startup race condition that leaves the ioredis client hung), and the lack of proactive handling of actual changes to the cluster topology, I've written a lot of logic to detect and recover from these conditions.
I'll try to find some time to write a sample client, but it should be rather easy to reproduce. Just set up a cluster, connect a client and then stop one of the nodes. Wait a minute and then restart the node. You'll see that ioredis does not automatically reconnect to that node.
```js
startupNodes = [
  { host: '127.0.0.1', port: 6400 },
  { host: '127.0.0.1', port: 6410 },
  { host: '127.0.0.1', port: 6420 }
]
```

Other nodes in the cluster are 127.0.0.1 ports 6401, 6411 and 6421, discovered dynamically. Non-default options:

```js
{
  enableOfflineQueue: false,
  scaleReads: true,
  redisOptions: {
    connectTimeout: 2000,
    autoResendUnfulfilledCommands: false,
    reconnectOnError: function (err) {
      let targetError = 'READONLY';
      if (err.message.slice(0, targetError.length) === targetError) {
        // Only reconnect when the error starts with "READONLY".
        return 2;
      }
    },
    readOnly: true
  }
}
```
I don't recall if I was initially using the built-in for `clusterRetryStrategy` and/or `redisOptions.retryStrategy`, but you can look at #586 for a related issue.
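A minimal reproduction along the lines described above would look roughly like this (a sketch, not the original test code; the ports match the setup above):

```js
const Redis = require('ioredis');

const cluster = new Redis.Cluster([
  { host: '127.0.0.1', port: 6400 },
  { host: '127.0.0.1', port: 6410 },
  { host: '127.0.0.1', port: 6420 }
]);

// Log pool changes; '-node' fires when ioredis drops a node.
cluster.on('+node', (node) => console.log('added', node.options.port));
cluster.on('-node', (node) => console.log('removed', node.options.port));

// Stop one slave, wait a minute, restart it. The count drops to 5 and,
// even after the node rejoins the cluster, never returns to 6.
setInterval(() => console.log('known nodes:', cluster.nodes().length), 10000);
```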
@ccs018 Hi, sorry for the really late response.
According to the official documentation (https://redis.io/topics/cluster-spec#clients-first-connection-and-handling-of-redirections), ioredis refreshes the slots only when a `MOVED` error is received. That means that even when a node is down temporarily and is not removed from the cluster, ioredis will disconnect from it and won't reconnect.
I'm wondering whether this is really a problem, since once a command belonging to that node is sent, ioredis will receive a `MOVED` error, at which point it refreshes the slots information and reconnects to the lost node.
MOVED is only applicable if slots were moved from one shard to another. There are many scenarios where the current implementation causes problems because a MOVED error is never received.
Several scenarios come to mind:
1) Read-only slaves. If I want to scale read performance and enable read operations on slave nodes, then when a slave node restarts (or, more likely, the machine the slave node is on restarts), the slave node is removed by ioredis and never reconnected, impacting my read performance.
2) High availability. A slave node restarts and ioredis removes it from its node list. At some point later, the master goes down and the slave node that previously restarted is elected master. The cluster slot configuration hasn't changed, so no MOVED error will be received. The client will never be able to access that shard.
3) Rolling upgrades. A slave is taken down to be upgraded and ioredis removes that node from its list. The slave is restarted, then the master is taken down to be upgraded and the previously upgraded slave is elected master. This is basically the same as item 2 above.
So I think the discussion has gotten off the topic of the original issue. I did a bit more digging and found that in cluster mode, if a node disconnects (e.g., the redis server restarts), there is explicitly no attempt to reconnect to that node, even if the client specified a `retryStrategy()` in `options.redisOptions`. Specifically, in `connection_pool.js`:
```js
redis = new Redis(_.defaults({
  // Never try to reconnect when a node is lose,
  // instead, waiting for a `MOVED` error and
  // fetch the slots again.
  retryStrategy: null,
```
I believe this is wrong. The connection could be lost due to a simple node restart (machine restart, redis server upgrade, etc.). In these scenarios, there would never be a `MOVED` error, and so we never reconnect to this node.
Further, the code overrides the client-specified options and forces offline queuing to be enabled:
```js
  // Offline queue should be enabled so that
  // we don't need to wait for the `ready` event
  // before sending commands to the node.
  enableOfflineQueue: true,
```
It's not clear why this needs to be overridden. In my use cases, because I am using redis as a true cache, I have special handling for when operations to update the cache can't complete. I may have other clients also writing to the cache, and if those commands start to get queued up and are later dequeued and executed, they could be executed out of order and leave the cache with the wrong data.
The fact that these two options are being overridden is not documented. I'd like to see these overrides removed.
@ccs018 The use cases you described won't be affected by setting `enableOfflineQueue` to `true`, since the offline queue only works while the connection is live or a retry attempt is happening. When a node is lost, ioredis closes the connection immediately because `retryStrategy` is null, so no more commands will be added to the offline queue.
The comment about `enableOfflineQueue` is more of a secondary issue. The primary problem is that `retryStrategy` is being set to null, so ioredis will not attempt to reconnect to a node when that node simply restarts. That needs to change. Assuming that there will eventually be a `MOVED` error is an invalid assumption.
Has there been a suitable workaround to this issue, or is a solution in progress? I am currently running into this issue, which has caused our custom `scaleReads` handler to drop nodes and only return the master. Since all of our reads must go to read replicas, this is not workable.
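For context, our handler is roughly shaped like this (simplified and illustrative, not our production code; it assumes the master is the first entry in the node list ioredis passes to a custom `scaleReads` function, and the seed nodes below are placeholders):

```js
const Redis = require('ioredis');

// Hypothetical seed nodes, for illustration only.
const startupNodes = [{ host: '127.0.0.1', port: 6379 }];

const cluster = new Redis.Cluster(startupNodes, {
  scaleReads(nodes, command) {
    // Assumption: nodes[0] is the master, the remaining entries are replicas.
    const replicas = nodes.slice(1);
    if (replicas.length === 0) {
      // Because ioredis never re-adds a restarted replica, we keep landing
      // here and every read ends up on the master.
      return nodes[0];
    }
    // Spread reads across whatever replicas are still known.
    return replicas[Math.floor(Math.random() * replicas.length)];
  }
});
```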
Is there any solution for a server that is connected but becomes unreachable? I am facing the same issue with Redis Sentinel: when the connected master becomes unreachable, the client never retries the connection. Is PR https://github.com/luin/ioredis/pull/658 the solution for this scenario?
We're seeing this issue when a node in the cluster is restarted without any changes to the slot distribution. Does anyone have a working solution?
Same issue here when a replica node restarts. redis-cli shows the node as disconnected and then, a few seconds later, as connected again, but ioredis never tries to fetch the nodes again, only the slots. That looks like a bug to me.
We have major problems with redis not reconnecting when Redis Labs does maintenance on our cluster; I believe this is the underlying issue. Why hasn't any progress been made in 4 years?
You may have to start a new issue @casret
This issue seems to be directly related to one I have recently filed: https://github.com/luin/ioredis/issues/1732
I have discovered a hack that can mitigate the issues caused by the current flaws in redis cluster topology updates: https://github.com/luin/ioredis/issues/1732#issuecomment-1476297274 Hope it saves you some time.
I'm running a simple test with 3 masters + 1 slave per master. If I take down one of the slaves and restart it, ioredis fails to reconnect to that node.
Note, however, that `retryStrategy` is a function. I've reproduced this failure 1) with no options supplied to `new Cluster()`, 2) with options, but without setting `options.redisOptions.retryStrategy`, and 3) with options, setting `options.redisOptions.retryStrategy` to `function() { return 3000; }`.
Even long after the node has been restarted and has rejoined the cluster (verified with redis-cli `cluster slots` and `cluster nodes`), ioredis continues to report `cluster.nodes().length` as 5.
So, the only way to recover from this very basic failure mode is for the application to monitor the '-node' event and destroy and re-create the client. This is a bad defect.
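Roughly, that workaround looks like this (only a sketch; `createClusterClient()` is a placeholder for however you build your configured `Redis.Cluster` instance, and the delay is arbitrary):

```js
let cluster = createClusterClient();
watch(cluster);

function watch(client) {
  client.once('-node', (node) => {
    console.warn('lost node %s:%d, recreating client', node.options.host, node.options.port);
    // Tear the old client down and build a fresh one, which rediscovers the
    // topology on connect. In practice we wait (or retry) so that the rebuild
    // happens after the lost node is reachable again.
    setTimeout(() => {
      client.disconnect();
      cluster = createClusterClient();
      watch(cluster);
    }, 60 * 1000);
  });
}
```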
I would have hoped that ioredis periodically monitored the cluster configuration by issuing a CLUSTER SLOTS command. Without that, it is clearly impossible for ioredis to detect whether new slaves are ever added to the cluster in an expansion scenario.
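Something along these lines at the application level is what I had in mind (only a sketch; the probe connection, port and interval are arbitrary, and `probe.cluster('slots')` simply issues a CLUSTER SLOTS command):

```js
const Redis = require('ioredis');

// Separate plain connection to one known node, used only for topology checks.
const probe = new Redis(6400, '127.0.0.1');

setInterval(async () => {
  // CLUSTER SLOTS returns one entry per slot range:
  // [startSlot, endSlot, master, ...replicas], each node being [ip, port, id].
  const slots = await probe.cluster('slots');
  const advertised = new Set();
  for (const range of slots) {
    for (const node of range.slice(2)) {
      advertised.add(`${node[0]}:${node[1]}`);
    }
  }
  // Compare this set against the Cluster client's own cluster.nodes() list;
  // a mismatch means the client has lost a node (or never discovered a new one).
  console.log('cluster currently advertises', advertised.size, 'nodes');
}, 30000);
```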