Closed stanley115 closed 4 years ago
Any update on this issue? I am using ioredis and facing the same issue as well.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.
I'm encountering this same issue when testing an elasticache failover/upgrade. (Elasticache redis 5.0.0, ioredis 4.14.0)
My client is running a mix of individual commands and `multi` command blocks. When an ElastiCache resource upgrade occurs, the connection is closed and reestablished. However, the ioredis command queue falls out of sync, and I begin seeing individual Redis commands being resolved with the reply of a different command.
My debugging so far has shown that during the connection reset, commands sent inside of a multi context are getting put individually on the offline queue, which on its own doesn't seem right.
Just before a failover takes place, I send a multi block that looks like this:
multi
del
hset
expire
setex
exec
All of these commands get queued up by ioredis, but then I observe the following:
1. `multi` is sent.
2. Redis returns a `READONLY` error in response to `del`, due to the ongoing ElastiCache resource upgrade.
3. The `del` command is put on the offline queue.
4. `del` is sent from the offline queue.
5. Redis replies `'1'` (in response to `del`). We were expecting `QUEUED` here, but the multi context was lost in the reconnect.
6. The `hset` command is resolved with the reply value `'1'`. The command queue is now out of sync.
7. The same then happens to the `expire` command, and so on.

It seems that if a connection reset occurs while a multi block is still in flight, we might need to either flush or re-send any multi'd commands that might still be in the queue. I'll try to run this experiment in a fork and see if it resolves this issue.
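To make the lost `QUEUED` replies concrete, here is a toy model of the sequence above. It is not ioredis or Redis code: `FakeConnection`, `fakeResult`, and `replay` are made-up names, and the replies are placeholders. The only real behaviour it relies on is that Redis transaction state belongs to the connection, so a new connection knows nothing about an earlier `multi`.

```javascript
// Placeholder replies for commands executed outside a transaction (illustrative only).
const fakeResult = (name) => (name === "setex" ? "OK" : "1");

// Toy stand-in for one server-side connection; transaction state lives here.
class FakeConnection {
  constructor() {
    this.inMulti = false;
    this.queued = [];
  }

  send(name) {
    if (name === "multi") {
      this.inMulti = true;
      return "OK";
    }
    if (name === "exec") {
      if (!this.inMulti) return "ERR EXEC without MULTI";
      const results = this.queued.map((queuedName) => fakeResult(queuedName));
      this.inMulti = false;
      this.queued = [];
      return results;
    }
    if (this.inMulti) {
      this.queued.push(name);
      return "QUEUED";
    }
    return fakeResult(name); // executed immediately, outside any transaction
  }
}

// Send the commands in order, swapping in a fresh connection before index
// `reconnectAfter` to mimic a reconnect (pass -1 for no reconnect).
function replay(commands, reconnectAfter) {
  let conn = new FakeConnection();
  return commands.map((name, i) => {
    if (i === reconnectAfter) conn = new FakeConnection();
    return { name, reply: conn.send(name) };
  });
}

const block = ["multi", "del", "hset", "expire", "setex", "exec"];
const healthy = replay(block, -1); // every command inside the block gets "QUEUED"
const broken = replay(block, 1);   // reconnect right after `multi`

console.log(healthy.map((r) => r.reply));
console.log(broken.map((r) => r.reply));
```

In the `broken` run, `del` gets an immediate result instead of `QUEUED`, which matches the symptom described in the steps above. How ioredis then maps those replies onto its pending commands is not modelled here.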
To start, I think perhaps transacted commands found in `prevCommandQueue` or `offlineQueue` should be rejected. A quick experiment trying this appears to resolve the issue I'm facing.
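A rough sketch of that idea, under loud assumptions: the queue is modelled as a plain array, and the entry shape (`name`, `inTransaction`, `reject`) and the helper `rejectTransacted` are hypothetical, not the actual ioredis internals or the eventual fix.

```javascript
// Reject every queued entry that belongs to a transaction and keep the rest.
// Entries are hypothetical objects of the form { name, inTransaction, reject }.
function rejectTransacted(queue, error) {
  const kept = [];
  for (const item of queue) {
    if (item.inTransaction) {
      // Fail fast rather than replaying the command outside its MULTI block.
      item.reject(error);
    } else {
      kept.push(item);
    }
  }
  return kept;
}

// Usage with a stand-in for an offline queue captured at reconnect time.
const rejected = [];
const makeItem = (name, inTransaction) => ({
  name,
  inTransaction,
  reject: (err) => rejected.push({ name, message: err.message }),
});

const queueAtReconnect = [
  makeItem("get", false),
  makeItem("del", true),
  makeItem("hset", true),
];

const remaining = rejectTransacted(
  queueAtReconnect,
  new Error("Connection lost during transaction")
);

console.log(remaining.map((item) => item.name));
console.log(rejected.map((item) => item.name));
```

Rejecting rather than re-sending leaves the decision to retry the whole transaction with the caller, which seems safer than replaying a partial block.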
In the meantime, @stanley115, are you able to reproduce the issue with `autoResendUnfulfilledCommands` set to false?
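For anyone wanting to try this, a minimal sketch of the client options follows. `autoResendUnfulfilledCommands` is an ioredis connection option; the host and port are placeholders, and the `require`/`new Redis` lines are commented out so the snippet runs without the library installed.

```javascript
// Options for an ioredis client with automatic resending of unfulfilled
// commands turned off. Host and port are placeholders for illustration.
const options = {
  host: "redis.example.invalid",
  port: 6379,
  autoResendUnfulfilledCommands: false,
};

// With ioredis installed, the client would be created like this:
// const Redis = require("ioredis");
// const redis = new Redis(options);

console.log(options);
```

With this setting, commands that were still awaiting a reply when the connection dropped should be rejected rather than sent again after the reconnect, so callers need to handle those rejections.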
This just hit us in production: after recovering from a failover, our sites started showing content from different users mixed together. Are there any workarounds?
As far as workarounds, we simply have to restart all connected services whenever we perform an elasticache upgrade. I'm not aware of any better workaround, but I do have a potential fix working locally that I've been meaning to PR. I've been sidetracked since then.
:tada: This issue has been resolved in version 4.16.1 :tada:
Your semantic-release bot :package::rocket:
Hi,
I am using ioredis version 4.2.0, and I found that ioredis may return unexpected results when a "failover primary" is triggered on Amazon ElastiCache.
To verify this issue, I wrote a program to do the following steps:
The source code is as follows:
When the program is executed, it runs normally without any error at the beginning:
However, when I trigger a "failover primary" on Amazon ElastiCache, it responds with the following error:
and the program terminates with the following error:
PS: I found that this issue cannot be reproduced when maxRetriesPerRequest is set to 0, and I strongly suspect that some state is corrupted when a retry happens.
Here are the ElastiCache cluster details: