mike-marcacci / node-redlock

A node.js redlock implementation for distributed, highly-available redis locks
MIT License

Lock not returning on AWS ElastiCache #29

Closed · lakamsani closed this issue 6 years ago

lakamsani commented 7 years ago

Hi, thanks for making this available. I have a question about this comment in the configuration section of the README:

// you should have one client for each redis node
// in your cluster

I am running a 9-node Redis cluster with 3 shards, each shard backed by one primary and 2 replicas, hosted on Amazon ElastiCache. For normal Redis use in my Node-based app cluster I simply use the single load-balancer-provided DNS name to reach the cluster for various Redis operations. I do have redis-cli access to each of the individual nodes, so I am able to supply individual clients to redlock.

To use this cluster with redlock, do I need to create clients for all 9 nodes, or is it OK to supply a separate client for just the 3 shard primaries (or any other 3 nodes)?

lakamsani commented 7 years ago

I started by testing with a single-node Redis on my Mac laptop (Node 7.5.0, Redis 3.2.5) and got this error when trying to acquire a lock.

TypeError: server.eval is not a function
    at request (/Users/me/my-node-app/node_modules/redlock/redlock.js:268:19)
    at /Users/me/my-node-app/node_modules/redlock/redlock.js:322:12
    at Array.forEach (native)
    at attempt (/Users/me/my-node-app/node_modules/redlock/redlock.js:321:24)
    at /Users/me/my-node-app/node_modules/redlock/redlock.js:326:10
    at Promise.wrappedExecutorCaller [as _execute] (/Users/me/my-node-app/node_modules/newrelic/lib/instrumentation/promise.js:216:18)
    at Promise._resolveFromExecutor (/Users/me/my-node-app/node_modules/bluebird/js/release/promise.js:475:18)
    at new Promise (/Users/me/my-node-app/node_modules/bluebird/js/release/promise.js:77:14)
    at Redlock._lock (/Users/me/my-node-app/node_modules/redlock/redlock.js:257:9)
    at Redlock.lock (/Users/me/my-node-app/node_modules/redlock/redlock.js:119:14)
lakamsani commented 7 years ago

That was a client setup issue with Redlock on my end. Locking and unlocking are working with a single Redis node; I will try the multi-node cluster shortly. Please let me know about the initial question. Thanks.

lakamsani commented 7 years ago

In the cluster I am getting MOVED errors. I'm not sure whether node_redis, redlock, or my app code should handle these.

 A redlock error has occurred: {"command":"EVAL","args":["if redis.call(\"get\", KEYS[1]) == ARGV[1] then return redis.call(\"del\", KEYS[1]) else return 0 end",1,"gametile:update-lock:58a4e3b6df3b7e41337b43b8","0f975da610ac1a586eabf7106b994160"],"code":"MOVED"}
lakamsani commented 7 years ago

I tried a small test by passing an ioredis client handle during initialization, and that test passed. I'm still evaluating how many nodes to pass into the client array, etc.
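
A minimal sketch of what that initialization might look like, assuming an ioredis Cluster client pointed at the ElastiCache configuration endpoint (the hostname, options, and lock TTL below are placeholders, not the exact values used):

```js
const Redis = require("ioredis");
const Redlock = require("redlock");

// Connect through the cluster's configuration endpoint (placeholder);
// ioredis discovers the remaining shard nodes automatically.
const cluster = new Redis.Cluster([
  { host: "my-cluster.clustercfg.usw2.cache.amazonaws.com", port: 6379 },
]);

// A single Cluster client is passed to redlock as one "server".
const redlock = new Redlock([cluster], {
  retryCount: 10,
  retryDelay: 200, // ms between attempts
});

async function updateTile(id) {
  const lock = await redlock.lock(`gametile:update-lock:${id}`, 1000); // 1 s TTL
  try {
    // ...critical section...
  } finally {
    await lock.unlock();
  }
}
```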

lakamsani commented 7 years ago

Locking worked in the cluster for a day, but now every lock attempt gives this error.

LockError: Exceeded 10 attempts to lock the resource "update-lock:58a75c9608b0c65b34a665ba".
   |     at /home/ec2-user/onkore-node/node_modules/redlock/redlock.js:317:20
   |     at tryCatcher (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/util.js:16:23)
   |     at Promise.successAdapter [as _fulfillmentHandler0] (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/nodeify.js:22:30)
   |     at Promise._settlePromise (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/promise.js:558:21)
   |     at Promise._settlePromise0 (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/promise.js:606:10)
   |     at Promise._settlePromises (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/promise.js:685:18)
   |     at Async._drainQueue (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/async.js:138:16)
   |     at Async._drainQueues (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/async.js:148:10)
   |     at Immediate.Async.drainQueues (/home/ec2-user/onkore-node/node_modules/bluebird/js/release/async.js:17:14)
   |     at runCallback (timers.js:649:20)
   |     at tryOnImmediate (timers.js:622:5)
   |     at processImmediate [as _immediateCallback] (timers.js:594:5)
mike-marcacci commented 7 years ago

Hi @lakamsani, sorry for the slow reply. When using a single cluster, a single client is sufficient. If you were to use multiple independent clusters, or multiple independent (i.e. non-clustered) Redis nodes, then you would use multiple clients.

The most recent error is certainly curious. It suggests that the lock is already held by another routine. If none should be held, there may have been a case where a lock was never released; such a deadlock should eventually be resolved by the TTL.
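
To make that distinction concrete, here is a rough sketch of the two configurations (hostnames are placeholders):

```js
const Redis = require("ioredis");
const Redlock = require("redlock");

// Case 1: one sharded cluster. A single Cluster client is sufficient,
// because the cluster itself provides the redundancy.
const oneCluster = new Redlock([
  new Redis.Cluster([{ host: "cluster-cfg.example.com", port: 6379 }]),
]);

// Case 2: several independent (non-clustered) Redis instances. One client
// per instance, so redlock can require a quorum (here 2 of 3) for each lock.
const independentNodes = new Redlock([
  new Redis({ host: "redis-a.example.com" }),
  new Redis({ host: "redis-b.example.com" }),
  new Redis({ host: "redis-c.example.com" }),
]);
```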

lakamsani commented 7 years ago

@mike-marcacci thanks for making this implementation available, and for the guidance on the host spec for a single cluster. We are indeed using a single client pointed at the load-balancer-provided host name for the 9 nodes in cluster mode. Regarding that last error: I did a GET on the lock name via redis-cli -c and don't see it anywhere, yet no Node process in the cluster is able to execute that critical-section routine. Is there any other debug logging we can turn on to see what is happening?

mike-marcacci commented 7 years ago

Hmm, what kind of contention are you seeing on these resources? That is, how many concurrent conflicting requests are being made? I'm curious whether contending nodes are preventing each other from ever acquiring the lock. I have wanted to add retry jitter for a while to prevent exactly this kind of issue, so you may have the use case that finally gets me to do it.
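
For readers arriving later: redlock has since gained a retryJitter option alongside retryCount and retryDelay. A sketch of how those settings fit together (the values below are illustrative, not recommendations):

```js
const Redis = require("ioredis");
const Redlock = require("redlock");

const cluster = new Redis.Cluster([{ host: "cluster-cfg.example.com", port: 6379 }]);

const redlock = new Redlock([cluster], {
  retryCount: 20,   // give up after 20 attempts
  retryDelay: 200,  // base delay between attempts, in ms
  retryJitter: 100, // up to 100 ms of extra random delay per attempt,
                    // so contending processes don't retry in lockstep
});
```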

lakamsani commented 7 years ago

It's a little strange. I have 3 environments (dev, test, and production), each with a successively larger footprint of Node servers and Redis cluster. The same code worked in dev and test yesterday afternoon but stopped working early this morning (2/17). In the dev environment I have just two Node servers, and it fails even when only one server tries to acquire a lock. All the servers are set to the UTC timezone. The Redis cluster is Amazon-managed (ElastiCache) in the us-west-2 region; I'm not sure how to check the timezone on those VMs. Is there a way to get it via redis-cli?
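
As an aside, the Redis TIME command returns the server's Unix timestamp, which is timezone-independent, so clock skew rather than timezone is what matters for lock TTLs. A quick skew check with ioredis might look like this (the host is a placeholder):

```js
const Redis = require("ioredis");

const redis = new Redis({ host: "redis.example.com" });

// TIME returns [unixSeconds, microseconds] as strings.
redis.time().then(([seconds, micros]) => {
  const serverMs = Number(seconds) * 1000 + Number(micros) / 1000;
  console.log("server/client clock skew (ms):", serverMs - Date.now());
});
```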

lakamsani commented 7 years ago

It still works with a single Node.js process and a non-clustered Redis on my dev laptop, where Redis 3.2.5 runs as a Docker container on Alpine Linux. In the local environment the connection is made via ioredis.createClient; on EC2, AWS ElastiCache runs 3.2.4 and the connection is made via ioredis.Cluster. But as you can imagine, I need this to work with the cluster.

mike-marcacci commented 7 years ago

Hey @lakamsani, I know this is very old now, and you've hopefully found another solution that's working for you. I'm going to go ahead and close this for now, but please reopen if you (or anybody else) experience a similar issue.

lakamsani commented 7 years ago

@mike-marcacci sounds good. For now we are using the simplistic approach of SET resource-name anystring NX EX max-lock-time, and it seems to work for us. We will revisit redlock debugging when we find more time and/or if the simpler approach starts failing in an AWS ElastiCache cluster.
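
For reference, that is the single-instance locking pattern from the Redis documentation. A minimal sketch with ioredis, including the check-and-delete release that avoids removing a lock you no longer hold (the key names and TTL are illustrative):

```js
const Redis = require("ioredis");
const crypto = require("crypto");

const redis = new Redis.Cluster([{ host: "cluster-cfg.example.com", port: 6379 }]);

async function acquire(resource, ttlSeconds) {
  const token = crypto.randomBytes(16).toString("hex");
  // NX: only set if the key does not already exist; EX: expire after ttlSeconds.
  const ok = await redis.set(resource, token, "EX", ttlSeconds, "NX");
  return ok === "OK" ? token : null;
}

async function release(resource, token) {
  // Only delete the key if it still holds our token, so we never release
  // a lock that has expired and been re-acquired by another process.
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    else
      return 0
    end`;
  return redis.eval(script, 1, resource, token);
}
```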

sarattatavarthi commented 6 years ago

@mike-marcacci @lakamsani we are also facing a similar issue, and we are also using ElastiCache; any help would be appreciated.

mike-marcacci commented 6 years ago

Hi @sarattatavarthi, I'll reopen this, and I'm going to rename it so it's easier for other ElastiCache users to find. I haven't actually tried it on AWS, so I'll spin up a test cluster when I get a moment.

mike-marcacci commented 6 years ago

Also, something you might consider: it looks like node_redis doesn't have the same cluster support as ioredis. If you are using the former, try the latter.

mike-marcacci commented 6 years ago

Hi @sarattatavarthi - did switching to ioredis make a difference here?

mike-marcacci commented 6 years ago

@lakamsani @sarattatavarthi I'm going to go ahead and close this issue, as this is the exact behavior that happens when using node-redis as the client in a cluster, and can be fixed by switching to ioredis.

I added a note to the readme calling this out. Please feel free to re-open if you (or anyone else) encounters this issue while using ioredis.

itayadler commented 5 years ago

@mike-marcacci on a single node deployment, is it critical to use ioredis as well to avoid this issue?

mike-marcacci commented 5 years ago

@itayadler it is not. The only difference that should matter is that ioredis works with cluster commands and responses, which is necessary for sharding and failover. If you are using a managed instance, I would definitely use ioredis, because if it scales up, you will want a client that knows what to do.

Are you experiencing any issues on a single node, or are you just picking a client?