redisson / redisson

Redisson - Easy Redis Java client and Real-Time Data Platform. Valkey compatible. Sync/Async/RxJava/Reactive API. Over 50 Redis or Valkey based Java objects and services: Set, Multimap, SortedSet, Map, List, Queue, Deque, Semaphore, Lock, AtomicLong, Map Reduce, Bloom filter, Spring, Tomcat, Scheduler, JCache API, Hibernate, RPC, local cache...
https://redisson.pro
Apache License 2.0

Watchdog doesn't renew RLock intermittently which leads to premature release of RLock #5698

Open dayonekoo opened 7 months ago

dayonekoo commented 7 months ago

Expected behavior

Expected sequence:

  1. Thread 1 acquires RLock with the default lockWatchdogTimeout of 30 seconds (see the sketch after this list).
  2. Redisson renews expiration on RLock for 5 minutes.
  3. Other threads try to acquire RLock for 5 minutes but fail because of expiration renewals.
  4. Thread 1 successfully releases RLock after 5 minutes.
  5. Another thread successfully acquires the lock.
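
For reference, here is a minimal sketch of the setting mentioned in step 1, using a bare Config object rather than our actual production configuration:

import org.redisson.config.Config

val config = Config()
// 30_000 ms is Redisson's default lockWatchdogTimeout; it is set explicitly here only for illustration.
// A lock acquired without an explicit leaseTime is re-extended by the watchdog roughly every
// lockWatchdogTimeout / 3, i.e. about every 10 seconds with this value.
config.lockWatchdogTimeout = 30_000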

Actual behavior

Sometimes the renewal in step 2 does not happen for more than 30 seconds; the lock then expires and another thread acquires it while the first thread is still holding it.

Steps to reproduce or test case

Three separate pods try to acquire the same RLock every minute in the following way:

val channelLock = redissonLock.getLock(channelLockName) // redissonLock is our RedissonClient, channelLockName is the lock key
if (channelLock.tryLock(10, TimeUnit.SECONDS)) {
    subscribeToChannel() // takes around 5 minutes to complete
    channelLock.unlock()
}

When the failure happens, multiple threads end up subscribing to the same channel because the channel lock is released before the 5 minutes are up.

Redis version

Elasticache engine version 7.1.0

Redisson version

6.27.1

Redisson configuration

useClusterServers().apply {
    password = redisPassword
    // The AWS elasticache cluster only gives one main config endpoint
    // Redisson determines from that endpoint how many nodes the redis cluster has
    // Note: Mutable list since internal methods may call ClusterServersConfig.addNodeAddress()
    nodeAddresses = mutableListOf(redisUrl)
    this.retryAttempts = retryAttempts
    // Redisson doesn't play nice with values that are not rounded to closest 100ms, see https://github.com/redisson/redisson/blob/09795b9b94c42e7a6985ecc1c804e5ca974f135f/redisson/src/main/java/org/redisson/connection/ServiceManager.java#L233-L246
    val redissonTimerRounding = Duration.ofMillis(100)
    val max100MsJitters = retryMaxJitter.dividedBy(redissonTimerRounding)
    val retryJitter = redissonTimerRounding.multipliedBy((-max100MsJitters..max100MsJitters).random())
    // Adjust retry attempts and interval with jitter to avoid RedisNodeNotFoundException on failover
    // See https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
    // When all pods use the same retry interval, the Redis nodes get overwhelmed with requests
    // Exponential backoff is not available in Redisson
    this.retryInterval = (retryInterval + retryJitter).toMillis().toInt()
    logger.debug { "Retry jitter calculated to be $retryJitter for this pod, for total retry interval of ${this.retryInterval}" }
    // Ping connection interval is also the ping timeout
    // During deploys of many pods or Redis node failover, pings may timeout due to heavy traffic on the node
    // Allow more time than the default of 30s
    pingConnectionInterval = pingInterval.toMillis().toInt()
}
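
For completeness, a rough sketch of how a fragment like the one above is wrapped into a client on our side (the endpoint and password below are placeholders, not our real settings, and the retry/ping tuning is elided):

import org.redisson.Redisson
import org.redisson.api.RedissonClient
import org.redisson.config.Config

val redisUrl = "redis://clustercfg.example.cache.amazonaws.com:6379" // placeholder endpoint
val redisPassword = "placeholder-password"

val config = Config().apply {
    useClusterServers().apply {
        password = redisPassword
        nodeAddresses = mutableListOf(redisUrl)
        // retryAttempts, retryInterval (with jitter) and pingConnectionInterval set as in the block above
    }
}
val redissonClient: RedissonClient = Redisson.create(config)
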
mrniko commented 6 months ago

> Sometimes the renewal in step 2 does not happen for more than 30 seconds; the lock then expires and another thread acquires it while the first thread is still holding it.

What is the reason? Do you see errors in the log?

mrniko commented 6 months ago

As an alternative, you can use tryLock() with the leaseTime parameter.
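
In code, that variant might look roughly like this (a sketch reusing channelLock and subscribeToChannel from the reproduction snippet above; the 6-minute lease is a placeholder chosen only to outlast the ~5-minute subscribe call):

// waitTime = 10 seconds to acquire, leaseTime = 360 seconds after which the lock expires on its own.
// With an explicit leaseTime the watchdog does not schedule any renewals.
if (channelLock.tryLock(10, 360, TimeUnit.SECONDS)) {
    try {
        subscribeToChannel()
    } finally {
        channelLock.unlock()
    }
}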

nigelmenger commented 6 months ago

@mrniko I am from the same team as @dayonekoo.

> What is the reason? Do you see errors in the log?

As @dayonekoo pointed out in https://github.com/redisson/redisson/issues/5697#issuecomment-2010889454, we are not able to see why the lock is released early. We suspect that the early-return conditions described below (or some other condition within that block) are being hit, causing renewExpirationAsync() not to be triggered, but we can't tell for sure because there are no logs there. Adding trace logs would help us determine the root cause of the issue.

We notice that the implementation of RedissonBaseLock.renewExpiration() returns early under certain circumstances (here & here), and we suspect this may be relevant to the non-renewals we are seeing. Would it be possible to add trace-level logs there so we can get more visibility into the renew-expiration behavior? As of now it is difficult to determine where exactly the renew-expiration flow goes wrong.

> As an alternative, you can use tryLock() with the leaseTime parameter.

That would extend the time, in a failure situation, before the lock can be acquired by another process or thread. Our understanding is that without a leaseTime parameter, scheduleExpirationRenewal() is called, which renews the lock expiration every 10 seconds by default (30 s / 3). So currently, in a failure situation, another process/thread can pick up the lock within at most about 30 seconds, whereas if we used a lease time we would have to set it to 60 seconds to match the process for which we hold the lock, which extends this failover time. Is that correct?
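
To make that timing concrete, a rough sketch of the arithmetic above (the 60-second lease is just the value from this paragraph, not a recommendation):

import java.time.Duration

// Watchdog mode (no leaseTime): the watchdog extends the lock every lockWatchdogTimeout / 3,
// so after a crash the key survives for at most ~lockWatchdogTimeout before another pod can take over.
val lockWatchdogTimeout = Duration.ofSeconds(30)
val renewalInterval = lockWatchdogTimeout.dividedBy(3)   // 10 seconds
val takeoverWindowWithWatchdog = lockWatchdogTimeout     // up to ~30 seconds

// Fixed-lease mode: no renewals are scheduled, so the lease itself is the
// worst-case takeover window if the holder dies without unlocking.
val leaseTime = Duration.ofSeconds(60)
val takeoverWindowWithLease = leaseTime                  // up to ~60 seconds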

Adding trace logs to renewExpiration(), scheduleExpirationRenewal(), and cancelExpirationRenewal() would help us debug what is going wrong.

mrniko commented 6 months ago

Which Redisson version is it? There is no 6.27.1 release.

nigelmenger commented 6 months ago

@mrniko We are using 3.27.2. 6.27.1 must have been a typo.