madelson / DistributedLock

A .NET library for distributed synchronization
MIT License

RedisDistributedLock Releases Lock Prematurely Before ParallelMethod Completes #190

Open BroderQi opened 4 months ago

BroderQi commented 4 months ago
// Repro using the open-source DistributedLock library (Medallion.Threading.Redis
// plus StackExchange.Redis): the method below keeps executing, but the lock is
// released once the lease expires, i.e. before the method finishes.

public void UseLock()
{
    var redisDistributedLock = new RedisDistributedLock("myKey", ConnectionMultiplexer.Connect("xxx.xxx.xxx.xxx:6379").GetDatabase());

    using (var handle = redisDistributedLock.TryAcquire())
    {
        ParallelMethod();
    }
}

private void ParallelMethod()
{
    List<int> numbers = Enumerable.Range(0, 30001).ToList();

    int processedCount = 0;
    Parallel.ForEach(numbers, item =>
    {
        Interlocked.Increment(ref processedCount);
        Thread.Sleep(500);

        Console.WriteLine($"There are still {numbers.Count - processedCount} items pending processing");
    });
}
madelson commented 4 months ago

@BroderQi I don’t understand what your repro is trying to show. You have just one lock call wrapping the parallel execution. What happens? What do you expect to happen?

Also, notice that you are calling TryAcquire but not checking the handle for null. See the examples in the docs for how to use TryAcquire, or just use Acquire.
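For example, something along these lines (a rough sketch of the documented pattern; the important part is checking the handle):

using (var handle = redisDistributedLock.TryAcquire())
{
    // TryAcquire returns null when the lock is already held elsewhere,
    // so check the handle before doing the protected work.
    if (handle != null)
    {
        ParallelMethod();
    }
    else
    {
        // Could not get the lock: skip, retry, or report.
    }
}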

BroderQi commented 4 months ago

[screenshot: the LeaseMonitor code in the debugger, showing the if statement where OnHandleLost is invoked]

The lock was acquired successfully and ParallelMethod started executing, but the LeaseMonitor released the lock before ParallelMethod completed.

I can tell because, at the if statement shown in the screenshot, OnHandleLost was invoked.

I expect the lock to be released only after ParallelMethod completes.

madelson commented 4 months ago

Does the issue only repro with Parallel? What if you just have a very long sleep instead, with no parallelism? If it's parallel-specific, you could be hitting thread pool starvation, where the renewal process can't get scheduled to do its work.
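For example, something like this hypothetical stand-in for ParallelMethod (single-threaded, but holding the lock well past the default expiry) would tell us whether it's really parallel-specific:

private void LongHoldMethod()
{
    // Hypothetical test: no Parallel, just hold the lock longer than the lease
    // so renewal has to kick in, without saturating the thread pool.
    Thread.Sleep(TimeSpan.FromSeconds(90));
}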

Also, does increasing the lease duration in the options affect it? I would assume that setting a duration longer than you need to hold the lock would fix the issue.

BroderQi commented 4 months ago

Thank you for your response. The issue only arises in multi-threaded scenarios, so I will increase the TTL. Additionally, could we adjust the priority of the renewal tasks, or use a different scheduling strategy, to make renewal more reliable?

madelson commented 4 months ago

Does increasing the TTL actually fix it for you? I just want to confirm before we assume renewal scheduling is the issue.

Currently there are two configurable parameters: the base TTL (expiry) and the renewal cadence (extension cadence). Right now these default to 30s and 10s, respectively.
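For reference, both can be raised via the options callback when constructing the lock; roughly like the sketch below (assuming the Expiry/ExtensionCadence builder methods, where database stands in for your IDatabase instance):

var redisDistributedLock = new RedisDistributedLock(
    "myKey",
    database,
    options => options
        .Expiry(TimeSpan.FromMinutes(5))            // base TTL (defaults to 30s)
        .ExtensionCadence(TimeSpan.FromSeconds(30)) // how often the lease is renewed (defaults to 10s)
);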

In your scenario, I'm curious how long it takes before the lease is observed to be lost. If the problem is that the extension comes too late, I'd expect this to happen some time after the 30s mark.

As for what we can do to fix this, if the problem is truly thread starvation I’m not sure there’s anything that can fix it 100%, since ultimately we need to get in there and run some tasks to do the renewal (multiple tasks since this is all async). I’m open to ideas though!

However, we could probably make it much less likely by shifting the default ratio so that the extension cadence is smaller relative to the expiry. Users can of course set the TTL arbitrarily high; the only downside is that if the process holding the lock crashes, the high TTL is how long you have to wait before someone else can acquire it.

hoerup commented 4 months ago

Imho this could be addressed with a documentation note on DistributedLock.Redis: if you run a large number of tasks via e.g. Parallel.For / Parallel.ForEach, there is a risk of losing a renewal due to thread starvation.

The note should also mention that, if you want to use DistributedLock.Redis together with Parallel.*, you should either set ParallelOptions.MaxDegreeOfParallelism or supply a custom TaskScheduler via ParallelOptions.TaskScheduler so the parallel invocation has its own thread pool.
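Something along these lines, for example (the limit here is just illustrative):

// Sketch: cap the parallelism so the thread pool isn't exhausted and the
// lock's background renewal tasks can still get scheduled.
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(numbers, parallelOptions, item =>
{
    Interlocked.Increment(ref processedCount);
    Thread.Sleep(500);
});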

madelson commented 3 months ago

@hoerup I think the guidance is a bit broader: if you are going to exhaust the thread pool, then any lease-based lock needs a higher TTL to avoid spurious expiry, since the renewal mechanism relies on being able to run background tasks. Alternatively, as you say, you can move your workload off the default scheduler to prevent this.