madelson / DistributedLock

A .NET library for distributed synchronization
MIT License

Postgres lock is not released in specific multi-threaded scenarios #147

Closed Tzachi009 closed 1 year ago

Tzachi009 commented 1 year ago

Hello, I started to use this library in a project that I am working on and encountered a strange issue. The issue seems to be somewhat similar to another open issue - #115, but they might be different.

So a little bit about my app - it runs on .NET 6 and uses EF Core as its ORM to run operations against a Postgres DB. It receives messages concurrently on different threads, and for each message the app creates a connection to the DB, begins a transaction, and then tries to acquire a Postgres advisory lock via this library's API using that DB connection. If the lock is acquired, some business logic runs and then the lock is released. If the lock is not acquired, the thread retries a few times, every few milliseconds, and throws an exception if the lock still cannot be acquired. Everything described here is done in a fully asynchronous manner.

The code regarding the lock looks like this (I simplified it):

var dataContext = new DataContext();
var transaction = await dataContext.Database.BeginTransactionAsync(token);
var dbConnection = dataContext.Database.GetDbConnection();

var lockKey = GetLockKey(); // This can be any long
var postgresAdvisoryLockKey = new PostgresAdvisoryLockKey(lockKey);
var postgresDistributedLock = new PostgresDistributedLock(postgresAdvisoryLockKey, dbConnection);

await using (var distributedLockHandle = await postgresDistributedLock.TryAcquireAsync(timeout, token))
{
    // distributedLockHandle is null if the lock was not acquired within the timeout;
    // in the real code that triggers the retry loop described above.

    // Some business logic...

    // Transaction is committed here 
}

// Transaction and Data Context are disposed here

It actually works, but only under specific multi-threaded scenarios. I ran some simple, short load tests on my app by sending it messages. The app is configured with a varying number of threads, each receiving messages and then trying to acquire the same lock simultaneously. Apparently, when only 4 threads (or fewer) try to acquire the lock, it is acquired and released correctly. However, if the app runs with more than 4 threads (let's say 8), then after a few dozen seconds one of the threads supposedly releases the lock, and then no other thread can acquire it anymore, as if the lock was never actually released. This happened on every run I did. I also tried using the sync Dispose method of the lock handle, but it did not change anything.

Now, when I look into the pg_locks table in Postgres using this query: SELECT * FROM pg_locks WHERE locktype = 'advisory'

I can see the following while the threads hang on the lock (8 threads), and it stays the same until I kill the app: [screenshot of the pg_locks query results]

It does seem like the lock suddenly stops being released, and I do not understand why. I added logging around everything, and the app appears to be working correctly. Then I found the issue that I mentioned at the start and began to wonder whether there is a bug in the library.

Am I using the library in a wrong way? Could it be that the Dispose/DisposeAsync or TryAcquire methods are throwing exceptions and swallowing them, returning null values, or failing to release the lock? Is there any way to check? I would be glad to hear your thoughts and answer any questions.

Thanks.

madelson commented 1 year ago

Hi @Tzachi009 thanks for filing.

On first glance, the only thing that seems atypical (but not wrong) about your scenario is that you are sharing a connection between EF and DistributedLock with an open transaction to boot. That shouldn't be an issue but possibly something is going wrong with how those features play together.

A few things to try:

Tzachi009 commented 1 year ago

Thanks for the quick response @madelson.

Tzachi009 commented 1 year ago

Ok, so I changed my code to use a connection string instead of a raw connection, and it indeed worked; the issue was not reproduced.

So, my whole plan was to have one connection handle both the transaction and the lock in order to get better lock integrity. You even mention something about this in the docs here. In my case, I must prevent the very rare scenario where the lock's connection is somehow closed/broken and the lock is released before the transaction is committed, which would allow a different thread to acquire the lock and enter the part of the code that must run in isolation.

I would prefer to keep using the same connection if possible. Do you have any suggestions about what I can do here? The library allows passing in a connection, so maybe I need to open a direct connection to the DB instead of using one managed by EF? Or maybe this part of the API should not be used like this?

madelson commented 1 year ago

@Tzachi009 thanks for the update. While using a connection string is a workaround, passing in an external connection is supposed to work, provided you don't do something like use the connection concurrently on multiple threads (which isn't supported by DbConnection in general).

What seems particularly odd to me about your scenario is that it seems to work sometimes but not every time; that smells like a race condition to me which could be in your code, in DistributedLock, or even in Npgsql.

I think the next step here is to get a minimal repro, something I can drop into the unit test project and debug against. Hopefully that should make it easier to get to the bottom of this.

EDIT: something else I just noticed; your code does this:

// create transaction
// acquire lock
// commit transaction
// release lock

It seems weird that we're interleaving creating/committing the transaction with acquiring/releasing the lock. Would it work to refactor it like this instead?

// acquire lock
// create transaction
// commit transaction
// release lock
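
Roughly, the refactored version could look like the following (just an untested sketch reusing the names from your snippet: DataContext, GetLockKey, timeout, token). Note that the connection is opened explicitly up front, since BeginTransactionAsync no longer runs before the lock is acquired:

var dataContext = new DataContext();
var dbConnection = dataContext.Database.GetDbConnection();

// Open the connection explicitly; previously BeginTransactionAsync opened it before the lock was taken.
await dataContext.Database.OpenConnectionAsync(token);

var postgresDistributedLock = new PostgresDistributedLock(new PostgresAdvisoryLockKey(GetLockKey()), dbConnection);

await using (var distributedLockHandle = await postgresDistributedLock.TryAcquireAsync(timeout, token))
{
    if (distributedLockHandle != null)
    {
        // Start the transaction only once the lock is held.
        await using var transaction = await dataContext.Database.BeginTransactionAsync(token);

        // Some business logic...

        await transaction.CommitAsync(token);
    }
}
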
Tzachi009 commented 1 year ago

Hi, I changed the code so the transaction is now created only after the lock is acquired. As I mentioned before, it was the other way around only because I initially thought that doing so would acquire a transactional advisory lock. However, it did not change anything regarding the issue. I also did a couple more runs, and for the first time the issue occurred even though the app ran with just 4 threads. It did not happen a second time, so I agree there is probably a race condition somewhere.

As you asked, I created a repro that reproduces the issue - https://github.com/Tzachi009/distributed-locks-issue. It is a simple console app, and the code there is similar to my own. You may need your own Postgres DB; you can change the connection string in the appsettings file. If you have any questions, I will gladly answer.

madelson commented 1 year ago

Thanks, I'll be able to take a deeper look at this soon. Out of curiosity, why do you have code like this?

for (int retryAttempt = 0; !isLockAcquired && retryAttempt <= lockRetrievalRetriesCount; retryAttempt++)
{
    isLockAcquired = await _distributedLockManager.TryAcquire(distributedLockKey, dbConnection, lockRetrievalTimeout.Value, token);

    if (!isLockAcquired)
    {
        await Task.Delay(lockRetrievalSleepTimeBetweenRetries.Value, token);
    }
}

Why not just use a longer timeout?

Tzachi009 commented 1 year ago

Are you talking about the lock timeout that I am passing to the library? I simply do not want to receive an overly delayed result for the lock handle if the lock cannot be acquired, so a larger timeout is not that good for me.

If you are referring to the Task.Delay part: in my actual app's code, I expect the lock to be held for a few dozen milliseconds on average, if not most of the time. I also want to avoid throwing an exception just because the lock is currently held by other threads, so I let the thread retry a few times and sleep between retries. At the same time, I want to avoid blocking the thread for too long on a single message.

The retry and timeout values in the test app are just for testing. The bottom line is that, in a production environment, I want to minimize the time spent processing a message while the lock is held by other threads, if possible. I believe the current code lets me do that and balance things out (I can also configure the values from outside the process, of course).

If you think something that I did is wrong, then please do tell.

madelson commented 1 year ago

@Tzachi009 Ok, so I have your code running locally and I can reproduce the issue. Interestingly, when I change the retry loop approach I mentioned above to use a single wait instead, like this:

isLockAcquired = await _distributedLockManager.TryAcquire(
    distributedLockKey,
    dbConnection,
    TimeSpan.FromTicks((lockRetrievalTimeout.Value.Ticks + lockRetrievalSleepTimeBetweenRetries.Value.Ticks) * (lockRetrievalRetriesCount + 1)),
    token);

The problem disappears for me. Can you reproduce the same? There is definitely still an underlying race condition here that I'll continue to look into, but this might be a good workaround to consider for now.

Tzachi009 commented 1 year ago

Interesting, I can confirm that I get the same result. I guess the retries may have been triggering the race condition. Regardless, it seems to solve the issue, so I think it is a good workaround for now.

Thanks for the help @madelson. I will be glad to be notified when you do find the source of the race condition.

madelson commented 1 year ago

Glad that this workaround seems effective. Yes, I was referring to a retry loop with sleeps vs. a single wait with a longer timeout. I think in general you'd want the single wait, since you get better fairness (threads stay in line instead of repeatedly giving up) and less resource usage due to fewer DB round trips.

madelson commented 1 year ago

@Tzachi009 ok I've used your repro to track down the root cause of the race condition!

It was indeed related to the particular pattern of usage: multiple retries on the same connection where each individual try has a very short timeout. Essentially here's what happens:

One fix I've come up with is to always execute a try-release upon lock timeout just to make sure things are cleaned up.
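
Conceptually, the cleanup amounts to something like the following. This is only an illustrative sketch in plain Npgsql, not the literal code going into the library; it assumes the same dbConnection/lockKey variables from the earlier snippets and that the connection is an NpgsqlConnection:

// After TryAcquire reports a timeout, try to release anyway: if the server granted the lock
// right as the cancellation arrived, this cleans it up; otherwise it just returns false
// (Postgres logs a warning about releasing a lock that is not held).
await using var cleanupCommand = new NpgsqlCommand("SELECT pg_advisory_unlock(@key)", (NpgsqlConnection)dbConnection);
cleanupCommand.Parameters.AddWithValue("key", lockKey);

// pg_advisory_unlock returns true if this session held the lock (and it is now released), false otherwise.
var wasActuallyHeld = (bool)(await cleanupCommand.ExecuteScalarAsync(token))!;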

Tzachi009 commented 1 year ago

Interesting. It does sound like a bug in the Postgres layer, and your fix sounds good enough.

madelson commented 1 year ago

Got confirmation that this is Postgres behavior: https://www.postgresql.org/message-id/17686-fb1fa3870138e394%40postgresql.org

Working on a simple fix which is to just re-check whether the lock is acquired after a timeout.