samcook / RedLock.net

An implementation of the Redlock algorithm in C#

Lock not extended #97

Closed · ssougnez closed this issue 2 years ago

ssougnez commented 2 years ago

Hello,

The application runs scheduled jobs and uses RedLock to synchronize them (to avoid multiple executions at the same time, on different servers, etc.).

Basically, the code is the following:

using var rlock = _redLockFactory.CreateLock(GetType().FullName, TimeSpan.FromSeconds(30));

if (rlock.IsAcquired == true)
{
    await ExecuteAsync(context.CancellationToken);
}

The lock is created with a using declaration, so I'm sure it is disposed correctly. However, there is some faulty behavior. To investigate, I created a job that simply does:

await Task.Delay(30000);

Then I launched the app and watched the state of the lock in Redis. At first it has an expiry of 30s, which decreases every second. Once it reaches 15s, the lock is extended back to an expiry of 30s.

So this is the correct behavior.

However, the behavior is not always consistent: for some reason, at some point, the lock does not get extended, its expiry reaches 0, and the lock gets recreated.

This should not happen: either the job is still running and the code above is supposed to keep extending the lock, or the job terminates and disposes the lock (removing it from Redis). Either way, the lock's expiry should never drop below 15s.
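
For reference, the intended lifecycle can be sketched with the async API RedLock.net also exposes; this is a minimal illustration reusing the _redLockFactory, lock key, and ExecuteAsync call from the snippet above, not the application's actual code:

// Minimal sketch of the expected lifecycle, assuming the same
// _redLockFactory, context and ExecuteAsync as in the snippet above.
using (var rlock = await _redLockFactory.CreateLockAsync(GetType().FullName, TimeSpan.FromSeconds(30)))
{
    if (rlock.IsAcquired)
    {
        // While the job runs, RedLock keeps extending the key
        // (at roughly half the configured expiry).
        await ExecuteAsync(context.CancellationToken);
    }
}
// Disposal releases the lock and deletes the key from Redis.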

Moreover, I downgraded to 2.2.0 to see whether I had the same behavior there, and I didn't, so it seems to be a critical regression in 2.3.1.

I'm using this setup in multiple applications, and in each of them I used to see some weird behavior from time to time; this bug would explain all of it.

Cheers

samcook commented 2 years ago

Thanks for the report.

There were some changes made around lock extension between 2.2.0 and 2.3.1 to avoid a race condition during unlocking, so potentially there's something not quite right there.

Are you able to reproduce the problem with RedLock logging enabled at Debug level (pass in an ILoggerFactory when creating your RedLockFactory) and post the logs here?

ssougnez commented 2 years ago

So... I don't have much experience with LoggerFactory, so I'm not sure I did it correctly:

RedLockFactory.Create(endPoints, LoggerFactory.Create(builder => builder.AddConsole()))

And in appsettings.json, I put:

"LogLevel": {
  "RedLockNet":  "Debug"
}
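
A possible reason only Info-level entries show up in the logs below: a logger factory built with LoggerFactory.Create does not read appsettings.json by itself, so the "RedLockNet": "Debug" level is never applied. A sketch of one way to wire the configuration in (assuming the app's IConfiguration is available as config, the LogLevel section sits under "Logging", and the Microsoft.Extensions.Logging.Configuration package is referenced); alternatively, an explicit minimum level works, as suggested later in the thread:

// Sketch: feed the "Logging" section from appsettings.json into the
// standalone logger factory so the "RedLockNet": "Debug" level applies.
var loggerFactory = LoggerFactory.Create(builder => builder
    .AddConfiguration(config.GetSection("Logging"))
    .AddConsole());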

You'll find the logs at the end of this message. While doing these tests, I was also able to narrow down the scenario in which it occurs. Let's say that every second, the code tries to acquire a lock for a job with this line:

var rlock = _redLockFactory.CreateLock(GetType().FullName, TimeSpan.FromSeconds(30))

When the app starts, it gets a lock, so an entry is created in Redis with an expiry of 30s. The job takes exactly 30 seconds to execute (as it just runs Task.Delay(30000) for the tests). The sequence is then:

- After 15 seconds, the lock is extended.
- After 15 more seconds, the line above is executed again just before the job terminates, so the lock is extended.
- Right after that, the job terminates, so the lock gets deleted.
- Less than a second later, the code tries to re-execute the job with the line above and acquires a new lock.
- For whatever reason, it is THIS lock that does not get extended correctly: its expiry just decreases to 0, and then the lock gets recreated.

When I check the logs (below), I don't really find anything weird, but I suspect I didn't configure the logging correctly, as I only see Info logs and no Debug ones.

info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (35df3957-3ea8-4adb-8309-ed4b82571d4d)
info: RedLockNet.SERedis.RedLock[0] Lock status: Acquired (Acquired: 1, Conflicted: 0, Error: 0), SFPD.KmNet.Jobs.ImportJob (a4fa28c2-ada7-40ae-a6d6-f9e22769dce2)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (8982a3af-904d-469c-8002-a17355d9237f)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (0e847140-f2de-4f20-b333-20430b5a947f)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (f5dabb5d-8bd2-4c4a-acf3-0c2e18ae8c9f)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (096b7272-9b67-44c8-8d34-9e8179684513)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (df4021a9-d4f3-4f36-b3af-2a8cc5e29f49)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (090d8541-03fe-4aec-962f-b9998cdfee22)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (5fdc41d8-0fd9-4ea4-b16a-924396976349)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (9d353384-a26d-4980-80d2-e31740871540)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (d1b0d6da-c4db-4eaf-b046-7749b0aaea77)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (918e4ea2-339b-41ba-a081-32a24938f97d)
info: RedLockNet.SERedis.RedLock[0] Lock status: Acquired (Acquired: 1, Conflicted: 0, Error: 0), SFPD.KmNet.Jobs.CleaningJob (c4d48634-d530-40f4-a89e-5fcb00e0e5cf)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (6ab26856-17d6-4ca9-8eba-eb60f1c8a1ad)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (86900e3c-075a-4f74-8054-3fab06eded8b)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (d64672e3-2f60-4441-9881-6bddde573e00)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (029268ce-3b2a-42f3-836f-f8082ef63df3)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (544da76f-681d-4e4c-86d9-a4ee2376442a)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (6f7850a9-7b6a-40e4-af24-e88ad1582591)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (a573cb20-4b63-428f-92d6-3b700bb2aff4)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (24add582-eb3a-42e9-b496-9d9d3d1da484)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (440c03b4-8af7-4c27-aba2-9c8089f7f28c)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (36e8855d-6c08-4245-b891-85a2831fbe03)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (424e5ca7-23cf-4aa9-a885-c54fdaefc200)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (740be884-49b8-460e-b75b-daf6de766acc)
info: RedLockNet.SERedis.RedLock[0] Lock status: Conflicted (Acquired: 0, Conflicted: 1, Error: 0), SFPD.KmNet.Jobs.ImportJob (40d826af-1700-4975-91b1-c9c6f8688a41)

samcook commented 2 years ago

I think you can pass in a LogLevel to AddConsole() if you can't get it to pick it up from the app settings.

ssougnez commented 2 years ago

I tried, but there is no overload accepting a log level. I also tried the syntax in your test project, but I'm using .NET 6 and the way to create the factory has changed a bit. I'll dig deeper tomorrow if I have some time.

samcook commented 2 years ago

Yeah, I was trying to repro it in a test case myself.

The syntax for the newer logging looks something like this:

LoggerFactory.Create(b => b.AddSimpleConsole(opts =>
{
    opts.TimestampFormat = "HH:mm:ss.ffff ";
    opts.SingleLine = true;
})
.SetMinimumLevel(LogLevel.Trace));
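
As a sketch of how that plugs into the factory creation from earlier in the thread (reusing the endPoints variable from your snippet above):

// Sketch: pass the verbose logger factory to RedLock when creating the factory.
var loggerFactory = LoggerFactory.Create(b => b
    .AddSimpleConsole(opts =>
    {
        opts.TimestampFormat = "HH:mm:ss.ffff ";
        opts.SingleLine = true;
    })
    .SetMinimumLevel(LogLevel.Trace));

var redLockFactory = RedLockFactory.Create(endPoints, loggerFactory);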

samcook commented 2 years ago

Also, does the situation where the lock doesn't extend the second time still occur if there isn't any contention for the lock by other threads/processes?

markheys commented 2 years ago

We are seeing a similar issue in our production environment that we cannot reproduce in our test environment, so it must be related to load. When we roll back from 2.3.1 to 2.2.0, the issue goes away.

samcook commented 2 years ago

I've managed to reproduce the issue and identify the problem. A bug was introduced in 2.3.1 where, if the lock wasn't acquired on the first attempt (e.g. because it was already held) but was subsequently acquired during a retry, the extend timer would be cancelled and the lock would not auto-renew.
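
A rough repro sketch of that scenario (not taken from the project's test suite; the factory variable, resource name, and timings are illustrative, and it assumes the CreateLockAsync overload that also takes wait and retry times):

// Sketch: force the "acquired on a retry" path described above.
var resource = "example-job-lock";            // illustrative resource name
var expiry = TimeSpan.FromSeconds(30);

// One caller takes the lock and holds it for a few seconds.
var holder = await factory.CreateLockAsync(resource, expiry);

// A second caller starts while the lock is still held, so its first
// attempt conflicts and it keeps retrying every second for up to 10s.
var contenderTask = factory.CreateLockAsync(resource, expiry,
    TimeSpan.FromSeconds(10),                 // wait
    TimeSpan.FromSeconds(1));                 // retry

await Task.Delay(TimeSpan.FromSeconds(3));
holder.Dispose();                             // release; the next retry succeeds

using (var rlock = await contenderTask)
{
    if (rlock.IsAcquired)
    {
        // Under 2.3.1 this lock is never auto-extended, so its Redis key
        // expires while the simulated job below is still running.
        await Task.Delay(TimeSpan.FromSeconds(60));
    }
}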

I've pushed a pre-release version 2.3.2-pre1 if you'd like to give that a try and verify that it solves the issue for you.

ssougnez commented 2 years ago

Ok, I'll try it soon.

Sorry for the delay, I'm currently on holiday, so... not much time.

markheys commented 2 years ago

I will put it in our "test" environment now and, if there are no issues, try it in "production". Thanks.

markheys commented 2 years ago

It ran for a couple of hours in "test" with no issues and overnight in "production" with no issues. Last time we put it in "production", issues appeared within minutes, so I feel like your fix has resolved this.

samcook commented 2 years ago

Thanks for testing. I've pushed out version 2.3.2 (which is the same as 2.3.2-pre1), which contains the fix for this issue.