tristanls / dynamodb-lock-client

A general purpose distributed locking library built for AWS DynamoDB.
MIT License
50 stars 19 forks source link

fix: race condition between heartbeat and releasing lock #35

Open bfraterman-tkhsecurity opened 2 years ago

bfraterman-tkhsecurity commented 2 years ago

Releasing of a lock could fail (putItem would fail the condition) when there was a heartbeat running at the same time.

When the 2 calls to dynamoDB happen synchronously, the GUID used by the heartbeat would be put into dynamodb, while the previous GUID would be used to release the lock. Releasing of the lock would therefore fail. I've only tested this for FailOpen locks, where there were longer waits because of this. Perhaps with FailClosed locks the problem would be even worse.

The release of a lock now waits until the heartbeat is finished, using a Promise.

Since it relies on very specific timing it's hard to write a unit test for this.

tristanls commented 2 years ago

Thank you @bfraterman-tkhsecurity . What you described does look like a valid race condition for FailOpen lock, which can cause longer wait times (maximum leaseDurationMs) for released locks.

There is no problem with FailClosed lock because there are no heartbeats on those.

I think what we need is a semaphore. Now that you've pointed out the race condition, it is insufficient for release to wait for heartbeat, release and heartbeat can only happen one at a time.

Failure case interleaving with release only waiting for heartbeat to start:

bfraterman-tkhsecurity commented 2 years ago

Basically, the self._heartbeatPromise is used as semaphore. If it exists, releasing the lock will wait until the heartbeat is finished. The other way around is safeguarded by self._released.

The failure scenario that you describe won't happen in this PR. Node is single-threaded, so no 2 pieces of code will run simultaneous. But whenever a thenable is used, it will be scheduled in node when doing something asynchronously, giving the possibility to yield to another function or thenable that is scheduled.

So the problem in the original code was:

In the PR this is fixed by the following:

But maybe the code would become better readable when using something like https://www.npmjs.com/package/async-mutex. What do you think?

And do you want me to refactor the code to be able to test this in a unit test? This will grow the PR considerably I think, since all heartbeat and releasing code will have to be split up and moved around.

bfraterman-tkhsecurity commented 11 months ago

Now that AWS is deprecating the V2 SDK, we're looking at migrating to the V3 SDK.

I wonder if/how this PR can be merged? Otherwise we'll have to keep using our fork for our project.

tristanls commented 11 months ago

Hi @bfraterman-tkhsecurity ,

I got distracted and haven't come back to this repo since my last comment.

With a migration to v3, it probably makes sense to use your fork. I don't intend to spend more time on this module in the near future, so it may be worth publishing your v3 migrated fork for the community.

Thank you again for the PR.

Cheers