fix: race condition between heartbeat and releasing lock

bfraterman-tkhsecurity commented 2 years ago

Releasing of a lock could fail (putItem would fail the condition) when there was a heartbeat running at the same time.

When the 2 calls to dynamoDB happen synchronously, the GUID used by the heartbeat would be put into dynamodb, while the previous GUID would be used to release the lock. Releasing of the lock would therefore fail. I've only tested this for FailOpen locks, where there were longer waits because of this. Perhaps with FailClosed locks the problem would be even worse.

The release of a lock now waits until the heartbeat is finished, using a Promise.

Since it relies on very specific timing it's hard to write a unit test for this.

tristanls commented 2 years ago

Thank you @bfraterman-tkhsecurity . What you described does look like a valid race condition for FailOpen lock, which can cause longer wait times (maximum leaseDurationMs) for released locks.

There is no problem with FailClosed lock because there are no heartbeats on those.

I think what we need is a semaphore. Now that you've pointed out the race condition, it is insufficient for release to wait for heartbeat, release and heartbeat can only happen one at a time.

Failure case interleaving with release only waiting for heartbeat to start:

heartbeat_start
release_start (wait)
heartbeat_commit
release_start (resume)
heartbeat_start
heartbeat_commit
release_commit (fail)

bfraterman-tkhsecurity commented 2 years ago

Basically, the self._heartbeatPromise is used as semaphore. If it exists, releasing the lock will wait until the heartbeat is finished. The other way around is safeguarded by self._released.

The failure scenario that you describe won't happen in this PR. Node is single-threaded, so no 2 pieces of code will run simultaneous. But whenever a thenable is used, it will be scheduled in node when doing something asynchronously, giving the possibility to yield to another function or thenable that is scheduled.

So the problem in the original code was:

heartbeat_start. Runs until self._config.dynamodb.put with a new GUID is called.
release_start. self._guid still contains the old GUID. Runs until self._config.dynamodb.put is called (with the old GUID).
heartbeat_finished. The put is done and the rest of the heartbeat code (in the put callback) runs. self._guid is set with the new GUID.
release_finish (fail). The put of the release method fails because of the conditional check on GUID mismatch in dynamoDB.

In the PR this is fixed by the following:

heartbeat_start. self._heartbeatPromise is set. All exit paths out of refreshLock resolve the Promise. self._released is checked so we're certain release_start didn't run yet. Runs until self._config.dynamodb.put is called.
release_start. Sets self._released. Checks if self._heartbeatPromise is set. If it is, it will wait for the promise to resolve.
heartbeat_finished. The put is done and the rest of the heartbeat code (in the put callback) runs. self._guid is set with the new GUID. The self._heartbeatPromise is resolved.
release_continue. The self._guid contains the new GUID. Runs until self._config.dynamodb.put is called (with the new GUID).
release_finish. The put of the release succeeds.

But maybe the code would become better readable when using something like https://www.npmjs.com/package/async-mutex. What do you think?

And do you want me to refactor the code to be able to test this in a unit test? This will grow the PR considerably I think, since all heartbeat and releasing code will have to be split up and moved around.

bfraterman-tkhsecurity commented 11 months ago

Now that AWS is deprecating the V2 SDK, we're looking at migrating to the V3 SDK.

I wonder if/how this PR can be merged? Otherwise we'll have to keep using our fork for our project.

tristanls commented 11 months ago

Hi @bfraterman-tkhsecurity ,

I got distracted and haven't come back to this repo since my last comment.

With a migration to v3, it probably makes sense to use your fork. I don't intend to spend more time on this module in the near future, so it may be worth publishing your v3 migrated fork for the community.

Thank you again for the PR.

Cheers

tristanls / dynamodb-lock-client

fix: race condition between heartbeat and releasing lock #35