Open bfraterman-tkhsecurity opened 2 years ago
Thank you @bfraterman-tkhsecurity . What you described does look like a valid race condition for FailOpen lock, which can cause longer wait times (maximum leaseDurationMs
) for released locks.
There is no problem with FailClosed lock because there are no heartbeats on those.
I think what we need is a semaphore. Now that you've pointed out the race condition, it is insufficient for release to wait for heartbeat, release and heartbeat can only happen one at a time.
Failure case interleaving with release only waiting for heartbeat to start:
Basically, the self._heartbeatPromise
is used as semaphore. If it exists, releasing the lock will wait until the heartbeat is finished. The other way around is safeguarded by self._released
.
The failure scenario that you describe won't happen in this PR. Node is single-threaded, so no 2 pieces of code will run simultaneous. But whenever a thenable
is used, it will be scheduled in node when doing something asynchronously, giving the possibility to yield to another function or thenable that is scheduled.
So the problem in the original code was:
self._config.dynamodb.put
with a new GUID is called.self._guid
still contains the old GUID. Runs until self._config.dynamodb.put
is called (with the old GUID).put
is done and the rest of the heartbeat code (in the put
callback) runs. self._guid
is set with the new GUID.put
of the release method fails because of the conditional check on GUID mismatch in dynamoDB.In the PR this is fixed by the following:
self._heartbeatPromise
is set. All exit paths out of refreshLock
resolve the Promise. self._released
is checked so we're certain release_start didn't run yet. Runs until self._config.dynamodb.put
is called.self._released
. Checks if self._heartbeatPromise
is set. If it is, it will wait for the promise to resolve.put
is done and the rest of the heartbeat code (in the put
callback) runs. self._guid
is set with the new GUID. The self._heartbeatPromise
is resolved.self._guid
contains the new GUID. Runs until self._config.dynamodb.put
is called (with the new GUID).put
of the release succeeds.But maybe the code would become better readable when using something like https://www.npmjs.com/package/async-mutex. What do you think?
And do you want me to refactor the code to be able to test this in a unit test? This will grow the PR considerably I think, since all heartbeat and releasing code will have to be split up and moved around.
Now that AWS is deprecating the V2 SDK, we're looking at migrating to the V3 SDK.
I wonder if/how this PR can be merged? Otherwise we'll have to keep using our fork for our project.
Hi @bfraterman-tkhsecurity ,
I got distracted and haven't come back to this repo since my last comment.
With a migration to v3, it probably makes sense to use your fork. I don't intend to spend more time on this module in the near future, so it may be worth publishing your v3 migrated fork for the community.
Thank you again for the PR.
Cheers
Releasing of a lock could fail (putItem would fail the condition) when there was a heartbeat running at the same time.
When the 2 calls to dynamoDB happen synchronously, the GUID used by the heartbeat would be put into dynamodb, while the previous GUID would be used to release the lock. Releasing of the lock would therefore fail. I've only tested this for FailOpen locks, where there were longer waits because of this. Perhaps with FailClosed locks the problem would be even worse.
The release of a lock now waits until the heartbeat is finished, using a Promise.
Since it relies on very specific timing it's hard to write a unit test for this.