madelson / DistributedLock

A .NET library for distributed synchronization
MIT License
1.86k stars 192 forks source link

Transient error leads to unhandled 409 when releasing an Azure lease #132

Open rmt2021 opened 2 years ago

rmt2021 commented 2 years ago

Describe the bug

We encounter an unhandled exception thrown by Release or ReleaseAsync method when transient errors happen. The exception is thrown from the Azure SDK API Release inside ReleaseAsync with an error code 409 (LeaseIdMismatchWithLeaseOperation):

...
    this._blobLeaseClient.Release();
    return default;
...

When releasing the lease, transient errors (e.g., timeout or package dropped) may happen during the transmission of the response. Release SDK API may retry releasing the lease multiple times until the lease is successfully released. This retry mechanism can lead to a 409 error even if the lease has already been released. One concrete scenario could be:

  1. BlobLeaseClientWrapper.ReleaseAsync sends an HTTP request to release the lease first;
  2. The lease is released by this first try, but a timeout happens, so the HTTP response never reaches the DistributedLock;
  3. The SDK API Release will automatically retry the release operation due to the timeout;
  4. The second try fails and encounters a 409 LeaseIdMismatchWithLeaseOperation error since the lease has already been released by the first request;
  5. The Azure SDK throws this 409 exception to DistributedLock, and ReleaseAsync in DistributedLock would not catch it (because there is no try-catch block in ReleaseAsync).

To reproduce

When the first release request sent by Release in ReleaseAsync succeeds in remote but its response never reaches the client, LeaseIdMismatchWithLeaseOperation will appear in the response of the second release request, which results in an unhandled exception.

Expected behavior & Possible Solution

Add a catch block of ReleaseAsync that catches the RequestFailedException and check whether it is a 409 response. If it is, the ReleaseAsync may need to handle the exception e.g., ignore it and print some logs.

Additional Context

We are willing to submit a PR to fix it.

madelson commented 2 years ago

@rmt2021 thanks for filing and for the detailed report!

If I understand correctly, the retry behavior happens within the Azure SDK and results in these spurious 409 errors, Your proposal is to swallow the 409s.

One concern I have here is that if we make this change we lose the ability to detect the difference between this case and misuse of the API. Perhaps that's just a pill we have to swallow, but I wonder if this issue could be fixed internally within the Azure SDK's retry functionality? It seems kind of silly for them to do a retry on failure and then throw when the retry "succeeds".

madelson commented 2 years ago

@rmt2021 I've attempted to file this as a bug report in Azure SDK as a starting point. If you could add any additional details there that would be helpful. What they say and whether they are willing to fix will determine what DistributedLock can/should do about this.

rmt2021 commented 2 years ago

Sorry for the late reply @madelson.

Do I understand the issue and proposed solution correctly?

Yes!

Have you filed anything with Azure SDK for this? What did they say?

Actually I have ask them here: https://github.com/Azure/azure-sdk-for-net/issues/29673, but no one has replied yet.

Is there something that should mitigate my concern about swallowing real errors with this change?

Your concern indeed makes a lot of sense. Another way is to check whether the lease exists before releasing it, but I don't know how to achieve this in an atomic manner, and there is no API like ReleaseIfExists.

Thanks for creating a bug report in Azure SDK!