pulumi / pulumi-azure-native

Azure Native Provider

Destroy Azure Cosmos DB throwing exception #1266

Open masashi-shib opened 2 years ago

masashi-shib commented 2 years ago

Hello!

Issue details

When running pulumi destroy, we get an exception saying that the Cosmos DB account already has an operation in progress, which appears to be the delete operation triggered by pulumi destroy itself.

Running it again throws the same exception with the same error.

Note: an Azure Cosmos DB account takes a long time to provision and to delete.

  pulumi:pulumi:Stack :
    error: update failed

  azure-native:documentdb:DatabaseAccount:
    error: Code="PreconditionFailed" Message="There is already an operation in progress which requires exclusive lock on this service xxx. Please retry the operation after sometime.\r\nActivityId: xxxx, Microsoft.Azure.Documents.Common/2.14.0"

package.json

"dependencies": {
    "@pulumi/azure": "^4.0.0",
    "@pulumi/azure-native": "^1.0.0",
    "@pulumi/pulumi": "^3.3.1",

Steps to reproduce

  1. Write code to create a new DB account: new azureNative.documentdb.DatabaseAccount(... (see the sketch after this list)
  2. pulumi up
  3. pulumi destroy
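
For illustration, here is a minimal TypeScript sketch of step 1, assuming @pulumi/azure-native (resource names and settings are illustrative, not the reporter's original code):

```typescript
import * as azureNative from "@pulumi/azure-native";

// Illustrative resource group; the original code is not shown in the issue.
const resourceGroup = new azureNative.resources.ResourceGroup("cosmos-rg");

// Minimal single-region Cosmos DB account (step 1); deleting this resource
// is the long-running operation the error message complains about.
const account = new azureNative.documentdb.DatabaseAccount("cosmos-account", {
    resourceGroupName: resourceGroup.name,
    databaseAccountOfferType: "Standard",
    locations: [{ locationName: resourceGroup.location, failoverPriority: 0 }],
    consistencyPolicy: { defaultConsistencyLevel: "Session" },
});
```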

Expected: I am not sure, maybe the author can reply :)
Actual: An exception is thrown by Pulumi.

mikhailshilkov commented 2 years ago

Is it just the DatabaseAccount resource that you are creating or any other Cosmos resources? Can you share the particular code for the account? (we have a simple nightly test with it and we don't get these errors). Thank you!

masashi-shib commented 2 years ago

@mikhailshilkov thank you for the response. Basically we are creating the account and also a Database resource.

new documentdb.DatabaseAccount(...)
new documentdb.MongoDBResourceMongoDBDatabase(...)
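
For context, a hedged sketch of how such a pair is typically wired together, assuming a MongoDB-kind account (argument values are illustrative; the actual code is not shown here):

```typescript
import * as azureNative from "@pulumi/azure-native";

// A MongoDB-flavoured Cosmos DB account; kind must be MongoDB to host Mongo databases.
const account = new azureNative.documentdb.DatabaseAccount("mongo-account", {
    resourceGroupName: "my-rg", // illustrative
    kind: "MongoDB",
    databaseAccountOfferType: "Standard",
    locations: [{ locationName: "westeurope", failoverPriority: 0 }],
});

// A Mongo database inside that account; on destroy, Pulumi deletes it before the account.
const db = new azureNative.documentdb.MongoDBResourceMongoDBDatabase("mongo-db", {
    resourceGroupName: "my-rg", // illustrative
    accountName: account.name,
    resource: { id: "appdb" },
    options: {},
});
```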

Strangely enough, it is not consistently reproducible.

mikhailshilkov commented 2 years ago

How many regions are you deploying to?

masashi-shib commented 2 years ago

Just one region currently.

masashi-shib commented 2 years ago

[Screenshot: AzureCosmosProblem]

Attached is the state of the Cosmos DB account in the Azure Portal. As you can see, it stays in the Deleting state for 5-10 minutes after Pulumi has thrown that error. Again, it is not consistently reproducible.

justinmchase commented 2 years ago

Any update on this? I am seeing something similar, and there is remarkably little information about this error message available.

sloncho commented 1 year ago

Cosmos DB is very slow to delete. I suspect the reason behind the exception is that the delete operation times out, Pulumi (or the SDK) retries, and because the resource is already being deleted, the retry throws. We solved this by using custom resource options with a custom timeout for the delete operation.
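
For reference, a minimal sketch of that workaround in TypeScript (the arguments and the 30-minute value are illustrative, not our exact configuration):

```typescript
import * as azureNative from "@pulumi/azure-native";

// Give the Cosmos DB account more time to finish deleting before Pulumi gives up.
const account = new azureNative.documentdb.DatabaseAccount("cosmos-account", {
    resourceGroupName: "my-rg", // illustrative
    databaseAccountOfferType: "Standard",
    locations: [{ locationName: "westeurope", failoverPriority: 0 }],
}, {
    customTimeouts: { delete: "30m" }, // illustrative value
});
```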

But maybe changing the default delete timeout for DatabaseAccount would be better.

mikocot commented 1 year ago

@mikhailshilkov is this something you're planning to fix anytime soon? It's been ~2 years since the bug was reported, and we occasionally hit the same issue.

mikocot commented 11 months ago

We've seen the same issue with other resources that depend on Cosmos, this time a private endpoint. I guess the issue still persists.

danielrbradley commented 10 months ago

@mikocot it looks like we've not received a way to reliably reproduce the issue, which will hamper efforts to find a fix.

From the original conversation here, it appears that this error might simply have been caused by a delete taking too long: the Pulumi deployment timed out, and the next deployment then failed because the previous deletion was still in progress. It's impossible to be sure without a repro, though.

If you have a way to reproduce a similar issue for another resource, I'd suggest opening that as a new issue.

thomas11 commented 10 months ago

Hi everyone, I gave reproducing this issue another try.

I wrote an Automation API program that creates N stacks in parallel. Each one has a CosmosDB account with a database in it, plus a Cosmos MongoDB account with a database in it.

I ran with N up to 30 and in different Azure regions. My results were pretty consistent. It always succeeded. With N=10 it took around 10 minutes total (wall clock) time, with N=30 around 37 minutes.
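
For anyone who wants to retry this, a rough TypeScript sketch of such an Automation API driver (project names and the inline program are simplified and illustrative, not the exact test that was run):

```typescript
import * as auto from "@pulumi/pulumi/automation";
import * as azureNative from "@pulumi/azure-native";

const N = 10; // number of parallel stacks

// Inline program: one Cosmos DB account with a SQL database in it.
const program = async () => {
    const rg = new azureNative.resources.ResourceGroup("cosmos-repro-rg");
    const account = new azureNative.documentdb.DatabaseAccount("cosmos-repro-acct", {
        resourceGroupName: rg.name,
        databaseAccountOfferType: "Standard",
        locations: [{ locationName: rg.location, failoverPriority: 0 }],
    });
    new azureNative.documentdb.SqlResourceSqlDatabase("cosmos-repro-db", {
        resourceGroupName: rg.name,
        accountName: account.name,
        resource: { id: "appdb" },
    });
};

async function run(i: number) {
    const stack = await auto.LocalWorkspace.createOrSelectStack({
        stackName: `cosmos-repro-${i}`,
        projectName: "cosmos-repro",
        program,
    });
    await stack.up({ onOutput: console.log });
    await stack.destroy({ onOutput: console.log }); // the PreconditionFailed error would surface here
}

// Run N stacks concurrently and surface any failure.
Promise.all(Array.from({ length: N }, (_, i) => run(i))).catch((err) => {
    console.error(err);
    process.exit(1);
});
```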

mikocot commented 9 months ago

@thomas11 @danielrbradley we don't have a reliable way to reproduce it, but my guess is that the Cosmos DB needs to be in some kind of use for the lock to occur. Anyway, for the moment we don't have any more details, but we also don't see it often.

caboog commented 7 months ago

Hi all. We are closing this issue as there is no reliable way to reproduce it. If you find a way to repro this, please open a new issue with those steps.

Thanks.

justinmchase commented 7 months ago

Normally you would leave the issue open until it's fixed. People will come here and keep adding more context until a reliable reproduction can be found.

danielrbradley commented 6 months ago

@justinmchase agreed, we'll leave this open for visibility.

If anyone who's experienced this can let us know if it's been fixed upstream, we'll then close this to clear the issue backlog.

serpentfabric commented 2 months ago

> @justinmchase agreed, we'll leave this open for visibility.
>
> If anyone who's experienced this can let us know if it's been fixed upstream, we'll then close this to clear the issue backlog.

we just ran into it again yesterday... hth...

jirikopecky commented 2 weeks ago

Hello, this recently started happening with pretty much every Cosmos DB destroy; a fix would be much appreciated.

thomas11 commented 2 weeks ago

Hi @jirikopecky, if it happens that reliably for you, it would be very helpful if you could capture verbose logs. They will contain some data like your subscription ID, though, so you might want to redact it or filter the log down to the HTTP requests and responses to/from Azure.

jirikopecky commented 2 weeks ago

We use GitHub Actions, and to my knowledge capturing verbose logs there isn't supported - https://github.com/pulumi/actions/issues/589

But to sum things up: