pulumi / pulumi-azure-native

Azure Native Provider
Apache License 2.0

Provider crashes when handling several stack updates #547

Open yarinm opened 3 years ago

yarinm commented 3 years ago

We're running several stacks in parallel (~10) and they occasionally crash:

Diagnostics:
  azure-nextgen:storage/latest:BlobContainer (disk-blob-container3):
    error: transport is closing

  azure-nextgen:storage/latest:BlobContainer (disk-blob-container6):
    error: connection closed

  pulumi:pulumi:Stack (wiz-diskanalyzer-prod_azure_wiz-managed_francecentral_infra):
    error: update failed

  azure-nextgen:storage/latest:BlobContainer (disk-blob-container4):
    error: transport is closing

  azure-nextgen:storage/latest:BlobContainer (disk-blob-container5):
    error: transport is closing

This causes the stack to fail partway through creating some of the resources, without saving them to state. I have to fix the stack manually before running again.

We're invoking the stacks through the Automation API. We update AWS and GCP stacks under the same load and never see these issues there.

lblackstone commented 3 years ago

My initial thought is that this could be related to memory consumption. The azure-nextgen provider in particular has high memory requirements, so perhaps something is OOM'ing. Do you see the same problem running at lower parallelism?

yarinm commented 3 years ago

@lblackstone I just checked and it seems it also happens on single stack updates! I had several failures in the last few days where only one Azure stack was being updated at the time.

I recently added the creation of several blob storage accounts in a single stack - could that be the cause? I'm limiting my Pulumi pods to 4GB of memory, and it doesn't seem we're hitting that limit since the pod itself isn't crashing.

So I suspect there's a different issue here.

I'm using version v0.2.8 as newer versions don't compile...

lblackstone commented 3 years ago

@mikhailshilkov Any ideas?

viveklak commented 3 years ago

@yarinm Sounds like you are using Go? Any chance you can capture some more logs and share them? A single stack reproducing the failure would be ideal. If you are using the Automation API you can use DebugLogging to do so: https://pkg.go.dev/github.com/pulumi/pulumi/sdk/v2@v2.21.2/go/x/auto/optup#Option. I'd specifically recommend setting LogLevel to 9 and FlowToPlugins to true so we get the provider logs as well: https://pkg.go.dev/github.com/pulumi/pulumi/sdk/v2@v2.21.2/go/x/auto/debug#LoggingOptions
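
For illustration, a minimal sketch of wiring this up with the v2 Go Automation API; the stack name and work directory are placeholders, and the option field names follow the debug.LoggingOptions docs linked above:

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/pulumi/pulumi/sdk/v2/go/x/auto"
    "github.com/pulumi/pulumi/sdk/v2/go/x/auto/debug"
    "github.com/pulumi/pulumi/sdk/v2/go/x/auto/optup"
)

func main() {
    ctx := context.Background()

    // Select an existing stack; "dev" and the work dir are placeholders.
    stack, err := auto.SelectStackLocalSource(ctx, "dev", "/path/to/project")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Verbose engine logging (-v=9), flowed to the resource providers as well.
    logLevel := uint(9)
    if _, err := stack.Up(ctx, optup.DebugLogging(debug.LoggingOptions{
        LogLevel:      &logLevel,
        FlowToPlugins: true,
    })); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}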

yarinm commented 3 years ago

@viveklak I tried adding the debug logs as you suggested, but I'm getting this message in the logs:

Diagnostics:
  pulumi:pulumi:Stack (wiz-diskanalyzer-dev_azure_wiz-managed_northeurope_infra):
    flag provided but not defined: -v
    Usage of tf-provider-flags:
      -get-provider-info
                dump provider info as JSON to stdout
      -version

I'm guessing this is because I'm using an older version of the provider? I can't upgrade at the moment because the last few versions of the azure-nextgen SDK don't compile in Go.

I'm using Pulumi 2.20.0.

Edit: this actually happens with other providers as well (e.g. AWS)

mikhailshilkov commented 3 years ago

This error has nothing to do with the version of the provider. Are you able to set the -v=9 flag?
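
For reference, when running the CLI directly the equivalent logging setup would be something like the following; these are standard Pulumi CLI logging flags, but verify them against pulumi up --help for your version:

    pulumi up -v=9 --logflow --logtostderr 2> pulumi-debug.log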

mikhailshilkov commented 3 years ago

Oh, sorry, my bad, you already set it.

viveklak commented 3 years ago

@yarinm It seems the -v flag is being passed through to the provider. I will file a bug separately to fix that, but you should be able to ignore it and also set the LogToStdErr boolean flag to get logs.

yarinm commented 3 years ago

I need the logs to go through the normal path because our logging agent has to pick them up properly (the stdout/stderr output is not well formatted).

yarinm commented 3 years ago

@mikhailshilkov

Another interesting issue I'm seeing - blob containers always have a diff on optional arguments:

 disk-blob-container updated [diff: +defaultEncryptionScope,denyEncryptionScopeOverride]

My container args are defined as follows:

&storage.BlobContainerArgs{
    AccountName:       account.Name,
    ContainerName:     pulumi.String("myname"),
    PublicAccess:      pulumi.StringPtr("None"),
    ResourceGroupName: rg.Name,
}

AFAIK these optional arguments shouldn't show up as a diff.
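
As a possible stopgap while the underlying behavior is investigated, the persistent diff on those server-defaulted properties could be suppressed with the ignoreChanges resource option. A minimal sketch, assuming the snippet above lives inside the stack program's pulumi.Run function and that account and rg are the storage account and resource group from that snippet; the two property names are taken from the diff output above:

ctr, err := storage.NewBlobContainer(ctx, "disk-blob-container", &storage.BlobContainerArgs{
    AccountName:       account.Name,
    ContainerName:     pulumi.String("myname"),
    PublicAccess:      pulumi.StringPtr("None"),
    ResourceGroupName: rg.Name,
}, pulumi.IgnoreChanges([]string{"defaultEncryptionScope", "denyEncryptionScopeOverride"}))
if err != nil {
    return err
}
_ = ctr

This only hides the diff on those two properties; it does not explain why refresh reports them in the first place.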

mikhailshilkov commented 3 years ago

@yarinm Do you run refresh between an up and diff?

yarinm commented 3 years ago

@mikhailshilkov yes, always

viveklak commented 3 years ago

Issue tracked here: https://github.com/pulumi/pulumi/issues/6451

Improving debug logging support across APIs tracked here: https://github.com/pulumi/pulumi/issues/5855

yarinm commented 3 years ago

@mikhailshilkov @EvanBoyle @lblackstone I have to say this plugin doesn't feel production-grade compared to other Pulumi plugins.

These crashes are causing us SO MANY issues, and we have to manually fix dozens of Azure stacks. In AWS / GCP we have ZERO issues; in Azure we hit them ALL THE TIME.

We also occasionally see CPU / memory spikes from the process, which get our pod killed. Is there a way you can help us debug this? Can we prioritize pulumi/pulumi#6451 so we can debug this using the Automation API?

viveklak commented 3 years ago

I caught up with @yarinm offline and he provided a few stack traces that fairly consistently point to memory pressure within the pod causing Pulumi processes (ranging from the CLI to the provider) to crash during memory allocations. The issue is likely exacerbated by the azure-native provider's baseline memory footprint being roughly 5x higher than other providers' (e.g. ~280 MB for azure-native vs. 40-50 MB for azure classic) and by concurrent runs on stacks. Since the Automation API launches the CLI, the language runtime, and providers for each stack, memory usage can spike suddenly. A short-term improvement here would be to limit the concurrency on each pod.

A follow-up action from here is to prioritize reducing memory consumption (see https://github.com/pulumi/pulumi-azure-native/issues/603).
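
As an illustration of the short-term mitigation mentioned above, here is a minimal sketch of bounding how many stack updates run concurrently on a pod with a semaphore; the stack names and work directory are placeholders:

package main

import (
    "context"
    "log"
    "sync"

    "github.com/pulumi/pulumi/sdk/v2/go/x/auto"
)

func main() {
    ctx := context.Background()
    stacks := []string{"stack-a", "stack-b", "stack-c"} // placeholder stack names

    sem := make(chan struct{}, 2) // allow at most 2 concurrent updates per pod
    var wg sync.WaitGroup

    for _, name := range stacks {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it

            s, err := auto.SelectStackLocalSource(ctx, name, "/path/to/project")
            if err != nil {
                log.Printf("%s: %v", name, err)
                return
            }
            if _, err := s.Up(ctx); err != nil {
                log.Printf("%s: update failed: %v", name, err)
            }
        }(name)
    }
    wg.Wait()
}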