yarinm opened 3 years ago
My initial thought is that this could be related to memory consumption. The azure-nextgen provider in particular has high memory requirements, so perhaps something is OOM'ing. Do you see the same problem running at lower parallelism?
@lblackstone I just checked, and it seems it also happens on single-stack updates! I've had several failures in the last few days where only one Azure stack was being updated at the time.
I recently added the creation of several blob storage accounts in a single stack; could that be the cause? I'm limiting my Pulumi pods to 4GB of memory, and it doesn't seem we're hitting that limit, since the pod isn't crashing.
So I suspect there's a different issue here.
I'm using version v0.2.8, as newer versions are not compiling...
@mikhailshilkov Any ideas?
@yarinm Sounds like you are using Go? Any chance you can get some more logs and share them? A single stack resulting in the failure would be ideal.
If you are using the Automation API, you can leverage `DebugLogging` to do so: https://pkg.go.dev/github.com/pulumi/pulumi/sdk/v2@v2.21.2/go/x/auto/optup#Option. I specifically recommend setting `LogLevel` to 9 and `FlowToPlugins` to `true` so we get the provider logs as well: https://pkg.go.dev/github.com/pulumi/pulumi/sdk/v2@v2.21.2/go/x/auto/debug#LoggingOptions
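For reference, a minimal sketch of wiring these options up with the Go Automation API. The stack name `"dev"` and working directory `"."` are placeholders, and this assumes the pulumi/sdk/v2 module is available:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/pulumi/pulumi/sdk/v2/go/x/auto"
	"github.com/pulumi/pulumi/sdk/v2/go/x/auto/debug"
	"github.com/pulumi/pulumi/sdk/v2/go/x/auto/optup"
)

func main() {
	ctx := context.Background()

	// "dev" and "." are placeholders for your stack name and project dir.
	stack, err := auto.SelectStackLocalSource(ctx, "dev", ".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// -v=9 engine logging; FlowToPlugins forwards the verbosity to
	// resource providers so their logs are captured too.
	level := uint(9)
	if _, err := stack.Up(ctx, optup.DebugLogging(debug.LoggingOptions{
		LogLevel:      &level,
		FlowToPlugins: true,
	})); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Note that level 9 is extremely verbose; it's meant for capturing a single failing update, not for steady-state operation.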
@viveklak I tried adding the debug logs as you suggested, but I'm getting this message in the logs:

```
Diagnostics:
  pulumi:pulumi:Stack (wiz-diskanalyzer-dev_azure_wiz-managed_northeurope_infra):
    flag provided but not defined: -v
    Usage of tf-provider-flags:
      -get-provider-info
            dump provider info as JSON to stdout
      -version
```
I'm guessing this is because I'm using an older version of the provider? I can't upgrade at the moment because the last few versions of the azure-nextgen SDKs aren't compiling in Go...
I'm using Pulumi 2.20.0.
Edit: this actually happens with other providers as well (e.g. aws).
This error has nothing to do with the version of the provider. Are you able to set the `-v=9` flag?
Oh, sorry, my bad; you already set it.
@yarinm It seems it's being passed through to the provider. I will file a bug separately to fix that, but you should be able to ignore it; also set the `LogToStdErr` boolean flag to get logs.
I need the logs to go through the normal path, because our logging agent needs to pull them properly (stdout/stderr output is not formatted well).
@mikhailshilkov
Another interesting issue I'm seeing: blob containers always have a diff on optional arguments:

```
disk-blob-container updated [diff: +defaultEncryptionScope,denyEncryptionScopeOverride]
```
My container args are defined as follows:

```go
&storage.BlobContainerArgs{
	AccountName:       account.Name,
	ContainerName:     pulumi.String("myname"),
	PublicAccess:      pulumi.StringPtr("None"),
	ResourceGroupName: rg.Name,
}
```
AFAIK it shouldn't see this as a diff
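If the diff keeps reappearing on these server-populated defaults, one possible stopgap (a workaround for the symptom, not a fix for the provider behavior) is the `IgnoreChanges` resource option. A sketch based on the args above; the resource name is a placeholder:

```go
// Workaround sketch: tell the engine to ignore diffs on the two
// server-populated properties that keep showing up in the plan.
container, err := storage.NewBlobContainer(ctx, "disk-blob-container",
	&storage.BlobContainerArgs{
		AccountName:       account.Name,
		ContainerName:     pulumi.String("myname"),
		PublicAccess:      pulumi.StringPtr("None"),
		ResourceGroupName: rg.Name,
	},
	pulumi.IgnoreChanges([]string{
		"defaultEncryptionScope",
		"denyEncryptionScopeOverride",
	}),
)
```

This only suppresses the noise in plans; if you ever need to manage those properties intentionally, the option has to be removed again.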
@yarinm Do you run `refresh` between an `up` and `diff`?
@mikhailshilkov yes, always
Issue tracked here: https://github.com/pulumi/pulumi/issues/6451
Improving debug logging support across APIs tracked here: https://github.com/pulumi/pulumi/issues/5855
@mikhailshilkov @EvanBoyle @lblackstone I have to say this plugin doesn't feel production-grade compared to other Pulumi plugins.
These crashes are causing us SO MANY issues, and we have to manually fix dozens of Azure stacks. In AWS / GCP we have ZERO issues; in Azure we have them ALL THE TIME.
We also occasionally get CPU / memory spikes from the process that get our pod killed. Is there a way you can help us debug this? Can we prioritize pulumi/pulumi#6451 so we can debug this using the Automation API?
I caught up with @yarinm offline, and he has provided a few stack traces that fairly consistently point to memory pressure within the pod causing Pulumi processes (ranging from the CLI to the provider) to crash during memory allocations. The issue may be exacerbated by the azure-native provider's baseline memory footprint being roughly 5x higher than other providers' (e.g. ~280 MB for azure-native vs. 40-50 MB for azure classic), combined with concurrent runs on stacks. Since the Automation API launches a CLI, a language runtime, and providers for each stack, memory usage can spike suddenly. A short-term improvement would be to limit the concurrency on each pod.
A follow-up action is to prioritize reducing memory consumption (see https://github.com/pulumi/pulumi-azure-native/issues/603).
We're running several stacks in parallel (~10), and they occasionally crash:
This causes the stack to fail while creating some of the resources without saving them to state. I need to fix the stack manually before running again.
We're invoking the stacks using the Automation API; we update AWS and GCP stacks under the same load and never see these issues there.