Open arunstiwari opened 3 years ago
Hi! Thanks for reporting this. I believe that you're having a real issue here and I'd like to reproduce it locally. To do that, I have to be able to run this and run it on my workstation without inventing any details in order to be confident we're seeing the same behavior. As-is, I have tried some simple create/destroys on my M1 Macbook running Big Sur and it worked, and so I'm stuck - I need to understand your specific use case.
Can you please restate your reproduction case such that I can copy-paste it and run it locally? Ideally, this would use the null resource provider rather than a real provider in order to minimize external dependencies.
Also, can you explain what hardware you're using? Is this on an M1 ARM platform or an X86 mac?
I've had a similar issue on my M1 on Big Sur creating a few basic AWS resources. It isn't 100% reproducible yet, sometimes the apply/destroy works, sometimes it does not. I've not seen it on just the null provider yet, but I have seen it with just an aws vpc resource. It may be more than one issue, or a single root cause, I'm not sure. I've seen:
destroy fail with an error: destroy.log
apply waiting in a (seemingly) infinite loop: apply.log (killed proc and it still kept looping)
destroy failing with rpc error: rpc-destroy.log
TF Version: 0.14.2
TF:
resource "aws_vpc" "main" {
  cidr_block = "10.10.0.0/16"
}

provider "aws" {
  region = "eu-west-2"
}
Let me know if there is anything more I can provide / do to help.
I am experiencing the same on my M1 MBP. Some runs are lightning fast. And some runs hang.
I believe this has to do with M1's rosetta2 translating the x86-64 plugin. When a run hangs, the process takes 100-140% CPU.
Since investigation requires reproducibility, I made this crude tester:
$ for x in {1..100}; do echo "# $x"; .terraform/providers/registry.terraform.io/hashicorp/aws/3.22.0/darwin_amd64/terraform-provider-aws_v3.22.0_x5 ;done
With this I get reliable hangs. Happening at a random iteration, but they do happen. Following output had a quick hang:
# 1
This binary is a plugin. These are not meant to be executed directly.
Please execute the program that consumes these plugins, which will
load any plugins automatically
# 2
This binary is a plugin. These are not meant to be executed directly.
Please execute the program that consumes these plugins, which will
load any plugins automatically
# 3
This binary is a plugin. These are not meant to be executed directly.
Please execute the program that consumes these plugins, which will
load any plugins automatically
# 4
assertion failed [inst.has_value()]: failed to decode instruction: 0x0
(StateRecovery.cpp:336 determine_state_recovery_action_forward_branches) <-- hangs a long time (10+ minutes?)
assertion failed [inst.has_value()]: failed to decode instruction: 0x0
(StateRecovery.cpp:336 determine_state_recovery_action_forward_branches) <-- again hangs
I have sampled the hanging process. The output is at https://gist.github.com/rtoma/0da8efd5c9204345fb82d72b68100cdc for someone able to interpret it.
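A variation on the crude tester above, as a rough sketch: it puts a time limit on each run so that a hang is killed and counted instead of blocking the loop. The function names `run_with_limit` and `count_hangs` are my own illustrative choices, and the check for exit status 137 assumes the shell reports a child killed by SIGKILL as 128+9.

```shell
# Run a command in the background and kill it with SIGKILL if it is still
# running after LIMIT seconds. Returns the command's own exit status, which
# is assumed to be 137 (128+9) when the watchdog had to kill it.
run_with_limit() {
  local limit=$1 rc=0
  shift
  "$@" &
  local cmd_pid=$!
  # Watchdog: sleeps, then kills the command. Its output is discarded so it
  # never holds a pipe open when the caller captures output.
  ( sleep "$limit"; kill -9 "$cmd_pid" ) >/dev/null 2>&1 &
  local watchdog_pid=$!
  wait "$cmd_pid" || rc=$?
  # The command finished (or was killed); stop the watchdog if it is still waiting.
  kill "$watchdog_pid" 2>/dev/null || true
  wait "$watchdog_pid" 2>/dev/null || true
  return "$rc"
}

# Run a command RUNS times with a LIMIT-second cap on each run and print
# how many runs had to be killed.
count_hangs() {
  local runs=$1 limit=$2 hangs=0 i=1 rc
  shift 2
  while [ "$i" -le "$runs" ]; do
    rc=0
    run_with_limit "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 137 ]; then hangs=$((hangs + 1)); fi
    i=$((i + 1))
  done
  echo "$hangs"
}
```

For example, `count_hangs 100 30 .terraform/providers/registry.terraform.io/hashicorp/aws/3.22.0/darwin_amd64/terraform-provider-aws_v3.22.0_x5` would print the number of iterations out of 100 that ran longer than 30 seconds. The 30-second limit is arbitrary; a healthy run of the plugin binary returns almost immediately.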
Searching for StateRecovery, I found https://github.com/golang/go/issues/42700 (via https://github.com/pulumi/pulumi/issues/5859) suggesting this workaround:
export GODEBUG=asyncpreemptoff=1
This stops the issue for me.
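If exporting the variable for the whole shell session feels too broad, it can also be applied per command. The following is only a sketch: `godebug_with_asyncpreemptoff` and `with_asyncpreemptoff` are hypothetical helper names, and the only thing assumed about GODEBUG is that it is a comma-separated list of key=value settings, so any settings already present are kept.

```shell
# Print a GODEBUG value that contains asyncpreemptoff=1, keeping whatever
# settings were passed in as the first argument.
godebug_with_asyncpreemptoff() {
  case ",${1:-}," in
    *,asyncpreemptoff=1,*) printf '%s\n' "$1" ;;                   # already set
    ,,)                    printf '%s\n' "asyncpreemptoff=1" ;;    # nothing set yet
    *)                     printf '%s\n' "$1,asyncpreemptoff=1" ;; # append to existing
  esac
}

# Run a single external command with the workaround in its environment,
# without changing GODEBUG in the calling shell.
with_asyncpreemptoff() {
  GODEBUG=$(godebug_with_asyncpreemptoff "${GODEBUG:-}") "$@"
}
```

Usage would look like `with_asyncpreemptoff terraform apply`. Providers are started as child processes of Terraform, so they should see the same environment variable.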
I don't hit the issue every time. Most runs are fine, but in about 20% of cases the same script fails. My hardware is a MacBook Air (M1, 2020): Apple M1 chip, 16 GB RAM, 1 TB SSD.
Searching for StateRecovery I found golang/go#42700 (via pulumi/pulumi#5859) suggesting workaround: export GODEBUG=asyncpreemptoff=1
This stops the issue for me.
Thanks. With this setting it has been working for me; I have not seen the hang since setting the value you suggested.
Can confirm this also solved the hang during terraform apply for me on an M1 chip.
@danieldreier So we diagnosed this 15 months ago. Why does this remain a problem with Terraform? Why haven't the arm64 binaries been fixed?
We see it constantly with Terraform 1.x versions up through 1.1.8 (latest as of this comment) on Big Sur or Monterey.
Per https://github.com/hashicorp/terraform/issues/27350#issuecomment-751268276, this seems to be an issue with Rosetta2 and providers for Terraform that have not been compiled for darwin_arm64. https://github.com/hashicorp/terraform/issues/27350#issuecomment-751430699 has a work-around, and I would also recommend checking out this Discuss thread for more on M1 + terraform (this is more about tfenv than providers, though). To date, this appears to be an issue with Rosetta2, rather than an issue with terraform itself.
@jorhett can you confirm you are seeing this issue with arm64 terraform and arm64 providers? If so, that would be a new issue as I am understanding this thread.
That environment variable to override some behavior in the Go runtime does seem to be a suggested workaround for a number of different Go runtime bugs, some of which have already been fixed and some are still under investigation. Here are some I'm aware of:
These and others seem to trace back to golang/go#24543, which is a runtime goroutine preemption technique which on Unix systems like macOS is implemented by asking the OS to interrupt the program using signals. Signals can potentially interrupt active system calls, which must then be handled by any code that was waiting for those system calls to complete, and it seems that at least some of these problems have a root cause that some other part of the Go runtime is not properly handling being interrupted in that way.
(I focused above on listing ones that seemed to be related to macOS here, but I note that there are similar problems on other platforms which include this signal-based preemption scheme.)
Emulation layers such as Rosetta add extra tricky interactions to a signal-based scheme like this, because presumably Rosetta is the one initially receiving the signal, translating it to make sense in the emulated amd64 context, and delivering it to the Go runtime. Although about QEMU rather than Rosetta, golang/go#36981 is an example of such a problem which arises only in the case of user-mode emulation where there's an extra level of indirection between the OS and the program.
This GODEBUG=asyncpreemptoff=1 environment variable tells the Go runtime to return to the original co-operative preemption scheme where a busy CPU-intensive loop can potentially block other concurrent code from running altogether, but has the advantage of not relying on OS signals.
With all of this said, it seems like the best we can do here is keep updating to newer versions of the Go runtime as they become available. Given how many different ways these problems seem to be appearing in Go, I don't know that we'll ever be able to confidently say that all problems with this root cause have been addressed, but if we keep this issue focused primarily on the situation of running amd64 Terraform under Rosetta emulation on M1 Macs as @crw proposed then at least we have a more constrained problem space to focus on.
The darwin_amd64 port of Terraform running under Rosetta emulation isn't an officially supported platform, so if you are using Terraform in that way then I would suggest planning to migrate to using the native darwin_arm64 port instead. That may require you to upgrade certain providers or stop using certain obsolete providers which do not have Apple Silicon ports available, so I understand it's not a "quick fix" and you may need to keep using Rosetta emulation in the meantime, but userspace instruction set emulation like this is always a tricky problem and I expect Apple will eventually reach a point where they perceive diminishing returns and stop addressing newly-discovered quirks once there's an Apple Silicon-native port available of most commonly-used software.
@jorhett can you confirm you are seeing this issue with arm64 terraform and arm64 providers?
Sorry, I found that tfenv's confused handling left us with intermingled amd64/arm64 versions in different repos. Upgrading to 2.2.3 and removing all versions solved it. Ignore my comments about it failing on arm64.
Still the case on macOS Monterey 12.4
$ terraform --version
Terraform v1.1.6
on darwin_amd64
+ provider registry.terraform.io/hashicorp/aws v3.73.0
Your version of Terraform is out of date! The latest version
is 1.2.4. You can update by downloading from https://www.terraform.io/downloads.html
Using darwin_amd64 instead of arm64 with rosetta :(
Sporadically spewing errors:
$ terraform apply
╷
│ Error: Unrecognized remote plugin message:
│
│ This usually means that the plugin is either invalid or simply
│ needs to be recompiled to support the latest protocol.
│
│
╵
$ terraform plan
╷
│ Error: Plugin did not respond
│
│ with provider["registry.terraform.io/hashicorp/aws"].test,
│ on providers.tf line 37, in provider "aws":
│ 37: provider "aws" {
│
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ValidateProviderConfig call.
The plugin logs may contain more details.
╵
It's happening to me intermittently too:
macOS Monterey 12.4, M1 Pro
terraform --version
Terraform v1.1.7
on darwin_amd64
+ provider registry.terraform.io/hashicorp/aws v4.6.0
Your version of Terraform is out of date! The latest version
is 1.2.5. You can update by downloading from https://www.terraform.io/downloads.html
│ Error: Plugin did not respond
│
│ with provider["registry.terraform.io/hashicorp/aws"],
│ on main.tf line 9, in provider "aws":
│ 9: provider "aws" {
│
│ The plugin encountered an error, and failed to respond to the plugin.(*GRPCProvider).ValidateProviderConfig call. The plugin logs may contain more details.
These two recent messages seem to suggest that it was the provider itself rather than Terraform CLI which crashed, although providers are subject to the same constraints Terraform CLI is: we build the darwin_amd64 packages for use on real amd64 processors and the darwin_arm64 packages for use on M1 processors. Although you may have some success running these packages in the Rosetta emulator, we cannot explicitly support using Terraform in that way.
The best answer is to use the executables that were built for the actual platform you are using.
If you cannot do that for some reason, discussion above suggests that you can tell the Go runtime to use its old scheduling technique by setting the environment variable GODEBUG=asyncpreemptoff=1, which some have reported will avoid hitting some bugs in Rosetta that can make Terraform or its providers fail when running in that context.
However, this is only a workaround and we do not intend to invest in ensuring that the darwin_amd64 packages are fully compatible with all of Rosetta's bugs and quirks, since the opportunity cost is too high. Anyone currently depending on the darwin_amd64 builds of Terraform on an M1 Mac should make plans to migrate to the darwin_arm64 builds at the earliest opportunity, even if the workaround above is working for you in the interim.
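To see which builds are actually in use, something along these lines may help. It is a sketch, not an official tool: the helper names `terraform_platform` and `provider_platforms` are made up for illustration, the first relies on the "on darwin_amd64" line that terraform --version prints in the reports above, and the second assumes the .terraform/providers/<host>/<namespace>/<type>/<version>/<platform> layout visible in the tester command earlier in this thread.

```shell
# Read `terraform version` output on stdin and print the platform it reports,
# taken from a line of the form "on darwin_amd64".
terraform_platform() {
  sed -n 's/^on \([a-z0-9]*_[a-z0-9]*\)$/\1/p' | head -n 1
}

# Print the distinct platform directory names of installed providers.
# Assumed layout: <providers dir>/<host>/<namespace>/<type>/<version>/<platform>
provider_platforms() {
  find "${1:-.terraform/providers}" -mindepth 5 -maxdepth 5 2>/dev/null |
    sed 's|.*/||' | sort -u
}
```

Running `terraform version | terraform_platform` and `provider_platforms` from a working directory on an M1 Mac should print only darwin_arm64 if everything is native; any darwin_amd64 entry means that binary runs under Rosetta.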
Terraform Version
Terraform v0.14.2
Debug Output
Crash Output
Expected Behavior
terraform destroy should have destroyed the created infrastructure.
Actual Behavior
When I ran terraform destroy in the terminal, the command hung for a long time. I then tried to exit the shell with Ctrl+C, but it did not exit.
Steps to Reproduce
terraform destroy
Additional Context
I was running this command on macOS Big Sur