hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/
Other
42.71k stars 9.55k forks source link

TrendMicro system call interception causes plugin handshake to hang for provider plugins built with Go 1.14 for Windows systems #25133

Open jf781 opened 4 years ago

jf781 commented 4 years ago

Opening an issue here as requested in here

Getting an issue on some Windows 10 workstations where they get the error when trying to run terraform.

Error: Failed to instantiate provider "azurerm" to obtain schema: timeout while waiting for plugin to start

This does not occur when running the same code on Mac or Linux.

Terraform Version

0.12.24 Tried azurerm providers 2.7.0 - through 2.12.0 with same results. Code gives error on 2.6.0 provide because we are using "address_prefixes" in the Azurerm_subnet resource.

Terraform Configuration Files

Happening with all modules. The same modules run fine on Linux and Mac OS.

Debug Output

Do not have at this time.

Crash Output

Do not have at this time.

Expected Behavior

Terraform plan would run normally

Actual Behavior

Receive the error.

Error: Failed to instantiate provider "azurerm" to obtain schema: timeout while waiting for plugin to start

Steps to Reproduce

Run terraform plan

Additional Context

The issue can be hit or miss. It is happening on some workstations but not on all.

References

https://github.com/terraform-providers/terraform-provider-azurerm/issues/6690

apparentlymart commented 4 years ago

Hi @jf781,

Unfortunately without the full debug logs there's quite a large space of possibilities for what might cause this behavior, but the one we've encountered most commonly in the past is when folks are running Terraform on Windows machines with software firewalls installed.

Terraform provider plugins are child processes which Terraform Core interacts with via sockets on the local loopback interface, and some software firewalls prevent packets from reaching a local listen socket, either blocking it altogether or requiring a manual approval to do so.

Do you have any firewall software on your Windows system that might be blocking requests like these?

drdamour commented 4 years ago

@apparentlymart did u see the original ticket this just started happening with a specific version of azurerm provider. It def looks firewall related

drdamour commented 4 years ago

Are there details about which ports the providers are trying to open? Its all gRPC still yes? I saw some old posts about a env cookie to pass to try to see the ports but didnt seem to work.

apparentlymart commented 4 years ago

Hi @drdamour,

By default the plugin processes ask the OS to allocate a listen port automatically, so there is no predictable port number. However, it looks like the underlying library has environment variables PLUGIN_MIN_PORT and PLUGIN_MAX_PORT which you can set together to constrain it to try to use a particular range of ports.

If you set them both to the same value then I think that would force it to try that specific port number, assuming it's available, in case having that constrained is helpful for debugging. (Though if your configuration calls for multiple plugins you'll need at least enough distinct port numbers in the range for them all to listen at once, so if you're using this debug strategy it'd be best to do it with a very minimal configuration that only uses one plugin.)

danieldreier commented 4 years ago

Hello!

I've tried to reproduce this issue, and confirmed for myself that on a Windows 10 VM with Terraform 0.12.26, I'm able to use Terraform and the AzureRM provider to create and destroy a resource group.

We exclusively use GitHub issues for tracking bugs and enhancements, rather than for questions. While we can sometimes help with certain simple problems here, it's better to use the community forum where there are more people ready to help.

Based on the information you've provided, it looks like this doesn't represent a specific bug or feature request, even though I am empathetic to what sounds like a very difficult troubleshooting situation, and so I'm going to close it. I hope that you and @apparentlymart can continue troubleshooting in the community forum - please open a thread there and feel free to post that back here after it's closed to link them together. I don't want to shut down this conversation.

drdamour commented 4 years ago

@danieldreier there are multiple people experiencing this issue and only with a recent version of azurerm i think this should remain open

drdamour commented 4 years ago

It never happens with azure rm 2.6 and always happens with later versions

danieldreier commented 4 years ago

Sure! I'll re-open it.

sai-ns commented 4 years ago

@apparentlymart I have uploaded 2 trace logs to https://github.com/sai-ns/terraformlogs if that is what you are waiting for. terraform plan works like 1 in 100 times but fails when I run terraform apply. Just like everyone else, I started seeing this issues since azurerm 2.7. I was waiting for few bugs to be fixed in 2.7 and 2.8 and when I am ready to test it, terraform itself starting throwing these errors.

sai-ns commented 4 years ago

Terraform team, any update on this, please?

apparentlymart commented 4 years ago

The relevant part of the trace log output seems to be this part:

2020/06/19 17:05:11 [TRACE] terraform.NewContext: starting
2020/06/19 17:05:11 [TRACE] terraform.NewContext: resolving provider version selections
2020/06/19 17:05:12 [TRACE] terraform.NewContext: loading provider schemas
2020/06/19 17:05:12 [TRACE] LoadSchemas: retrieving schema for provider type "azurerm"
2020-06-19T17:05:12.119-0500 [INFO]  plugin: configuring client automatic mTLS
2020-06-19T17:05:12.156-0500 [DEBUG] plugin: starting plugin: path="C:\Users\Sai Nooli\.terraform\plugins\windows_amd64\terraform-provider-azurerm_v2.15.0_x5.exe" args=["C:\Users\Sai Nooli\gitrepos\kvtest\.terraform\plugins\windows_amd64\terraform-provider-azurerm_v2.15.0_x5.exe"]
2020-06-19T17:05:12.162-0500 [DEBUG] plugin: plugin started: path="C:\Users\Sai Nooli\.terraform\plugins\windows_amd64\terraform-provider-azurerm_v2.15.0_x5.exe" pid=1116
2020-06-19T17:05:12.162-0500 [DEBUG] plugin: waiting for RPC address: path="C:\Users\Sai Nooli\.terraform\plugins\windows_amd64\terraform-provider-azurerm_v2.15.0_x5.exe"
2020-06-19T17:05:12.298-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Advisor"..
2020-06-19T17:05:12.298-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Advisor"..
2020-06-19T17:05:12.298-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Analysis Services"..
2020-06-19T17:05:12.298-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Analysis Services"..
2020-06-19T17:05:12.299-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "API Management"..
2020-06-19T17:05:12.299-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "API Management"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "App Configuration"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "App Configuration"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "App Platform"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "App Platform"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Application Insights"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Application Insights"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Authorization"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Authorization"..
2020-06-19T17:05:12.301-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Automation"..
2020-06-19T17:05:12.302-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Automation"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Batch"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Batch"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Blueprints"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Blueprints"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "Bot"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "Bot"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Data Sources for "CDN"..
2020-06-19T17:05:12.303-0500 [DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: [DEBUG] Registering Resources for "CDN"..
2020/06/19 17:06:12 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info
2020/06/19 17:06:12 [TRACE] statemgr.Filesystem: unlocked by closing terraform.tfstate
2020-06-19T17:06:12.168-0500 [WARN]  plugin: plugin failed to exit gracefully
2020-06-19T17:06:12.175-0500 [DEBUG] plugin: plugin process exited: path="C:\Users\Sai Nooli\.terraform\plugins\windows_amd64\terraform-provider-azurerm_v2.15.0_x5.exe" pid=1116 error="exit status 1"

From this output we can see that the provider plugin was indeed launched correctly, because we can see it running and producing its own debug messages.

The next thing that's supposed to happen is the response to the log line plugin: waiting for RPC address. If this were working correctly then the next log lines should look something this:

[INFO]  plugin.terraform-provider-azurerm_v2.15.0_x5.exe configuring server automatic mTLS: timestamp=2020-07-07T16:47:59.771-0700
[DEBUG] plugin: using plugin: version=5
[DEBUG] plugin.terraform-provider-azurerm_v2.15.0_x5.exe: plugin address: address=127.0.0.1:45345 network=tcp timestamp=2020-07-07T16:47:59.804-0700

The fact that we are not seeing messages like these seems to suggest that the provider plugin is not initializing correctly. Normally when we've seen behavior like this caused by firewall problems we'd see messages like the above and then Terraform would fail trying to connect to (for example) 127.0.0.1:45345, but this situation seems different in that the provider doesn't seem to even be completing the plugin handshake.

I'm not sure what exactly to investigate next, but hopefully we can identify something that changed in the provider between versions 2.6 and 2.7 that would give a clue as to why the provider started behaving differently. The fact that relatively few people seem to have encountered this problem (based on the number of comments and votes on this issue compared to the number of folks out there using Terraform on Windows) makes me suspect that this is still caused by something environment-specific, but clearly it's something a little different than the typical firewall interactions we've seen in the past.

apparentlymart commented 4 years ago

I used rsc.io/goversion to try to see what Go runtime versions and Go modules the 2.6.0 and 2.7.0 builds of the AzureRM provider.

Working with the official Windows builds of each I see the following:

Unfortunately it seems that goversions wasn't able to extract module version information from the Windows executables. I'm not sure why that is (I've not tried this tool with Windows executables before) but to get as much information as possible I compared the module versions from the Linux executables because I know that the official provider release process cross-compiles all of the executables from the same sources and thus the modules used in each should be the same:

--- 2.6.0
+++ 2.7.0
@@ -1,9 +1,10 @@
-2.6.0/terraform-provider-azurerm_v2.6.0_x5 go1.12.6
+2.7.0/terraform-provider-azurerm_v2.7.0_x5 unknown Go version
    path  github.com/terraform-providers/terraform-provider-azurerm
    mod   github.com/terraform-providers/terraform-provider-azurerm  (devel)
-   dep
    dep   cloud.google.com/go                                        v0.45.1
    dep   github.com/Azure/azure-sdk-for-go                          v41.2.0+incompatible
+   dep   github.com/Azure/go-autorest/autorest                      v0.10.0
+   =>    github.com/tombuildsstuff/go-autorest/autorest             v0.10.1-0.20200416184303-d4e299a3c04a
    dep   github.com/Azure/go-autorest/autorest/adal                 v0.8.2
    dep   github.com/Azure/go-autorest/autorest/azure/cli            v0.3.0
    dep   github.com/Azure/go-autorest/autorest/date                 v0.2.0
@@ -76,4 +77,4 @@
    dep   google.golang.org/api                                      v0.9.0
    dep   google.golang.org/genproto                                 v0.0.0-20190819201941-24fa4b261c55
    dep   google.golang.org/grpc                                     v1.23.0
-   dep   gopkg.in/yaml.v2                                           v2.2.2
+   dep   gopkg.in/yaml.v2                                           v2.2.4

None of the modules that changed seem like that should affect how the Terraform plugin handshake behaves, so I'm going to dismiss module changes as a possible cause for now and focus on the possibility that Go runtime changes might've affected the behavior somehow, because AzureRM provider 2.7 was built with a Go two minor versions later than AzureRM provider 2.6 and so there may have been notable behavior changes to the Go runtime between those two releases.


I see one change recorded in the Go 1.13 release notes which alludes to the possibility that system call behavior could've changed, though I don't know enough details to decide whether this is relevant to the behavior we're discussing here:

The Windows version specified by internally-linked Windows binaries is now Windows 7 rather than NT 4.0. This was already the minimum required version for Go, but can affect the behavior of system calls that have a backwards-compatibility mode. These will now behave as documented. Externally-linked binaries (any program using cgo) have always specified a more recent Windows version.

I don't see anything else in the 1.13 and 1.14 release notes that stands out as something that could affect the behavior of plugins running on Windows.

I have run out of time to work on this for today, and unfortunately we have no new leads. If someone else picks up this issue in future, a possible next direction could be to try to understand in more detail which backward-compatibility behaviors on Windows are controlled by the binaries specifying Windows 7 rather than NT 4.0 as the target version, to see if any of them seem like they might affect system calls that the Terraform plugin system would be using on Windows systems.

sai-ns commented 4 years ago

@apparentlymart thank you very much for your time and detailed explanation. I was earlier concentrating on troubleshooting from the network perspective but looks like it is something else. I agree this is environment specific since no one else in our org are seeing this behavior. I will try to troubleshoot from my side as well and keep you all posted.

drdamour commented 4 years ago

all the devs in our env who terraform (~50) are experiencing it..so it's def something to do with our draconian security setup. it's funny that there's no issue with WSL though...

sai-ns commented 4 years ago

Update- I recently started seeing issues with running GO(v1.14.4) in my laptop and based on @apparentlymart research, it does seem too be related. I've had my desktop support team uninstall/disable antivirus(trend micro in my case) from my machine and everything started working just fine. We are now trying to see what is being blocked and what can be done on the antivirus to have this working.

drdamour commented 4 years ago

ah we run trend too, so i bet that's the culprit...

apparentlymart commented 4 years ago

Thanks for that extra context, @sai-ns and @drdamour!

With the lead about Trend Micro I found golang/go#36492, where some folks are discussing a situation where the system call interception in Trend Micro causes a deadlock between two threads. The effect of that would be to hang the process, which seems to agree with the behavior we're seeing here in the Azure provider too.

If the Azure provider behavior we're seeing here has the same root cause as golang/go#36492 then unfortunately it seems to be deep inside the Go runtime and not something we on the Terraform team can directly influence, but it does seem like the participants in that issue are investigating possible approaches to avoid it and so hopefully there will be a fix available in a future version of Go. The investigation work on that issue might also yield some possible workarounds other than disabling Trend Micro altogether, but we'll need to wait and see where that discussion leads.

This behavior is in the Azure provider rather than in Terraform Core and so technically this issue would be better placed in the Azure provider repository (Terraform Core is not using Go 1.14 yet), though since it could presumably apply to any Go program built with Go 1.14 -- including all of the Terraform providers and Terraform Core itself -- I think we'll keep this open here for now to track the issue in the broad sense and then if we find more information later that suggests a specific solution that the Azure provider team could take we can consider opening an issue over in their repository once we'd have something more concrete to share in it.

Thanks again, all! If you do learn anything else in your investigations with Trend Micro, please share! (The Go team might also appreciate hearing about what you learn in that issue, though it's probably best to discuss here first so that we can help figure out which parts of the information are generic to all Go programs and which are specific to Terraform before sharing those details.)

drdamour commented 4 years ago

@apparentlymart well we DID open it there, and then they told us to open it here https://github.com/terraform-providers/terraform-provider-azurerm/issues/6690#issuecomment-638739832 by rules of first, it MUST be here ;)

but really, thanks a LOT for that investigation that's exactly the type of data i need to go back to my security team about trend!

blindzero commented 4 years ago

I can fully confirm that it is about TrendMicro. As soon as I disable this I don't have any issues anymore (I did first reported in terraform-providers/terraform-provider-azurerm#8466). I will also create a case at TrendMicro.

shreekant commented 3 years ago

Hi All, How about this issue on Microsoft hosted ubuntu agent? Anyone could fix this issue?

drdamour commented 3 years ago

@blindzero u have any luck with trend ticket?

blindzero commented 3 years ago

Yes and no, it is really hard to reproduce because not occuring all the time. It could be reproduced, did not occur at all when deactivating most TM features. But exclusions did not work because of missing variable exclusion

grv87 commented 3 years ago

I had the similar issue with GitLab provider, timeout while waiting for plugin to start. Antivirus is turned off. Logs say nothing about locks, looks just like the provider needs more time.

Default timeout is 1 minute.

Fixed it manually, adding here https://github.com/hashicorp/terraform/blob/bab49791284a2670342c28413ef651b282e56d2d/command/meta_providers.go#L318-L319 override:

StartTimeout:     5 * time.Minute,

Actual time is around 1 minute 15 seconds for now.

Maybe Terraform could have a config or a command line arg for that.

But I see it as just a hack. I think it's either #27170 or an error in provider' design (intensive work should not be done in the start phase).

ketzacoatl commented 3 years ago

As mentioned in the linked issue - hashicorp/terraform-provider-nomad#198 - we've seen this timeout with the nomad provider, on linux CI hosts running on AWS (with no trendmicro antivirus).

ketzacoatl commented 3 years ago

Does anyone know when/why or under what circumstances, this bug is triggered? How does one avoid it? I'm seeing it frequently, but not all the time (about 1 in 4).

suzuki-shunsuke commented 3 years ago

Similar issue is reported as https://github.com/kreuzwerker/terraform-provider-docker/issues/138 .

drdamour commented 3 years ago

@blindzero can u share your trend ticket number? i'd liek to refer to it in our case

blindzero commented 3 years ago

@drdamour: sorry, although I fully understand why you are asking I cannot share this company related information. But I escalated this to our account managers as well ;)

drdamour commented 3 years ago

something anecdotal i noticed is that it seems to occur more frequently when i have azurestorageemulator running, this thing reserves ports 10000, 10001, and 10002 for it's evil deeds. it seems like these are ports that the provider tries to own sometimes (when i have hung provider process storage emulator complains it can't get those ports) anyone else experiencing this seen anything similar where something may be using similar ports?

ketzacoatl commented 3 years ago

I can confirm this still an issue with Terraform 0.14.x as well as 0.15.1.

Is this possibly a resource issue? If the plugin is too slow to start?

I see this often (but intermittently) in CI, but never locally.

Is the timeout configurable? Should it be raised? Or is it like 30s and there's actually a different problem going on.

Apologies for being a bit ignorant on the details here, I don't really know what's at play.

ketzacoatl commented 3 years ago

@danieldreier / @apparentlymart, maybe you could help me better understand: is this an issue with Terraform core, the plugin, or golang? I'm having a difficult time following the details across the discussion.

preethamrajk commented 3 years ago

Hi, is there an update on this issue?

davidcaballeroAud commented 3 years ago

Got this exact same issue on terraform 0.14.5 and different provider:

Failed to instantiate provider "registry.terraform.io/mongey/kafka" to obtain
schema: timeout while waiting for plugin to start

Is the timeout configurable?

jwenz723 commented 1 year ago

I experienced this issue today using azurerm provider version 3.79.0 and terraform version 1.3.1.

What I found is that the azurerm provider tries to automatically register all resource providers which are needed. It seems like it will continue to loop while any resource provider is in a registering state. For some reason I had 1 resource provider which was stuck in the registering state. I had to go into the Azure portal and retry the registration of the resource provider. Once it was in a registered state the azurerm provider continued past the registration loop and successfully executed.

apparentlymart commented 5 months ago

Hi all!

We haven't heard about this problem for a while now, but I notice that the upstream Go issue https://github.com/golang/go/issues/36492 is still open and so I'm assuming nothing was intentionally changed in the Go runtime to correct this. I also can't find any further reports about this in the Go repository that are newer than the most recent comment on this issue, in Nov 2023.

Since this problem was bumping into what is arguably a bug in TrendMicro's approach to intercepting system calls, I wonder if they've fixed that problem in the meantime and thus the Go runtime's behavior is no longer triggering that bug.

The Terraform team does not have access to the TrendMicro software and so we have no way to reproduce or test this problem. Do any of you who work at companies that use TrendMicro have any insight into whether something has changed here? I see in discussion above that some of you opened tickets with TrendMicro support about the problem, and so I wonder if those tickets gave you any signal about the status of this problem.

Thanks!

drdamour commented 5 months ago

We moved from trend to defender and this problem stopped..doesnt answer your question unfortunately