vcd provider v2.4.0 regression, edge gw config does not complete, waits forever

pasikarkkainen commented 5 years ago

Terraform Version

Terraform: v0.11.14 OS: Linux CentOS 7.6 x86_64

Terraform vcd provider Version

vcd provider: v2.4.0 vCloud Director: 9.7

Affected Resource(s)

vcd_dnat vcd_firewall_rules

Expected Behavior

vcd provider should be able to successfully provision Edge Gateway Firewall rules and NAT rules.

Actual Behavior

vcd provider gets "stuck" waiting forever, provisioning does not finish, resources don't get provisioned, unless one cancels terraform (with ctrl+c) and re-runs "terraform apply" many times.

Problem description:

vCD user: Tenant Org Admin.

Terraform vcd provider v2.2.0 and v2.3.0 work OK for me, but v2.4.0 does not seem to work in my environment.

Edge Gateway operations like Firewall rules, NAT rules, LAN creation wait forever without ever completing (stuck in "Still creating..." for vcd_dnat and vcd_firewall_rules objects). I've been waiting for hours, but it doesn't help, vcd provider is stuck waiting and nothing happens.

It seems only a few edge gw changes succeed per "terraform apply" run, and the rest get stuck waiting forever.. This seems fully reproducible in my environment. Basicly I need to cancel terraform (with ctrl+c) and re-run it many times until all the edge gw config gets done. Finally in the, after many re-runs, all the edge gateway settings are provisioned successfully.

I noticed the changelog mentioned "locking changes" in v2.4.0, so I wonder if those are related..

Any tips how to troubleshoot this? I can't spot anything obvious when running with TF_LOG=DEBUG/TRACE.

Didainius commented 5 years ago

Thanks for the issue @pasikarkkainen. Could you share the TF_LOG in a gist (just check it doesn't contain your data). Also could you share a simplified config which hits the problem?

pasikarkkainen commented 5 years ago

@Didainius Yep, I'm working on a minimal config which triggers the issue. I'll post the config and logs when done.

pasikarkkainen commented 5 years ago

@Didainius here we go: main.tf: http://pasik.fi/terraform-vcd-provider-v2.4.0-edge_gw-config-regression-main.tf terraform apply log: http://pasik.fi/terraform-vcd-provider-v2.4.0-edge_gw-config-regression-log.txt

I let it run for some ~1.5h, but it won't succeed even if it runs overnight. with vcd provider v2.3.0 using the exact same main.tf (well, the vcd provider version string changed of course) succesfully completes in a couple of minutes, every time.

I tried multiple times with vcd provider v2.4.0, and sometimes on the first run it manages to create for example the routed network, but not always. In the log file I linked the routed network was not created. It seems to be pretty random. In any case vcd provider 2.4.0 seems to get "stuck" every time though, and never completes on the first run (it only completes if I ctrl+c and re-run it a few times).

Info about the pre-provisioned vCloud environment (these have been created before running terraform):

1x Org VDC.
1x Edge Gateway in the VDC, with 1x External Network and 1x External IP.
1x routed LAN network (which is not used at all in main.tf).

Let me know if you need more info, or if you want me to test something.

Didainius commented 5 years ago

Thank you. I have tried to reproduce on the basis of your config (your config with my vars) on vCD 9.5 and didn't yet. Could you try to reproduce this one more with this option enabled https://www.terraform.io/docs/providers/vcd/index.html#logging. It should generate a logfile with API calls. Double check if there is something you need to hide before sharing the config. Also - does this ever happen during "destroy" phase or only during "create"?

You shouldn't need to wait for an hour. If it is running 15 minutes instead of the 2-3 you said it takes usually it should be safe to stop and share both log files (the API calls and TF_LOG=TRACE)

lvirbalas commented 5 years ago

@pasikarkkainen would you also be able to try on a different vCD version? Thank you.

pasikarkkainen commented 5 years ago

@Didainius @lvirbalas Thanks! I'll get the API call log, and I'll see if I can try with vCD 9.5 aswell. My primary environment is vCD 9.7 currently.

pasikarkkainen commented 5 years ago

@Didainius I've only noticed this issue during 'create' phase.

terraform log with TF_LOG=TRACE: http://pasik.fi/terraform-vcd-provider-v2.4.0-edge_gw-config-regression-log-2.txt

VCD_API_LOGGING: http://pasik.fi/terraform-vcd-provider-v2.4.0-edge_gw-config-regression-log-2-api.txt

I cancelled the 'apply' operation at around 20 mins, it obviously was stuck and not progressing at all.

Let me know if you need something else!

Didainius commented 5 years ago

@pasikarkkainen, thanks for the logs!

tl;dr I have put the investigation details here, but you may jump to Solution part

Cause

The API log call shows that the last API request was done at 2019/09/25 21:15:08 while you stopped Terraform roughly at 2019/09/25 21:38:12

This hints that there was a deadlock inside Terraform's code. I suspect this is the culprit. (the first line acquired the lock but never released it. the second line just stayed there forever). Locks are needed because vCD does not allow to perform simultaneous operations on an edge gateway. So although Terraform parallelizes resource creation we have to do one at a time (if they reside on the same edge gateway)

2019-09-25T21:15:07.707+0300 [DEBUG] plugin.terraform-provider-vcd_v2.4.0_x4: 2019/09/25 21:15:07 [DEBUG] Locked "org:usernamedev1|vdc:DemoVDC|edge:DemoVDC_GW"
2019-09-25T21:15:07.714+0300 [DEBUG] plugin.terraform-provider-vcd_v2.4.0_x4: 2019/09/25 21:15:07 [DEBUG] Locking "org:usernamedev1|vdc:DemoVDC|edge:DemoVDC_GW"

It looks that the resource responsible for never releasing the lock is vcd_dnat.servicehost-http1-9. You will see such messages throughout the log. Obviously it should take less than 24 minutes for one DNAT rule :)

vcd_dnat.servicehost-http1-9: Still creating... (24m0s elapsed)
vcd_dnat.servicehost-http1-9: Still creating... (24m10s elapsed)
vcd_dnat.servicehost-http1-9: Still creating... (24m20s elapsed)

I very much doubt that this is some problem with this exact resource and would bet that the next time it could be different one.

Now looking at the API call log you have shared I see this is the payload for rule creation (there is more objects in that payload, but this is the rule that is being created), but it never gets a response (no more information in the API log after this POST)

2019/09/25 21:15:08 Request caller: vcd.resourceVcdDNATCreate-->vcd.resourceVcdDNATCreate-->govcd.(*EdgeGateway).AddNATPortMappingWithUplink-->govcd.executeRequest
2019/09/25 21:15:08 POST https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices
2019/09/25 21:15:08 --------------------------------------------------------------------------------
2019/09/25 21:15:08 Request data: [8216] <?xml version="1.0" encoding="UTF-8"?>
...
<NatRule>
              <RuleType>DNAT</RuleType>
              <IsEnabled>true</IsEnabled>
              <GatewayNatRule>
                  <Interface href="https://vcloudapi.dev.local/api/admin/network/b4efaad4-5d99-4ac7-8169-8e7bc14e713a"></Interface>
                  <OriginalIp>172.16.2.11</OriginalIp>
                  <OriginalPort>6443</OriginalPort>
                  <TranslatedIp>10.10.10.11</TranslatedIp>
                  <TranslatedPort>6443</TranslatedPort>
                  <Protocol>tcp</Protocol>
              </GatewayNatRule>
          </NatRule>
...

Problem

We have witnessed a few times a situation where vCD does not respond to API call but our underlying SDK (go-vcloud-director) did not have an HTTP timeout set. It would just stay there in limbo forever.

The reason that this has not occurred prior 2.4.0 is that we have vastly improved simultaneous API call handling. Previously the provider would just try to provision everything at once (as I said vCD doesn't support it) and simply ignore any errors and retry it in a loop until some hard timeout. Now in 2.4.0 this is much more sophisticated and is not sending excessive erroneous API calls anymore, but an API request without response made it fall down and cry...

Solution

This was addressed with PR https://github.com/vmware/go-vcloud-director/pull/231 and the 2.5.0 version will actually use it. This wouldn't really solve HTTP not responding, but at least should throw our the HTTP client error - maybe it hints what is wrong with vCD and why it does not respond correctly to the API call.

I would like to ask you to build our provider from master branch (https://github.com/terraform-providers/terraform-provider-vcd) and try replicating the issue. It will use the newer SDK with timeout built in. If my assumptions are correct such case would fail after 600s (10 minutes so far) with an HTTP error which could help us understand your case.

I would like to see what error you have with this version. Below I have put a few lines how to build the provider from master branch. Do let me know if you need help and I could build it for you (just need to know your OS) and share somehow.

A short note on building the provider plugin

Prerequisite: Go 1.13 installed

On macOS and Linux to install it this should be as simple as

$ cd ~/mydir
$ git clone https://github.com/terraform-providers/terraform-provider-vcd.git
$ cd terraform-provider-vcd/
$ make install

It should put the plugin into your terraform plugins directory: ~/.terraform.d/plugins//terraform-provider-vcd_v2.5.0

Our make tooling will not be too friendly with Windows (although could work with Cygwin, but no need for that).

On Windows it should work like that:

$ cd ~/mydir
$ git clone https://github.com/terraform-providers/terraform-provider-vcd.git
$ cd terraform-provider-vcd/
$ go build -o terraform-provider-vcd_v2.5.0

Then you have to pick the binary file of plugin terraform-provider-vcd_v2.5.0 and move it to your working directory (where your .tf file exists).

Just to be sure it picks it up you can put the version constraint into provider section like that:

provider "vcd" {
  ...
  allow_unverified_ssl = "true"
  version              = "= 2.5.0"
  logging              = true
  ...
}

Do terraform init (to pick up the newer plugin) and use the regular commands further on

pasikarkkainen commented 5 years ago

Here are the logs with vcd provider v2.5.0 (built from master branch).

terraform log with TF_LOG=TRACE: http://pasik.fi/terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt

VCD_API_LOGGING: http://pasik.fi/terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3-api.txt

I cancelled the create/apply operation at around ~30 mins, as it was stuck doing nothing.

pasikarkkainen commented 5 years ago

@Didainius Let me know if you need me to try other things!

Didainius commented 5 years ago

@pasikarkkainen I'm having a second look exactly now :) Should post some update just a bit later

Didainius commented 5 years ago

Ok, indeed the newer version helped and I see what is happening:

$ grep -Ri "timeout"  terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt
terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt:      * vcd_dnat.servicehost-http1-3: error setting DNAT rules: &errors.errorString{s:"error reconfiguring Edge Gateway: Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt:      * vcd_dnat.servicehost-http1-3: error setting DNAT rules: &errors.errorString{s:"error reconfiguring Edge Gateway: Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt:      * vcd_dnat.servicehost-http3: error setting DNAT rules: &errors.errorString{s:"error reconfiguring Edge Gateway: Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt:      * vcd_dnat.servicehost-http3: error setting DNAT rules: &errors.errorString{s:"error reconfiguring Edge Gateway: Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt:      * vcd_dnat.servicehost-http1-11: error setting DNAT rules: &errors.errorString{s:"error reconfiguring Edge Gateway: Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
terraform-vcd-provider-v2.5.0-edge_gw-config-regression-log-3.txt:      * vcd_dnat.servicehost-http1-11: error setting DNAT rules: &errors.errorString{s:"error reconfiguring Edge Gateway: Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}

I was expecting that one API call would fail/timeout but it wouldn't be such a widespread problem. Your log shows 6 API calls that timed out (which is now hardcoded for 10 minutes). Probably waiting more would show even more such timeouts. This makes me think about any rate limiting and/or resource starvation at some level. A few questions coming out of that:

Is your edge gateway advanced? Does it have any other specific things, like HA configured?
Do you have any proxy, load balancer, firewall on workstation (or anything else) that could enforce connection rate limiting and blocking in between?
Can it be that your environment suffers from lack of resources?
Would you have any other env to test this on for the sake validating the problem?

A bit of experiment but:

Wondering if you could also try to run follow the log and if it starts timing out, does the HTML5 UI work at all (the edge gateway NAT tab)? (the HTML5 UI does pretty much the same API calls that Terraform is doing)

I did test your example on 9.5 previously (just put in my vars) and it worked. My environment is all packed into a single laptop so isn't too fast as well, but haven't seen such issue. I'm going to retest it on it a few more times on more versions.

Didainius commented 5 years ago

@pasikarkkainen + one more question. Do you have exact vCD version (I know it is 9.7, but maybe you have the build number?)

pasikarkkainen commented 5 years ago

Quick answers:

vCloud Director is version 9.7.0.2 (installed build 14029997).
Edge Gateway is Advanced, as vCD 9.7 requires all Edges to be Advanced, otherwise you can't edit the configuration.
There is a reverseproxy/LB in front of vCD cells, but there is no rate limit configured there.
I don't think it's a lack of resources -type of a problem.. remember Terraform vCD provider version 2.3.0 works all the time, while vCD provider 2.4.0/2.5.0 fails all the time..

I'll dig deeper and do more testing. I'll also check if I'm able to try in a different environment.

pasikarkkainen commented 5 years ago

I did a couple of test runs with terraform, and when checking the vCD cells LB logs I can see terraform provider (2.5.0) goes to some "silent mode" and does not do anything for 10 mins.. meaning in the beginning it does some requests, and then there's silence for next 10 minutes, and then some requests again, and then silence for 10 minutes.. it feels like the vcd provider http client goes to some wrong mode and just waits doing nothing? is that possible? I guess it's time to read the code :)

Didainius commented 5 years ago

Thanks for these details! So the provider sends one POST request to create DNAT rule and does not get the response as per the API logs. Because the edge gateway allows only one operation at a time, starting with 2.4 we are not firing anymore DNAT requests until the one is done (previously all simultaneous API calls were reporting - "resource is busy" and retrying). But that request is hung as it can be seen in the API log. It sent a request, but did not get a response. After these 10 minutes in 2.5 it times out and gives its way to another call. This is the reason why you see silence for 10 minutes.

I wonder if you could identify if the last API request in API log was processed by load balancer and sent the response back to Terraform provider? (you can see what was the endpoint and payload for that call in the API log)

Also for the sake of experiment, is there a way to bypass the reverseproxy/LB to rule it out? Or another env?

Didainius commented 5 years ago

@pasikarkkainen, I have built your version of vCD 9.7.0.2.14029997 and tried out your config. This worked for me, but I have made an experiment and interrupted my network connection a few times during terraform apply run. This gave me similar results to yours.

Now I have a made a special branch with lower HTTP timeout value (10 seconds instead of 600) and some "retry" mechanism for timed out HTTP requests which retries the timed out call up to 5 times. Also added extra logging. Could you build the provider from this branch and give it a spin. https://github.com/Didainius/terraform-provider-vcd/tree/client-problems

Please share the logs again (even if it doesn't fail). Some additional logging should be in place and help us troubleshoot the issue.

Thanks!

pasikarkkainen commented 5 years ago

@Didainius Thanks a lot! I'll give it a try soon. Meanwhile I was busy hacking the sources myself, I also added and configured various timeouts with lower values, and added lots of debug logging. Here's what I found out so far:

It feels like the http client in vcd provider is getting confused/corrupted somehow. It's reporting lots of totally messed up connection errors, which don't make much sense, as all my manual connection attempts using curl etc always work 100% OK (and vcd provider 2.3.0 works OK aswell).

I also enabled verbose debug logging in the LB, and the API requests which show up as failed in vcd provider are actually reported as being OK at the LB (with http status codes 200 / 202, etc). I also checked how long various API requests take from the LB logs, and it seems the duration of most of the API requests is less than 2 seconds. So it's super weird that the vcd provider waits forever on those requests.. LB reports the requests/responses being OK with total duration of 1-2 seconds, but vcd provider waits doing nothing until http timeout is triggered. Somehow the http client in vcd provider gets corrupted/broken and doesn't receive the responses properly.

What actually fails when the vcd provider API requests are reported as failed:

Seems to be a call to "client.Http.Do(req)" in go-vcloud-director govcd/api.go function executeRequestCustomErr().

Reason for the connection failure seems to be quite random, it's one of:

"net/http: TLS handshake timeout"
"net/http: timeout awaiting response headers"
"net/http: HTTP/1.x transport connection broken: write tcp client_ip:50304->vcd_lb_ip:443: write: bad file descriptor"
"Post https://vcloudapi.dev.local/api/admin/edgeGateway/d6889a1f-130a-42bd-ae47-77e4bbc5012e/action/configureServices: dial tcp: i/o timeout"

I just tried doing 100x API requests to vCD (via the LB of course, exactly in the same way as vcd provider does) in a loop, using curl, and every single one of them worked OK. So definitely something weird is going on..

Is it possible it's some kind of concurrency issue, which corrupts the internal http client state, which causes connection errors?

pasikarkkainen commented 5 years ago

@Didainius and here are logs with vcd provider built from your branch (https://github.com/Didainius/terraform-provider-vcd/tree/client-problems):

Unfortunately 'terraform apply' didn't work. It seems even with 5 retries still some of the requests failed.. I'm having hard time believing those errors are caused by network, as all my testing with curl works OK..

vbauzys commented 5 years ago

Hi @pasikarkkainen. Thank you for extend analysis from your side. Currently Dainius is already on vacation. We continue on investigation, just wanted outline a few things for you regarding your remarks:

client.Http.Do(req) - is standard golang library and it is concurrent safe - https://golang.org/src/net/http/client.go. So I wouldn't go deep to that side for an issue. Otherwise all community would be buzzing about not working library. Also we confirmed/tested with remote and local env that resource is working and we don't see HTTP handling issues.
2.3.0 version was working cause it also uses a lot of retries everywhere by default. I suspect you would see a lot of same errors in the logs with that version. As a matter of fact this version just allowed do retries unlimited times until configured timeout happens.
Latest version has limited concurrency - it depends on vCD resource. So in your case edge gateway can work only with one request - so provider locks on edge gateway and works with rules one by one.

We will continue analysis with your provided data. Major difference I see with our tested env is that you have additional reverseproxy/LB.

pasikarkkainen commented 5 years ago

@vbauzysvmware Thanks a lot! I'll investigate trying with with another LB aswell to see if that makes a difference.

pasikarkkainen commented 5 years ago

Ok some news. I spent more time reading and instrumenting the code with more verbose debug logging, and I finally figured out the reason why the 'retry logic' added by @Didainius didn't help ( https://github.com/Didainius/terraform-provider-vcd/tree/client-problems ).

There's a bug in the added http request retry code, making all the retried POST/PUT requests to fail. This code:

req = client.NewRequest(map[string]string{}, requestType, *url, body)

in govcd/api.go "executeRequestCustomErr()" needs to be:

req = client.NewRequest(map[string]string{}, requestType, *url, ioutil.NopCloser(body))

otherwise the first "client.Http.Do(req)" call reads and empties the req.body, and all the other retry attempts fail with error message "http: ContentLength= with Body length 0", because indeed the req.body has been emptied on the first attempt.

So with that fixed it seems the retry logic indeed helps! It seems all the requests which fail on the first attempt seem to succeed on the second try!

So now I can actually get "terraform apply" to report success with "Apply complete! Resources: 23 added, 0 changed, 0 destroyed."

I still need to investigate why some of the requests fail on the first attempt..

pasikarkkainen commented 4 years ago

Actually, the change above is not enough to fix the retry logic. It seems the golang http client does not support re-using the same http req between multiple requests/tries - one needs to re-set the req body before each try. After that the retry logic works OK.

I'm still trying to find time to do more debugging to figure out the root cause for http request problems.

Didainius commented 4 years ago

Thank you dor all the investigation work. I am back after holidays and will have a look for proper retry solution as well

pasikarkkainen commented 4 years ago

I just tried the latest terraform-provider-vcd v2.5.0 release, and I'm still seeing this problem. I'll have another debugging session in the near future, and hopefully can get this sorted out. Any new ideas to try btw?

Didainius commented 4 years ago

I just tried the latest terraform-provider-vcd v2.5.0 release, and I'm still seeing this problem. I'll have another debugging session in the near future, and hopefully can get this sorted out. Any new ideas to try btw?

Hi! My latest idea about the probably proper solution could be https://github.com/hashicorp/go-retryablehttp instead of trying to patch existing go http client. However I haven't had time to try and swap this client with our current one.

vmware / terraform-provider-vcd