
local-exec and remote-exec resource appears stalled while Terraform keeps waiting for them and output from script is not displayed [ was Getting the remote-exec provisioner to detect SSH connectivity issues (with or without bastion_host)] #12139

Closed fisabelle closed 5 years ago

fisabelle commented 7 years ago

Hi. It seems like my 'remote-exec' resource continues running after the connection to my SSH bastion (and/or final target) drops. I haven't been able to identify the cause of the disconnection, but clearly Terraform 0.8.7 does not detect it. I was looking at the 'connection' parameters to see if I could enable a keep-alive as a workaround, but it does not seem to be supported.

This seems highly intermittent and potentially related to some instability with our local network at the moment, but I'd rather have a fast failure than a stuck resource-creation process.

UPDATE 2017-03-22: I'm getting the same result using local-exec commands that SSH to the target, and the same result whether or not an SSH bastion is used. At the same time, I'm seeing SSH disconnections when connected to the target from a separate shell.

Either something is re-spawning the SSH server on my server, or something in the AWS network gets reconfigured and/or the connection is torn down, as I suspected initially. So I understand one could argue that this is NOT a Terraform bug per se. However, it must be possible to mitigate this problem.

Terraform Version

0.8.8

Affected Resource(s)

null_resource (with file and remote-exec provisioners)

Terraform Configuration Files

resource "null_resource" "prep_install" { connection { user = "${var.ssh_user}" private_key = "${var.ssh_private_key}" timeout = "10m"
host = "${element(aws_instance.instance.*.private_ip, count.index)}" bastion_host = "${var.bastion_host}" bastion_user = "${var.bastion_user}" bastion_private_key = "${var.bastion_private_key}" }

provisioner "file" { source = "${path.module}/scripts/" destination = "/tmp/" }

provisioner "remote-exec" { inline = [ "set -eu", "sudo chmod +x /tmp/*.sh", "sudo /tmp/run.sh", ] } }

module.my.null_resource.prep_install.0 (remote-exec):
module.my.null_resource.prep_install.0: Still creating... (1m50s elapsed)
module.my.null_resource.prep_install.0: Still creating... (2m0s elapsed)
...

Expected Behavior

For instance, 30 seconds after the connection is lost (no output or keepalive from remote-exec), the resource creation would fail; or an SSH reconnection would be attempted and the remote-exec resumed somehow.

Actual Behavior

The remote-exec seems to have completed while the session was disconnected; the resource creation failed after the specified timeout.

Steps to Reproduce

Unable to reproduce yet.

Important Factoids

AWS VPC

fisabelle commented 7 years ago

Adding new data in case it helps figure out what is happening.

So I'm not sure anymore whether this problem is caused by an SSH connectivity issue or rather by an issue in the Go SSH client/proxy code used by Terraform and the way it forwards I/O. I'm in the process of reworking my bastion "remote-exec" into a "local-exec" with an explicit SSH connect+proxy "command", to see how this behaves with the non-native (OpenSSH) SSH client.

It seems like most of the SSH tunnel setup is built into the Go SSH code, and this is my number one suspect.

Note: I haven't had a chance to see whether this only occurs when the SSH bastion is being used. (I will try to test that tomorrow.)

roman-vynar commented 7 years ago

The problem is

provisioner "file" { content="" ... }

It makes Terraform stall.
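If that diagnosis applies, one way to sidestep it is to avoid an empty content argument entirely and upload from a file instead; a minimal sketch (the script path and destination here are hypothetical):

provisioner "file" {
  # Upload a real file instead of passing content = "", which reportedly
  # makes the apply hang.
  source      = "${path.module}/scripts/bootstrap.sh"
  destination = "/tmp/bootstrap.sh"
}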

charltones commented 6 years ago

I just experienced the same symptoms when connecting over a corporate VPN. If I disconnect from the VPN, the apply runs fine. If I connect to the VPN, the output from apply seems to stop at some point (different each run) and Terraform hangs forever saying 'still creating'. When I logged into the target machine, it had actually finished the remote-exec it was in the middle of, so the commands had run but their output had not been shown.

jsirex commented 6 years ago

Same issue with Terraform 0.11.1.

nerguri commented 6 years ago

I also had the same issue using local-exec to run an ssh command to a resource through a bastion host (with ProxyCommand). I suspected my ssh configuration might be related to this issue.

At first I turned off all the "ControlMaster"-related options in ~/.ssh/config, and the issue went away. So I then tuned the ssh command in the local-exec instead, adding -o ControlMaster=no, and that also worked well.
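For reference, a minimal sketch of that workaround as a local-exec provisioner; the bastion address, user, key path, instance reference, and remote command are all hypothetical:

resource "null_resource" "run_remote" {
  provisioner "local-exec" {
    # ControlMaster=no disables OpenSSH connection multiplexing, so a stale
    # control socket cannot leave the ssh command (and the apply) hanging.
    # ProxyCommand tunnels the connection through the bastion host.
    command = "ssh -o ControlMaster=no -o ProxyCommand='ssh -W %h:%p ec2-user@bastion.example.com' -i ~/.ssh/id_rsa ec2-user@${aws_instance.instance.private_ip} 'sudo /tmp/run.sh'"
  }
}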

My case is a little different from the original one, but I hope my workaround approach will be helpful in resolving the issue.

naquin commented 6 years ago

Also having the same issue when using a bastion host. The connection is successful, but it then never gets around to executing any commands.

Denn0 commented 6 years ago

Still facing this issue. Very annoying...

prvnmali2017 commented 6 years ago

The problem is with null_resource; I get the errors below.

```
1 error(s) occurred:

* null_resource.puppet-mercury: 1 error(s) occurred:

* ssh: rejected: administratively prohibited (open failed)

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
```

My null_resource

```resource "null_resource" "puppet-mercury" {
  depends_on       = ["aws_instance.puppet-mercury"]
  triggers {
    cluster_instance_ids = "${join(",", aws_instance.puppet-mercury.*.id)}"
  }
 provisioner "puppet" {
   connection {
     user         = "user"                            //username for ssh connection
     type         = "ssh"
     agent        = "false"      timeout      = "1m"
     bastion_host = ["${aws_instance.nat01.public_ip}"]
    bastion_private_key = "${file("/Users/malipr/.ssh/<<KEY>>")}"
      host = "${element(aws_instance.puppet-mercury.*.private_ip, 0)}"
      private_key = "${file("/home/praveen/.ssh/<<KEY>>")}"
   }
    puppetmaster_ip = "${var.puppet_master_ip}" #ip of Puppet Master
    use_sudo        = true
  }```
b-dean commented 5 years ago

@prvnmali2017, I don't think your issue is the same. We are all able to connect to the remote host and run remote provisioners (like remote-exec or chef or whatever); the problem is that something is closing the connection and Terraform never notices. I think your error is different: you're never able to establish the connection in the first place.

Has anyone else come up with a good workaround for this? Or figured out what is killing the ssh connection? I'm constantly being affected by this and it's very annoying that the chef provisioner just keeps going forever and ever, even though its connection has been dead for hours. I'm using terraform 0.11.8

You can stop terraform without corrupting your tfstate by finding the process for

/usr/bin/terraform internal-plugin provisioner chef

(or remote-exec or whatever provisioner is "stuck") and killing it with kill -9. Then the main terraform process gets an "unexpected EOF" from chef and fails more gracefully. Not really a workaround for this problem, but a way to not screw up your tfstate if you have to kill terraform.

prvnmali2017 commented 5 years ago

@b-dean no worries, I fixed it up. Thanks a lot for reviewing.

michael-db commented 5 years ago

This problem still exists with remote-exec in Terraform v0.11.10 (terraform --version):

null_resource.bootkube-start: Still creating... (22m30s elapsed)
null_resource.bootkube-start: Still creating... (22m40s elapsed)
null_resource.bootkube-start: Still creating... (22m50s elapsed)

When I ssh to the machine I can see that the command has completed.

chgasparoto commented 5 years ago

Hi @michael-db, I was able to fix it by putting sleep into my commands.

provisioner "remote-exec" {
    inline = [
      "sleep 10",
      "sudo yum update -y",
      "sudo yum install -y httpd",
      "sudo service httpd start",
      "sudo chkconfig httpd on",
      "sleep 10"
    ]

    connection {
      type        = "${local.conn_type}"
      user        = "${local.conn_user}"
      timeout     = "${local.conn_timeout}"
      private_key = "${local.conn_key}"
    }
  }

I ended up visiting a page on Stack Overflow and taking a look at the Packer documentation (search for "My builds don't always work the same").

Some distributions start the SSH daemon before other core services, which can create race conditions. Your first provisioner can tell the machine to wait until it has completely booted.
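Along those lines, the first inline command can wait for boot to finish rather than sleeping for a fixed time; a minimal sketch, assuming the image runs cloud-init:

provisioner "remote-exec" {
  inline = [
    # Block until cloud-init reports that first boot has finished;
    # "|| true" lets the run continue on images without cloud-init.
    "cloud-init status --wait || true",
    "sudo yum install -y httpd",
  ]
}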

michael-db commented 5 years ago

@chgasparoto Thanks, Cleber. Glad that worked for your case. It depends on the reason for the disconnection, which in our case was an intermediate node rebooting after upgrading itself. We've managed to stop that from happening. This issue remains, of course: a disconnection should be detected if it happens.

ghost commented 4 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.