Closed fisabelle closed 5 years ago
Adding new data in case it helps figuring out what is happening.
So I'm no longer sure whether this problem is due to an SSH connectivity issue or rather an issue with the Go SSH client/proxy code used by Terraform and the way it forwards I/O. I'm in the process of reworking my bastion's "remote-exec" into a "local-exec" with an explicit SSH connect+proxy "command" to see how this behaves with the non-native (OpenSSH) SSH client.
It seems like most of the SSH tunnel setup is built into the Go SSH library, and this is my number one suspect.
Note: I didn't have a chance to see if this only occurs when the SSH bastion is being used, yet. (I will try to test that tomorrow)
The problem is:
`provisioner "file" { content = "" ... }`
It makes Terraform stall.
I just experienced the same symptoms, when connecting over a corporate VPN. If I disconnect from the VPN the apply runs fine. If I connect to the VPN then the output from apply seems to stop at some point (different each run) and terraform hangs forever saying 'still creating'. When I logged into the target machine it had actually finished the remote-exec it was in the middle of - so commands had run but their output had not been shown.
Same issue with Terraform 0.11.1.
I also had the same issue while using local-exec to run an ssh command against a resource through a bastion host (with ProxyCommand).
I suspected my SSH configuration could be related to this issue.
At first I turned off all the "ControlMaster"-related options in ~/.ssh/config, and the issue went away. So instead I tuned the option in the ssh command in local-exec, adding -o ControlMaster=no, and that also worked well.
My case is a little different from the original one, but I hope this workaround approach is helpful in resolving the issue.
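The workaround above can be sketched as a Terraform `local-exec` provisioner. This is only an illustration; the resource name, host addresses, and user names are hypothetical placeholders:

```hcl
resource "null_resource" "ssh_via_bastion" {
  provisioner "local-exec" {
    # ControlMaster=no disables connection multiplexing, so a stale
    # control socket from ~/.ssh/config cannot hang the session.
    # The ProxyCommand tunnels through the (hypothetical) bastion host.
    command = "ssh -o ControlMaster=no -o ProxyCommand='ssh -W %h:%p user@bastion.example.com' user@10.0.0.5 'echo connected'"
  }
}
```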
Also having the same issue when using a bastion host. The connection is successful, but then doesn't get around to executing any commands.
Still facing this issue. Very annoying...
The problem is with null_resource; I get the errors below.
```
1 error(s) occurred:

* null_resource.puppet-mercury: 1 error(s) occurred:

* ssh: rejected: administratively prohibited (open failed)

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
```
My null_resource:
```hcl
resource "null_resource" "puppet-mercury" {
  depends_on = ["aws_instance.puppet-mercury"]

  triggers {
    cluster_instance_ids = "${join(",", aws_instance.puppet-mercury.*.id)}"
  }

  provisioner "puppet" {
    connection {
      user                = "user" // username for ssh connection
      type                = "ssh"
      agent               = "false"
      timeout             = "1m"
      bastion_host        = ["${aws_instance.nat01.public_ip}"]
      bastion_private_key = "${file("/Users/malipr/.ssh/<<KEY>>")}"
      host                = "${element(aws_instance.puppet-mercury.*.private_ip, 0)}"
      private_key         = "${file("/home/praveen/.ssh/<<KEY>>")}"
    }

    puppetmaster_ip = "${var.puppet_master_ip}" # ip of Puppet Master
    use_sudo        = true
  }
}
```
@prvnmali2017, I don't think your issue is the same. We are all able to connect to the remote host and run remote provisioners (like remote-exec or chef). The problem is that something is closing the connection and Terraform never notices. Your error is different: you're never able to establish the connection in the first place.
Has anyone else come up with a good workaround for this? Or figured out what is killing the SSH connection? I'm constantly affected by this, and it's very annoying that the chef provisioner just keeps going forever, even though its connection has been dead for hours. I'm using Terraform 0.11.8.
You can stop Terraform without corrupting your tfstate by finding the process for `/usr/bin/terraform internal-plugin provisioner chef` (or remote-exec, or whatever provisioner is "stuck") and killing it with `kill -9`. The main terraform process then gets an "unexpected EOF" from chef and fails more gracefully. Not really a workaround for this problem, but a way to not screw up your tfstate if you have to kill Terraform.
@b-dean no worries, I fixed it up. Thanks a lot for reviewing.
This problem still exists with remote-exec in Terraform v0.11.10.
null_resource.bootkube-start: Still creating... (22m30s elapsed)
null_resource.bootkube-start: Still creating... (22m40s elapsed)
null_resource.bootkube-start: Still creating... (22m50s elapsed)
When I ssh to the machine I can see that the command has completed.
Hi @michael-db, I was able to fix it by putting `sleep` into my commands.
```hcl
provisioner "remote-exec" {
  inline = [
    "sleep 10",
    "sudo yum update -y",
    "sudo yum install -y httpd",
    "sudo service httpd start",
    "sudo chkconfig httpd on",
    "sleep 10"
  ]

  connection {
    type        = "${local.conn_type}"
    user        = "${local.conn_user}"
    timeout     = "${local.conn_timeout}"
    private_key = "${local.conn_key}"
  }
}
```
I ended up visiting this page on Stack Overflow and then taking a look at the Packer documentation (search for "My builds don't always work the same").
Some distributions start the SSH daemon before other core services which can create race conditions. Your first provisioner can tell the machine to wait until it completely boots.
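On images that use cloud-init, that wait can be made explicit instead of relying on a fixed sleep. This is only a sketch; it assumes cloud-init 17.2 or later is installed on the target image:

```hcl
provisioner "remote-exec" {
  inline = [
    # Block until cloud-init reports that the boot sequence has finished,
    # so later commands don't race the SSH daemon or other core services.
    "cloud-init status --wait",
    "sudo yum update -y",
  ]
}
```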
@chgasparoto Thanks, Cleber. Glad that worked for your case. It depends on the reason for the disconnection, which in our case was an intermediate node rebooting after upgrading itself. We've managed to stop that happening. This issue remains of course, i.e., disconnection should be detected if it happens.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Hi. It seems like my 'remote-exec' resource continues running after the connection to my SSH bastion (and/or final target) drops. I haven't been able to identify the cause of the disconnection, but clearly Terraform 0.8.7 does not detect it. I was looking at the 'connection' parameters to see if I could enable a keep-alive as a workaround, but it does not seem to be supported.
This seems highly intermittent and potentially related to some instability with our local network at the moment but I'd rather have a fast failure than a stuck resource creation process.
UPDATE 2017-03-22: I'm getting the same result using local-exec that SSHs to the target, and the same result regardless of whether an SSH bastion is used. At the same time, I'm getting SSH disconnections when connected to the target from a separate shell.
Either something is re-spawning the SSH server on my server, or something in the AWS network gets reconfigured and the connection torn down, as I suspected initially. So I understand one could argue that this is NOT a Terraform bug per se. However, it must be possible to mitigate this problem.
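One possible mitigation, until the connection block supports keep-alives natively, is to drive SSH through local-exec with explicit client-side keep-alive options so the ssh client itself detects the dead peer. A sketch, with hypothetical resource name, host address, and script path:

```hcl
resource "null_resource" "install_with_keepalive" {
  provisioner "local-exec" {
    # With ServerAliveInterval=30 and ServerAliveCountMax=3, the ssh client
    # gives up roughly 90 seconds after the peer stops responding,
    # instead of hanging indefinitely.
    command = "ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=3 ec2-user@10.0.0.5 'sudo /tmp/run.sh'"
  }
}
```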
Terraform Version
0.8.8
Affected Resource(s)
null_resource (with file and remote-exec provisioners)
Terraform Configuration Files
```hcl
resource "null_resource" "prep_install" {
  connection {
    user        = "${var.ssh_user}"
    private_key = "${var.ssh_private_key}"
    timeout     = "10m"

    host                = "${element(aws_instance.instance.*.private_ip, count.index)}"
    bastion_host        = "${var.bastion_host}"
    bastion_user        = "${var.bastion_user}"
    bastion_private_key = "${var.bastion_private_key}"
  }

  provisioner "file" {
    source      = "${path.module}/scripts/"
    destination = "/tmp/"
  }

  provisioner "remote-exec" {
    inline = [
      "set -eu",
      "sudo chmod +x /tmp/*.sh",
      "sudo /tmp/run.sh",
    ]
  }
}
```
module.my.null_resource.prep_install.0 (remote-exec):
module.my.null_resource.prep_install.0: Still creating... (1m50s elapsed)
module.my.null_resource.prep_install.0: Still creating... (2m0s elapsed)
...
Expected Behavior
Within ~30 seconds after the connection is lost (no output or keepalive from remote-exec), the resource creation would fail, or an SSH reconnection would be attempted and the remote-exec resumed somehow.
Actual Behavior
The remote-exec seems to have completed while disconnected from the session; the resource creation failed after the specified timeout.
Steps to Reproduce
Unable to reproduce yet.
Important Factoids
AWS VPC