hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Remote_exec stops running commands #16031

Open SAPDanJoe opened 6 years ago

SAPDanJoe commented 6 years ago

When using the remote-exec provisioner, it sometimes stops in the middle of the script and exits. The only way I have found to get around this is to split the script into several null_resource blocks, so that the script is executed one piece at a time.

Terraform Version

Terraform v0.10.3

Terraform Configuration Files

This is what I would like to be able to do:

...
# Create a machine and assign it a floating IP, and assign it a chef user and password in the userdata
...

resource "null_resource" "Prepare_Chef" {
  count = "${ var.server_count }"
  depends_on =  ["openstack_compute_floatingip_associate_v2.attach_corp_ip"]
  connection {
    host = "${element(openstack_networking_floatingip_v2.get_corp_ip.*.address, count.index)}"
    type = "winrm"
    user = "chef"
    password = "${ var.chef_pass }"
    insecure = true
  }
  provisioner "remote-exec"  {
      inline = [
           "powershell -Command \". { iwr -useb https://omnitruck.chef.io/install.ps1 } | iex; install -version ${ var.chef_client_version }\"",
           "powershell -Command \" . iwr -useb https://github.com/git-for-windows/git/releases/download/v2.14.1.windows.1/Git-2.14.1-64-bit.exe -OutFile C:\\chef\\repo\\git_installer.exe\"",
           "c:\\chef\\repo\\git_installer.exe /SILENT /COMPONENTS=\"icons,ext\\reg\\shellhere,assoc,assoc_sh\"",
           "set path=%path%;c:\\opscode\\chef\\bin\\;c:\\opscode\\chef\\embedded\\bin\\",
           "setx path %path%",
           "gem install berkshelf",
           "berks install -b c:\\chef\\repo\\Berksfile",
           "set git_ssl_no_verify=true",
           "berks vendor c:\\users\\chef\\.berkshelf\\cookbooks -b c:\\chef\\repo\\Berksfile",
           "chef-client --local-mode -r recipe[${ var.cookbook }::${ var.recipe }] --config-option cookbook_path=c:\\users\\chef\\.berkshelf\\cookbooks"
      ]
  }
}

This is what I have that works, but I should be able to do all of this in one shot!

...
# Create a machine and assign it a floating IP, and assign it a chef user and password in the userdata
...

resource "null_resource" "Prepare_Chef" {
  count = "${ var.server_count }"
  depends_on =  ["openstack_compute_floatingip_associate_v2.attach_corp_ip"]
  connection {
    host = "${element(openstack_networking_floatingip_v2.get_corp_ip.*.address, count.index)}"
    type = "winrm"
    user = "chef"
    password = "${ var.chef_pass }"
    insecure = true
  }
  provisioner "remote-exec"  {
      inline = [
           "powershell -Command \". { iwr -useb https://omnitruck.chef.io/install.ps1 } | iex; install -version ${ var.chef_client_version }\"",
           "powershell -Command \" . iwr -useb https://github.com/git-for-windows/git/releases/download/v2.14.1.windows.1/Git-2.14.1-64-bit.exe -OutFile C:\\chef\\repo\\git_installer.exe\"",
           "c:\\chef\\repo\\git_installer.exe /SILENT /COMPONENTS=\"icons,ext\\reg\\shellhere,assoc,assoc_sh\"",
           "set path=%path%;c:\\opscode\\chef\\bin\\;c:\\opscode\\chef\\embedded\\bin\\",
           "setx path %path%",
           "gem install berkshelf",
           "berks install -b c:\\chef\\repo\\Berksfile"
      ]
  }
}

resource "null_resource" "Berks" {
  count = "${ var.server_count }"
  depends_on =  ["null_resource.Prepare_Chef"]
  connection {
    host = "${element(openstack_networking_floatingip_v2.get_corp_ip.*.address, count.index)}"
    type = "winrm"
    user = "chef"
    password = "${ var.chef_pass }"
    insecure = true
  }
  provisioner "remote-exec"  {
      inline = [
           "set git_ssl_no_verify=true",
           "berks vendor c:\\users\\chef\\.berkshelf\\cookbooks -b c:\\chef\\repo\\Berksfile"
      ]
  }
}

resource "null_resource" "Chef_solo" {
  count = "${ var.server_count }"
  depends_on =  ["null_resource.Berks"]
  connection {
    host = "${element(openstack_networking_floatingip_v2.get_corp_ip.*.address, count.index)}"
    type = "winrm"
    user = "chef"
    password = "${ var.chef_pass }"
    insecure = true
  }
  provisioner "remote-exec"  {
      inline = [
           "chef-client --local-mode -r recipe[${ var.cookbook }::${ var.recipe }] --config-option cookbook_path=c:\\users\\chef\\.berkshelf\\cookbooks"
      ]
  }
}

Expected Behavior

The remote-exec block should run all of the commands.

Actual Behavior

The remote-exec block stops partway through and shows no error.

Please let me know if there is some limitation or something that I am missing.

SAPDanJoe commented 6 years ago

I've just tested and found that I get the same behavior when using remote-exec to run the script as a file on the instance.

data "template_file" "provision" {
  template = "${file("${path.module}/templates/provision.bat.tpl")}"

  vars {
    chef_ver = "${ var.chef_client_version }"
    cookbook = "${ var.cookbook }"
    recipe = "${ var.recipe }"
  }
}
resource "null_resource" "prep_script" {
  count = "${ var.server_count }"
  depends_on =  ["openstack_compute_floatingip_associate_v2.attach_corp_ip"]
  connection {
    host = "${element(openstack_networking_floatingip_v2.get_corp_ip.*.address, count.index)}"
    type = "winrm"
    user = "chef"
    password = "${ var.chef_pass }"
    insecure = true
  }
  provisioner "file" {
    content      = "${ data.template_file.provision.rendered }"
    destination = "c:\\chef\\repo\\provision.bat"
  }
}

resource "null_resource" "Prepare_Chef" {
  count = "${ var.server_count }"
  depends_on =  ["null_resource.prep_script"]
  connection {
    host = "${element(openstack_networking_floatingip_v2.get_corp_ip.*.address, count.index)}"
    type = "winrm"
    user = "chef"
    password = "${ var.chef_pass }"
    insecure = true
  }
  provisioner "remote-exec"  {
    inline = [
      "c:\\chef\\repo\\provision.bat"
    ]
  }
}

victorb commented 6 years ago

I'm seeing this as well, not only on Windows but on Linux too.

resource "aws_instance" "linux" {
  security_groups             = ["${aws_security_group.jenkins_linux.name}"]
  ami                         = "${var.linux_ami}"
  instance_type               = "${var.linux_type}"
  associate_public_ip_address = true
  key_name                    = "victor-ssh-key"
  count                       = "${var.linux_count}"

  connection {
    type = "ssh"
    user = "ubuntu"
  }

  provisioner "file" {
    content     = "${data.template_file.jenkins_worker_service.rendered}"
    destination = "/tmp/swarm.service"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo apt update",
      "sudo apt install --yes wget htop default-jre",
      # TODO should be copied over instead of downloaded
      "cd /tmp && wget https://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/${var.swarm_version}/swarm-client-${var.swarm_version}.jar",
      "sudo mv /tmp/swarm.service /etc/systemd/system/swarm.service",
      "sudo systemctl start swarm",
    ]
  }
}

With this, the remote-exec provisioner doesn't execute the last two steps but instead exits successfully (???) after the wget.

victorb commented 6 years ago

Debug output:

aws_instance.linux (remote-exec): --2017-10-31 18:36:14--  https://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/3.6/swarm-client-3.6.jar
aws_instance.linux (remote-exec): Resolving repo.jenkins-ci.org (repo.jenkins-ci.org)... 130.211.20.35
aws_instance.linux (remote-exec): Connecting to repo.jenkins-ci.org (repo.jenkins-ci.org)|130.211.20.35|:443... connected.
aws_instance.linux (remote-exec): HTTP request sent, awaiting response... 200 OK
aws_instance.linux (remote-exec): Length: 1620623 (1.5M) [application/java-archive]
aws_instance.linux (remote-exec): Saving to: ‘swarm-client-3.6.jar’

aws_instance.linux (remote-exec):       swarm   0%       0  --.-KB/s
aws_instance.linux (remote-exec): swarm-clien 100%   1.54M  --.-KB/s    in 0.1s

aws_instance.linux (remote-exec): 2017-10-31 18:36:14 (12.1 MB/s) - ‘swarm-client-3.6.jar’ saved [1620623/1620623]

2017-10-31T18:36:14.848Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:14 remote command exited with '0': /tmp/terraform_682535596.sh
2017-10-31T18:36:14.848Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:14 opening new ssh session
2017-10-31T18:36:14.921Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:14 Starting remote scp process:  scp -vt /tmp
2017-10-31T18:36:14.995Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:14 Started SCP session, beginning transfers...
2017-10-31T18:36:14.996Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:14 Copying input data into temporary file so we can read the length
2017-10-31T18:36:14.998Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:14 Beginning file upload...
2017-10-31T18:36:15.072Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:15 SCP session complete, closing stdin pipe.
2017-10-31T18:36:15.072Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:15 Waiting for SSH session to complete.
2017-10-31T18:36:15.146Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:15 scp stderr (length 37): Sink: C0644 0 terraform_682535596.sh
aws_instance.linux: Creation complete after 1m4s (ID: i-0335fd31dbfa9d2df)

Apply complete! Resources: 4 added, 0 changed, 0 destroyed.
2017/10/31 18:36:15 [DEBUG] plugin: waiting for all plugin processes to complete...

Outputs:

linux_ips = [
    34.230.x.x
]
windows_ips = []
2017-10-31T18:36:15.177Z [DEBUG] plugin.terraform: file-provisioner (internal) 2017/10/31 18:36:15 [DEBUG] plugin: waiting for all plugin processes to complete...
2017-10-31T18:36:15.178Z [DEBUG] plugin.terraform: remote-exec-provisioner (internal) 2017/10/31 18:36:15 [DEBUG] plugin: waiting for all plugin processes to complete...
2017-10-31T18:36:15.180Z [DEBUG] plugin: plugin process exited: path=/usr/bin/terraform
2017-10-31T18:36:15.180Z [DEBUG] plugin: plugin process exited: path=/root/jenkins/worker/.terraform/plugins/linux_amd64/terraform-provider-aws_v1.1.0_x4
2017-10-31T18:36:15.182Z [WARN ] plugin: error closing client during Kill: err="unexpected EOF"
2017-10-31T18:36:15.182Z [DEBUG] plugin: plugin process exited: path=/usr/bin/terraform

I'm guessing the warning (plugin: error closing client during Kill: err="unexpected EOF") is the trouble, but I have no idea whether it's a problem with my configuration or with Terraform. For the record, the same steps did work on a different machine (my desktop at home, rather than a DO droplet).

victorb commented 6 years ago

After some more debugging, I tried downgrading both Terraform and the AWS provider, with no success; the issue currently happens under all conditions. What was working before on my desktop at home no longer works, which makes me believe that something changed on the AWS side of things.

YakDriver commented 6 years ago

This is probably the same issue as https://github.com/hashicorp/terraform/issues/15963. It appears that for all provisioners, on all platforms, once Terraform reports "Creation complete", the rest of the provisioning is halted. In my case, I have two inline commands in a remote-exec provisioner. The first command is supposed to block until an AWS userdata script completes and a signaling file is written to a temporary directory (this takes about 13 minutes). This approach is suggested by @calvn in https://github.com/hashicorp/terraform/issues/4668.
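
For readers unfamiliar with the approach, here is a minimal sketch of such a blocking wait. It is not the configuration from this report: the signal file path C:\Temp\setup-complete.txt, the 5-second interval, the second command, and the use of PowerShell are all assumptions made for illustration. Only the "Setup not complete. Retrying..." message is taken from the output in this comment. The fragment belongs inside a resource that already has a working connection block.

```hcl
# Hypothetical sketch of a "wait for the userdata signal file" provisioner.
# Assumes the userdata script creates C:\Temp\setup-complete.txt as its last step.
provisioner "remote-exec" {
  inline = [
    # First command: loop until the signal file exists, printing a retry message.
    "powershell -Command \"while (-not (Test-Path 'C:\\Temp\\setup-complete.txt')) { Write-Host 'Setup not complete. Retrying...'; Start-Sleep -Seconds 5 }\"",
    # Second command: runs only if the first one returns normally.
    "powershell -Command \"Write-Host 'Setup complete.'\"",
  ]
}
```

With a wait of this shape, the expectation is that Terraform keeps the resource in "Still creating..." until the loop exits and the second command has run.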

However, what actually happens is that the first inline remote-exec command never completes, as you can see below (the Setup not complete. Retrying... message comes from the first command). The second command is never called. As soon as the "Creation complete" message appears, the remote-exec is stopped and no more commands execute.

aws_instance.windows: Still creating... (11m0s elapsed)
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows: Still creating... (11m10s elapsed)
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows: Still creating... (11m20s elapsed)
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows (remote-exec): Setup not complete. Retrying...
aws_instance.windows: Creation complete after 11m25s (ID: i-0f7160b825268416g)

Apply complete! Resources: 5 added, 0 changed, 1 destroyed.

Outputs:
ami2016 = ami-0a792a70
[Container] 2018/01/11 16:12:49 Running command   
[Container] 2018/01/11 16:12:49 Running command   
[Container] 2018/01/11 16:12:49 Running command  
[Container] 2018/01/11 16:12:49 Running command   
[Container] 2018/01/11 16:12:49 Running command   
...
[Container] 2018/01/11 16:12:49 Phase complete: BUILD Success: true 
[Container] 2018/01/11 16:12:49 Phase context status code: Message: 

Arlington1985 commented 4 years ago

This is my solution:

provisioner "remote-exec" {
  inline = [
    "tail -f /var/log/cloud-init-output.log | sed '/SOME KEYWORD/q'"
  ]
}

connection {
  type        = "ssh"
  host        = "${aws_instance.instance.public_ip}"
  user        = "ec2-user"
  private_key = "${file(var.ssh_key_path)}"
}
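
One caveat with the tail | sed pattern: sed exits at the first match, but tail -f is normally only terminated when it next tries to write to the closed pipe (some newer tail implementations notice the closed pipe sooner). On some systems the command can therefore keep waiting if nothing is logged after the keyword. A possible alternative is to poll the log with grep and give up after a fixed number of attempts. The sketch below assumes the same log path and keyword as above; the 180 attempts and the 5-second sleep are arbitrary illustrative values, not something taken from this thread.

```hcl
# Hypothetical alternative: bounded polling instead of tail -f.
provisioner "remote-exec" {
  inline = [
    # Check the log every 5 seconds, up to 180 times (about 15 minutes).
    # On timeout, print a message to stderr and exit non-zero so the provisioner fails.
    "n=0; until grep -q 'SOME KEYWORD' /var/log/cloud-init-output.log 2>/dev/null; do n=$((n+1)); if [ $n -ge 180 ]; then echo 'timed out waiting for cloud-init' >&2; exit 1; fi; sleep 5; done",
  ]
}
```

Because the loop ends with exit 1 on timeout, the provisioner fails visibly instead of returning without the keyword ever having appeared.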
azt3k commented 1 year ago

I'm still seeing this issue with Terraform 1.3.2 and AWS provider 4.56.0.

In my scenario this happens seemingly at random, and if I run apply again it generally works.

I tried the keep-alive trick outlined in https://github.com/hashicorp/terraform/issues/18517 and it made no difference.