hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Not Able To Install/Run During Terraform Apply #33230

Closed aronchick closed 1 year ago

aronchick commented 1 year ago

Terraform Version

Terraform v1.4.6
on darwin_arm64

Terraform Configuration Files

resource "aws_instance" "instance" {
  ami                    = var.instance_ami
  instance_type          = var.instance_type
  subnet_id              = var.subnet_public_id
  vpc_security_group_ids = var.security_group_ids
  key_name               = var.key_pair_name
  availability_zone      = var.availability_zone

  tags = {
    App  = var.app_tag
    Name = "${var.app_tag}-instance"
  }

  // Copy everything in node_files to the remote instance; the remote-exec below moves it into /node
  provisioner "file" {
    connection {
      host        = self.public_dns
      port        = 22
      user        = "ubuntu"
      private_key = var.private_key_content
    }

    source      = "node_files"
    destination = "/home/ubuntu/node_files"
  }

  provisioner "remote-exec" {
    connection {
      host        = self.public_dns
      port        = 22
      user        = "ubuntu"
      private_key = var.private_key_content
    }

    inline = [
      "sudo mkdir /node",
      "sudo mv /home/ubuntu/node_files/* /node",
      "sudo rm -rf /home/ubuntu/node_files",
      "sudo chmod +x /node/*.sh",
      "sudo mkdir /data",
      "sudo mv /node/bacalhau.service /etc/systemd/system",
      "echo \"$USER ALL=(ALL:ALL) NOPASSWD: ALL\" | sudo tee \"/etc/sudoers.d/dont-prompt-$USER-for-sudo-password\"",
      "sudo mv /node/sshd_config /etc/ssh/sshd_config",
      "sudo systemctl restart sshd",
      "echo 0",
    ]
  }
}

resource "aws_eip" "instanceeip" {
  vpc      = true
  instance = aws_instance.instance.id

  tags = {
    App = var.app_tag
  }
}

resource "null_resource" "copy-to-node-if-worker" {
  count = var.bootstrap_region == var.region ? 0 : 1

  connection {
    host        = aws_eip.instanceeip.public_ip
    port        = 22
    user        = "ubuntu"
    private_key = var.private_key_content
  }

  provisioner "file" {
    destination = "/etc/bacalhau-bootstrap"
    content     = file(var.bacalhau_run_file)
  }
}

resource "null_resource" "install_tailscale_and_join" {
  depends_on = [aws_instance.instance, null_resource.copy-to-node-if-worker]

  connection {
    host        = aws_instance.instance.public_dns
    port        = 22
    user        = "ubuntu"
    private_key = var.private_key_content
    timeout     = 600
  }

  // Run install-tailscale.sh on the remote instance and bring the Tailscale interface up
  provisioner "remote-exec" {
    inline = [
      "sudo chmod +x /node/install-tailscale.sh",
      "sudo /node/install-tailscale.sh ${aws_instance.instance.public_dns}",
      "sudo tailscale up",
      "echo 0",
    ]

  }
}

Debug Output

[... many lines above ...]
module.instanceModule.null_resource.install_tailscale_and_join: Provisioning with 'remote-exec'...
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec): Connecting to remote host via SSH...
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   Host: ec2-35-182-12-176.ca-central-1.compute.amazonaws.com
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   User: ubuntu
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   Password: false
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   Private key: true
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   Certificate: false
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   SSH Agent: true
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   Checking Host Key: false
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec):   Target Platform: unix
module.instanceModule.null_resource.install_tailscale_and_join (remote-exec): Connected!
module.instanceModule.aws_eip.instanceeip: Creation complete after 1s [id=eipalloc-0b50a293c13823180]
module.instanceModule.null_resource.install_tailscale_and_join: Still creating... [10s elapsed]
module.instanceModule.null_resource.install_tailscale_and_join: Still creating... [20s elapsed]
module.instanceModule.null_resource.install_tailscale_and_join: Still creating... [30s elapsed]
module.instanceModule.null_resource.install_tailscale_and_join: Still creating... [40s elapsed]
╷
│ Error: remote-exec provisioner error
│
│   with module.instanceModule.null_resource.install_tailscale_and_join,
│   on modules/instance/resources.tf line 88, in resource "null_resource" "install_tailscale_and_join":
│   88:   provisioner "remote-exec" {
│
│ Error starting script: EOF

Expected Behavior

The script should have run

Actual Behavior

The script exited with 0, but the terraform apply did not complete.

Steps to Reproduce

terraform apply

Additional Context

I am trying to install and run Tailscale during a terraform apply (I do not think this is a Tailscale-related issue; it is just the example here).

The modules/instance/resources.tf file is the same configuration shown above under Terraform Configuration Files.

Terraform's documentation recommends against remote-exec and other provisioners in favor of baking images with Packer, but due to the nature of this deployment I would prefer to do these installations this way.
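For context, a minimal sketch of what the Packer-based alternative might look like is below; the source AMI, instance type, and image name are placeholders, and the shell steps are abbreviated from the inline commands above:

packer {
  required_plugins {
    amazon = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

source "amazon-ebs" "node" {
  region        = "ca-central-1"
  source_ami    = "ami-00000000000000000" # placeholder base Ubuntu AMI
  instance_type = "t3.micro"              # placeholder size for the build
  ssh_username  = "ubuntu"
  ami_name      = "node-image-{{timestamp}}"
}

build {
  sources = ["source.amazon-ebs.node"]

  # Bake node_files and the Tailscale installer into the image instead of
  # provisioning them at apply time.
  provisioner "file" {
    source      = "node_files"
    destination = "/home/ubuntu/node_files"
  }

  provisioner "shell" {
    inline = [
      "sudo mkdir /node",
      "sudo mv /home/ubuntu/node_files/* /node",
      "sudo chmod +x /node/*.sh",
    ]
  }
}

Even with a baked image, the tailscale up / join step would presumably still have to run per instance (for example via user data), since it is instance-specific.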

The closest I got in my research was https://github.com/hashicorp/terraform/issues/18517, which feels correct, but adding the keep-alive did not seem to help at all.

Not sure what to do at this point; the script runs fine when I log in and run it manually.

References

No response

jbardin commented 1 year ago

Hi @aronchick,

If you have confirmed that /node/install-tailscale.sh is transferred correctly, then it seems there is something incorrect in the remote-exec script. It may help to add set -o errexit to your script, especially since you finish with echo 0, which ensures that any exit status from the prior command is lost.
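Applied to the install_tailscale_and_join resource above, that would look roughly like the sketch below (the same commands as in the report, with echo 0 dropped so the last real exit status is what the provisioner sees):

  provisioner "remote-exec" {
    inline = [
      # remote-exec joins these lines into a single script, so errexit
      # applies to every command that follows it.
      "set -o errexit",
      "sudo chmod +x /node/install-tailscale.sh",
      "sudo /node/install-tailscale.sh ${aws_instance.instance.public_dns}",
      "sudo tailscale up",
    ]
  }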

More clues may be found in the sshd debug logs, which could verify what commands were received and executed. The fact that you are sending systemctl restart sshd to the server before connecting and running another command may be interfering as well.

Since the execution of the remote commands is outside of the control of Terraform, it's unlikely that there's anything we can fix here. The community forum may have others who can help debug your problem, and if in doing so you do find an issue caused by Terraform, feel free to file a new bug report with the new information to replicate the problem.

Thanks!

aronchick commented 1 year ago

I appreciate the feedback - due to Terraform's issues, I ended up rearchitecting it. I guess the most fundamental thing is that "exit 0" SEEMS like it should have been a good thing for Terraform. What's the alternative?

jbardin commented 1 year ago

I'm not sure what the intent of exit 0 would be there. It either does nothing, because the default exit code is already 0, or it hides an error from the previous command; neither of those seems helpful. A more robust script would ensure that each individual command succeeds in full, which can be done bluntly with set -o errexit (though the script may not have been the problem at all - that would need to be debugged on the server).
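As a sketch of the more explicit version, the steps can also be chained with && so the provisioner reports the exit status of whichever command fails:

  provisioner "remote-exec" {
    inline = [
      # && stops at the first failing command, and that command's exit code
      # is what terraform apply reports.
      "sudo chmod +x /node/install-tailscale.sh && sudo /node/install-tailscale.sh ${aws_instance.instance.public_dns} && sudo tailscale up",
    ]
  }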

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.