hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

null_resource should taint one or more of its triggers if the null_resource fails to apply. #24675

Open Houlistonm opened 4 years ago

Houlistonm commented 4 years ago

Current Terraform Version

v0.12.20

Use-cases

I have a module that launches our AWS hosts and configures them via Cloud-Init.

We've been experiencing some failures with Cloud-Init, so we initially added a remote-exec provisioner to the instance that waits for Cloud-Init to finish. Its exit status would then pass or fail the Terraform apply.

Unfortunately, that caused a deadlock: our Cloud-Init needed a volume attached to complete, and Terraform couldn't attach the volumes while it was waiting on Cloud-Init to complete.
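For illustration, a rough sketch of what that first attempt might have looked like. This is a reconstruction that reuses the variable names from the module shown under Attempted Solutions, not the exact original code. Terraform does not consider a resource created until its creation-time provisioners have returned, so anything that depends on the instance has to wait for the provisioner.

```hcl
resource "aws_instance" "instance" {
  count = var.instance_count

  ami           = var.ami_id
  instance_type = var.instance_type
  user_data     = var.user_data

  # The instance is not "created" from Terraform's point of view
  # until this provisioner returns.
  provisioner "remote-exec" {
    inline = [
      "sudo cloud-init status --wait > /dev/null",
    ]

    connection {
      type         = "ssh"
      user         = "bot"
      host         = self.private_ip
      timeout      = var.ssh_timeout
      private_key  = var.bot_key_pem
      bastion_host = var.bastion_host
    }
  }
}

# Depends on aws_instance.instance, so it only starts after the
# provisioner above has finished. If cloud-init itself is waiting
# for this volume, neither side can make progress.
resource "aws_volume_attachment" "volume_attachment" {
  count       = var.volume_ids == null ? 0 : length(var.volume_ids)
  instance_id = element(aws_instance.instance.*.id, count.index)
  volume_id   = element(var.volume_ids, count.index)
  device_name = var.device_name
}
```

Moving the wait into a separate null_resource breaks this cycle because the instance and the volume attachment can both complete before the check starts.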

Our solution was to use a null_resource, and this works for all of our use cases with one side-effect.

If the null_resource fails, it doesn't taint the instance. So on the next Terraform run, only the null_resource is re-created, and it fails again because the [broken] host still exists.

This causes problems in our CI/CD pipeline.
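The behaviour can be reproduced without any AWS resources. The sketch below is a minimal, hypothetical stand-in: `null_resource.host` plays the role of the instance and `null_resource.check` plays the role of the Cloud-Init check. Both resource names are made up for this example, and it only assumes the null provider is available.

```hcl
# Stand-in for aws_instance.instance (hypothetical name).
resource "null_resource" "host" {}

# Stand-in for null_resource.cloud_init_status (hypothetical name).
resource "null_resource" "check" {
  triggers = {
    host_id = null_resource.host.id
  }

  provisioner "local-exec" {
    # Always fails, simulating a failed cloud-init status check.
    command = "exit 1"
  }
}
```

After a failed apply, Terraform marks the resource that owns the failed creation-time provisioner (`null_resource.check`) as tainted. `null_resource.host` is left untouched, so the next apply replaces only the check and fails the same way.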

I asked for ideas on discuss.

Attempted Solutions

resource "aws_instance" "instance" {
  count = var.instance_count

  ami                  = var.ami_id
  instance_type        = var.instance_type
  user_data            = var.user_data
  iam_instance_profile = var.iam_instance_profile

  // omitted many attributes to save space
}

resource "aws_volume_attachment" "volume_attachment" {
  count        = var.volume_ids == null ? 0 : length(var.volume_ids)
  skip_destroy = true
  instance_id  = element(aws_instance.instance.*.id, count.index)
  volume_id    = element(var.volume_ids, count.index)
  device_name  = var.device_name
}

resource "null_resource" "cloud_init_status" {
  count = var.bot_key_pem != null ? var.instance_count : 0

  triggers = {
    instance_id = element(aws_instance.instance.*.id, count.index)
  }

  provisioner "remote-exec" {
    inline = [
      "echo \"Running cloud-init status --wait > /dev/null\"",
      "sudo cloud-init status --wait > /dev/null",
      "sudo cloud-init status --long"
    ]

    connection {
      user         = "bot"
      host         = element(aws_instance.instance.*.private_ip, count.index)
      timeout      = var.ssh_timeout
      private_key  = var.bot_key_pem
      bastion_host = var.bastion_host
    }
  }
}

Proposal

I'm proposing that null_resource should support a new argument that would taint another resource when the null_resource fails. Ideally, this should be limited to one of the resources that triggered the null_resource.

resource "null_resource" "cloud_init_status" {
  count = var.bot_key_pem != null ? var.instance_count : 0

  triggers = {
    instance_id = element(aws_instance.instance.*.id, count.index)
  }

  taint_on_failure = {
    instance_id = element(aws_instance.instance.*.id, count.index)
  }

  provisioner "remote-exec" {
    inline = [
      "echo \"Running cloud-init status --wait > /dev/null\"",
      "sudo cloud-init status --wait > /dev/null",
      "sudo cloud-init status --long"
    ]

    connection {
      user         = "bot"
      host         = element(aws_instance.instance.*.private_ip, count.index)
      timeout      = var.ssh_timeout
      private_key  = var.bot_key_pem
      bastion_host = var.bastion_host
    }
  }
}

Houlistonm commented 4 years ago

Ping. Almost 60 days without feedback or comments.

Houlistonm commented 4 years ago

@mitchellh Apologies if direct contact is considered inappropriate. You're the #1 on the contribution list for this repository. This is a continuing problem for our team, and I would welcome the opportunity to discuss it. Thank you. Mark.

Houlistonm commented 2 years ago

This is still important to our project. Can somebody please look at this?