hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Error when instance changed that has EBS volume attached #2957

Closed: bloopletech closed this issue 7 years ago

bloopletech commented 9 years ago

This is the specific error I get from terraform:

aws_volume_attachment.admin_rundeck: Destroying...
aws_volume_attachment.admin_rundeck: Error: 1 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
Error applying plan:

3 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
* aws_instance.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.
* aws_volume_attachment.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

We are building out some infrastructure in EC2 using terraform (v0.6.0). I'm currently working out our persistent storage setup. The strategy I'm planning is to have the root volume of every instance be ephemeral, and to move all persistent data to a separate EBS volume (one persistent volume per instance). We want this to be as automated as possible of course.

Here is a relevant excerpt from our terraform config:

resource "aws_instance" "admin_rundeck" {
  ami = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"
  subnet_id = "${aws_subnet.admin_private.id}"
  vpc_security_group_ids = ["${aws_security_group.base.id}", "${aws_security_group.admin_rundeck.id}"]
  key_name = "Administration"

  root_block_device {
    delete_on_termination = false
  }

  tags {
    Name = "admin-rundeck-01"
    Role = "rundeck"
    Application = "rundeck"
    Project = "Administration"
  }
}

resource "aws_ebs_volume" "admin_rundeck" {
  size = 500
  availability_zone = "${var.default_aws_az}"
  snapshot_id = "snap-66fc2258"
  tags = {
    Name = "Rundeck Data Volume"
  }
}

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id = "${aws_ebs_volume.admin_rundeck.id}"

  depends_on = ["aws_route53_record.admin_rundeck"]

  connection {
    host = "admin-rundeck-01.<domain name>"
    bastion_host = "${aws_instance.admin_jumpbox.public_ip}"
    timeout = "1m"
    key_file = "~/.ssh/admin.pem"
    user = "ubuntu"
  }

  provisioner "remote-exec" {
    script = "mount.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo mkdir -m 2775 /data/rundeck",
      "sudo mkdir /data/rundeck/data /data/rundeck/projects && sudo chown -R rundeck:rundeck /data/rundeck",
      "sudo service rundeckd restart"
    ]
  }
}

And mount.sh:

#!/bin/bash

while [ ! -e /dev/xvdf ]; do sleep 1; done

fstab_string='/dev/xvdf /data ext4 defaults,nofail,nobootwait 0 2'
if ! grep -qF "$fstab_string" /etc/fstab; then
  echo "$fstab_string" | sudo tee -a /etc/fstab
fi

sudo mkdir -p /data && sudo mount -t ext4 /dev/xvdf /data

As you can see, this creates the instance, creates the persistent EBS volume from a snapshot, attaches the volume to the instance, and then mounts it and sets up the application directories via the provisioners.

This works fine the first time it's run. But problems appear any time we change the instance in a way that forces it to be replaced (for example, changing the AMI).

Terraform then tries to detach the existing volume from the instance, and this step fails every time. I believe this is because you are meant to unmount the EBS volume from inside the instance before detaching it. The problem is, I can't work out how to get Terraform to unmount the volume inside the instance before trying to detach it.

It's almost like I need a provisioner to run before the resource is created, or a provisioner to run on destroy (obviously https://github.com/hashicorp/terraform/issues/386 comes to mind).
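
For reference, destroy-time provisioners (the feature requested in #386) did eventually land in Terraform 0.9 and later. A minimal, untested sketch against the config above, assuming the /data mount point from mount.sh and reusing the connection settings already defined on the resource, might look like this:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # (connection block from the original resource omitted for brevity)

  # Hypothetical destroy-time step: unmount the filesystem inside the instance
  # before Terraform asks AWS to detach the volume.
  provisioner "remote-exec" {
    when   = "destroy"
    inline = ["sudo umount /data"]
  }
}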

This feels like it would be a common problem for anyone working with persistent EBS volumes using terraform, but my googling hasn't really found anyone even having this problem.

Am I simply doing it wrong? I'm not worried about how I get there specifically, I just would like to be able to provision persistent EBS volumes, and then attach and detach that volume to my instances in an automated fashion.

Gary-Armstrong commented 8 years ago

I mostly agree. We are using Chef. I'd really love to use Chef to manage AWS security groups (for example) because TF is awful at that, but there's been no progress on that at this time. EBS volumes, though, I would like to create during provisioning. I'm probably going to rewrite my code to have Chef do that instead of TF.

I'd also often like those EBS volumes to stay around during instance destruction. There are several reasons for this, one being that TF does enjoy resource destruction quite a lot. Of course, that comes about when I need to make an infrastructure change, so if TF is meant only for initial deployments, then what am I expected to use when I need to modify my infrastructure? Am I really expected to redeploy an entire set of large scientific computing instances? I would not like that, but I would dislike that activity less if I were able to reuse the multi-TB EBS volumes that I store data on.

james-masson commented 8 years ago

Immutable infrastructure with separate long-lived data disks is a great design pattern, and should definitely be encouraged. Kudos to Google Borg / Pivotal BOSH for making it more widespread.

My current workaround for this issue is to use Terraform to provision all the objects:

Instance, EBS volume (set to not allow destruction), IAM profile (to allow attachment of the two)...

... and then use a cloud-init bootstrap to have the instance securely associate its (restricted through IAM policy) EBS volume. Because Terraform has the overview of which instance is notionally tied to which EBS volume, it can set all the right metadata to make that relationship visible to cloud-init and userspace. Shutdown is a no-op, and instance destruction clears the Instance <> EBS relationship anyway.
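
A rough sketch of the Terraform side of that pattern, with hypothetical names: the volume id is published as an instance tag, and an instance profile (referenced via a variable here, policy not shown) would grant the instance ec2:DescribeTags and ec2:AttachVolume on its own volume so the cloud-init bootstrap can do the attach:

resource "aws_ebs_volume" "data" {
  availability_zone = "${var.default_aws_az}"
  size              = 500
}

resource "aws_instance" "app" {
  ami                  = "${var.aws_ami}"
  instance_type        = "${var.aws_instance_type}"
  # Profile allowing the instance to describe its tags and attach its volume.
  iam_instance_profile = "${var.attach_profile_name}"

  tags {
    Name = "app-01"
    # The cloud-init bootstrap reads this tag at boot to find its data volume.
    data_volume_id = "${aws_ebs_volume.data.id}"
  }
}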

It's really sad that the EBS association still has to be handled separately. The current aws_volume_attachment is pretty much useless for any infrastructure that needs to change regularly.

The ideal solution for me would be that aws_volume_attachment only activates when the instance is in the shutdown state (for both attach/detach), although I realise that this is non-trivial given the symmetric nature of terraform's create/destroy process.

Perhaps the concept of aws_volume_attachment as a separate resource is the wrong way to think about it? A tweak to the aws_instance resource's native EBS handling might be a better way to achieve this.

dvianello commented 8 years ago

Adding our bit: we're planning to use volumes as a way to send data in through a "data transfer VM" instance and then move it to a computing cluster, detaching the volume from the first instance and re-attaching it to another VM. That's probably not the most "cloud-friendly" approach, but it's the easiest way to get the ball rolling. Until now we were using the taint trick to solve this, but after 0.7 landed that doesn't work anymore (as @LeslieCarr said). We're now pretty much blocked, as we haven't found a workaround yet, apart from logging into the machine and unmounting the volume.

@maxenglander, any ETA on when your changes might land on a release? This is now critical to us!

Gary-Armstrong commented 8 years ago

I do agree with the group that says OS-layer actions are outside the scope of TF; this includes quiescing IO and unmounting filesystems. I've managed to avoid (so far) the need to move EBS volumes around, but it's a tactic I can see value in. Along that line, I've gotten a Chef cookbook to 90% which will do all the EBS deployment work (including LVM and fs setup), and I expect I'll use that as a base if I need to move things around. I'm only using TF to spec the ephemerals at this point.

Now what I really need is a cookbook that manages security groups, but that's a separate TF issue.

dvianello commented 8 years ago

Yes, I'm also fine with TF not messing with OS actions, but here we're talking about a bug/missing feature, I believe. It would just need to follow a slightly different behaviour when calling the AWS API, not do anything OS-level.

maxenglander commented 8 years ago

I fully concur with those (@charity, @Gary-Armstrong) who have voiced that TF is not well-equipped to perform volume detachments because detachments have extra dependencies (such as disk mounts, running processes) which TF doesn't know about. I agree that it's generally inadvisable for TF to perform OS-level actions like quiescing IO and unmounting fs.

However, I don't think it is bad (even if it's not ideal) to allow TF to manage volume attachments explicitly (via aws_volume_attachment), and to implicitly detach volumes by destroying the instances they are attached to. I think that this approach is compatible with the view that TF shouldn't perform volume detachments: by relying on instance destruction to detach volumes, TF effectively delegates volume detachment to AWS.

I also think that the TF model of failure is perfectly well equipped to handle problems that may crop up while using this approach. For example: if, during an instance destroy phase, a disk fails to unmount, then the instance may remain running, and the volume remains attached to it. TF sees that the instance failed to be destroyed, and stops execution so that the succeeding instance is never created, and volume re-attachment is not attempted. TF reports the failure to the user, who must retry by running terraform plan and terraform apply. There isn't any forced detachment, there's no disk corruption, and no un-synced state between TF and AWS.

While this approach may be less elegant and less robust than using a CM tool like Chef to handle everything EBS-related, it is, I believe, a simple, clean, and predictable solution. For users like myself who simply aren't ready to introduce a CM tool into their operations, it is also a practical solution.

@dvianello I have no idea if/when HashiCorp would incorporate my changes, unfortunately. I haven't created a PR yet, since it's not clear what HashiCorp's stance on this issue is.

I've been using my patch for a while now, which you're free to try out (at your own risk, of course) if you need a stop-gap while we wait for an official solution. I've created a release with binaries, in case that's helpful. To use it, first add "skip_detach": true and run plan and apply on any aws_volume_attachment for which you want to enable the new behavior, before trying to destroy and re-create instances.
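
For anyone trying that build, a minimal sketch of what enabling it might look like (skip_detach is an attribute from that patched fork, not stock Terraform, so the name and behavior are only what's described above; resource names borrowed from the original post):

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Patched-provider-only flag: don't call DetachVolume on destroy; rely on
  # the instance being destroyed to release the volume instead.
  skip_detach = true
}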

Gary-Armstrong commented 8 years ago

Agree @maxenglander that TF could manage attachments as you say. Entire post is agreeable, in fact. I don't want to get off on a TF wish list, but it seems entirely reasonable to expect TF to detach and potentially preserve EBS when an instance is terminated.

visit1985 commented 8 years ago

I like 5c09bcc from @c4milo. I've tested it in our environment for some days now. It's the best solution for this issue; I suggest cherry-picking that one.

jasonmoo commented 8 years ago

How is this still unsolved after a year?

gtmtech commented 8 years ago

Although the volume attachment resources above might not work, we have the whole thing working a slightly different way (although it's using AWS): we define an aws_instance and an aws_ebs_volume, with no attachment information, but we tag the aws_instance with the id of the aws_ebs_volume resource.

Then on instance boot-up, we read the tag and attach and mount the disk. On instance shutdown we do the reverse (although you don't need to).

It all works fine: change the details of the instance and everything detaches and reattaches as intended, in the immutable-infra way.

Sure, it would be nice to have it in Terraform, but you don't need it to get the basics working.
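
A minimal sketch of the boot-time attach step described above, assuming a hypothetical volume_id instance tag written by Terraform, an instance profile that allows ec2:DescribeTags and ec2:AttachVolume, the AWS CLI installed on the instance, and the /dev/xvdf device used earlier in the thread:

#!/bin/bash
set -e

# Discover this instance's id and region from instance metadata.
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
region=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')

# Read the (hypothetical) volume_id tag that Terraform set on this instance.
volume_id=$(aws ec2 describe-tags --region "$region" \
  --filters "Name=resource-id,Values=$instance_id" "Name=key,Values=volume_id" \
  --query 'Tags[0].Value' --output text)

# Attach the tagged volume, wait for the device to appear, then mount it.
aws ec2 attach-volume --region "$region" --volume-id "$volume_id" \
  --instance-id "$instance_id" --device /dev/xvdf
while [ ! -e /dev/xvdf ]; do sleep 1; done
sudo mkdir -p /data && sudo mount /dev/xvdf /data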

crolek commented 8 years ago

We have also tested @c4milo's commit https://github.com/hashicorp/terraform/commit/5c09bcc1debafd895423e1e2df0c5da4930468bc on our setup and have had great results in resolving our problem. We're going to keep using this patch until it hopefully gets merged.

@c4milo thank you for adding this!

razvanm commented 8 years ago

I'm also hitting this issue. @c4milo: have you sent a PR with https://github.com/hashicorp/terraform/commit/5c09bcc1debafd895423e1e2df0c5da4930468bc?

c4milo commented 8 years ago

I did send https://github.com/hashicorp/terraform/pull/5364, but closed it since it isn't the ideal solution to this problem, as discussed in that thread.

mitchellh commented 7 years ago

This is pretty much the same as #2761, I'm sure there are other places this is being tracked too... going to close this one. (The reference here will link them, too)

redbaron commented 7 years ago

@mitchellh, arguably this issue has the bigger community and should be considered the main point of contact for tracking all the dependency problems that can't be expressed with the simplistic graph model TF is currently using.

#2761 is a valid issue too, but it has only 5 comments and 9 subscribers; it's a strange choice to keep that one and close this.

carterjones commented 7 years ago

I know this thread was closed in favor of #2761, but given that that issue is still open, I wanted to leave this here for anyone else still experiencing this particular issue.

I was able to set skip_destroy to true on the volume attachment to solve this issue. Details here: https://www.terraform.io/docs/providers/aws/r/volume_attachment.html#skip_destroy

Note: in order for it to work, I had to do the following:

1. set skip_destroy to true on the volume attachment
2. run terraform apply
3. make the other changes to the instance that caused it to be terminated/recreated (changing the AMI in my case)
4. run terraform apply again

Leaving this here in case anyone else finds it useful.
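
For reference, a minimal sketch of that attachment with skip_destroy set, borrowing the resource names from the original post:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Don't try to detach the volume at destroy time; just remove the attachment
  # from Terraform state and let instance termination handle the actual detach.
  skip_destroy = true
}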

mpalmer commented 7 years ago

I can't get the above workaround to do the trick using 0.10.6. It looks like whatever bug was being exploited to make this work has been fixed.

Gary-Armstrong commented 7 years ago

I'm still only provisioning ephemerals in TF.

In fact, I am specifying four of them for every instance, every time. I then have some Ruby/Chef code that determines how many are really there (0-4) and does what's needed to partition, LVM-stripe, and mount them as a single ext4 filesystem.

I still use Chef to config all EBS from creation to fs mount. Works great. EBS persist unless defined otherwise. Mentally assigning all volume management to the OS arena has gotten me where I want to be.

exolab commented 7 years ago

This is still an issue 26 months after the issue was first created.

c4milo commented 7 years ago

@exolab, it is not. You need to use destroy-time provisioners in order to unmount the EBS volume.

exolab commented 7 years ago

Sorry if I am a bit daft. How so?

Is this what you are suggesting?

provisioner "remote-exec" {
    inline = ["umount -A"]

    when   = "destroy"
  }
Mykolaichenko commented 7 years ago

Like @mpalmer, the skip_destroy fix is not working for me with Terraform 0.10.6 😞

GarrisonD commented 6 years ago

The skip_destroy fix does not work with Terraform 0.11.1 😢

smastrorocco commented 6 years ago

+1

OneSpecialPeg commented 6 years ago

Still an issue (and a big issue for us) in v0.11.3

jangrewe commented 6 years ago

Still an issue in v0.11.4

devsecops-dba commented 6 years ago

Terraform v0.11.7: I have the same issue with aws_volume_attachment when running destroy; skip_destroy = true on the volume attachment resource is not helping either, as destroy keeps trying. I went ahead and force-detached the volume from the console, and then the destroy moved forward. Is there a default timeout for TF? The destroy kept running, trying to detach the EBS volume, until I Ctrl-C'd out of it.

mmacdermaid commented 6 years ago

On Terraform v0.11.7 I was able to get around this by creating the volume attachment with

force_detach = true

If you created it without force_detach set to true, it will still fail. I had to terminate the instance and allow the volume attachment to be edited or recreated with force_detach; after that, all subsequent detaches work for me.
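
A minimal sketch of that, again borrowing names from the original post (note that force-detaching a volume that is still mounted risks data loss or a corrupted filesystem):

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Force the detach at destroy time even if the volume is still mounted;
  # anything still writing to the volume can lose data.
  force_detach = true
}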

davidvuong commented 5 years ago

Using force_detach = true worked for me as well (v0.11.7).

I originally created the volume without force_detach, so I had to go manually force-detach it in the AWS console, then delete the volume (in Terraform) and re-create it (also in Terraform) before it worked.

JasonGilholme commented 5 years ago

Still an issue.

Is there any issue with using force_detach? I'm assuming that processes could still be trying to use the volume. Is there a way to stop the instance prior to detaching the volume, and then terminate it?

aaronpi commented 5 years ago

> Still an issue.
>
> Is there any issue with using force_detach? I'm assuming that processes could still be trying to use the volume. Is there a way to stop the instance prior to detaching the volume, and then terminate it?

I know this issue is closed, but as an example workaround for anyone finding this, I'll post what I've done. I have a volume I want to persist between machine rebuilds (it gets rebuilt from a snapshot if deleted, but otherwise persists). What I did was grab the old instance id in TF via a data source, then use a local-exec provisioner (I can't use remote-exec because of how direct access to the machine is gated) to call the AWS CLI and shut down the machine the volume is being detached from, before the machine and the volume attachment are destroyed and rebuilt:

//data source to get previous instance id for TF workaround below
data "aws_instance" "example_previous_instance" {
  filter {
    name = "tag:Name"
    values = ["${var.example_instance_values}"]
  }
}

//volume attachment
resource "aws_volume_attachment" "example_volume_attachment" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.example_volume.id}"
  instance_id = "${aws_instance.example_instance.id}"
  //below is a workaround for TF not detaching volumes correctly on rebuilds.
  //additionally the 10 second wait is too short for detachment and force_detach is ineffective currently
  //so we're using a workaround: using the AWS CLI to gracefully shutdown the previous instance before detachment and instance destruction
  provisioner "local-exec" {
    when   = "destroy"
    command = "ENV=${var.env} aws ec2 stop-instances --instance-ids ${data.aws_instance.example_previous_instance.id}"
  }
}
ghost commented 5 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.