hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/
Other
42.31k stars 9.49k forks source link

All `vpc_security_group_ids` being removed when updating instance Security Groups #9043

Closed scottgeary closed 7 years ago

scottgeary commented 7 years ago

When trying to modify AWS Security Groups on an already Terraformed EC2 instance, it seems to forget about all existing SGs (even though they appear in the statefile), and the plan output initially looks legit.

We originally found this issue, when trying to add a brand-new SG. Terraform removed all existing SGs (and caused some chaos!), but successfully added the new one. We've since reduced the problem down to just removing an existing SG, or somewhere parsing the existing SGs from the vpc_security_group_ids field.

In the following example:

Terraform v0.7.4

Affected Resource(s)

Existing Statefile showing existing 4 SGs:

                "aws_instance.ganglia-a": {
                    "type": "aws_instance",
                    "depends_on": [
                        "aws_security_group.CCCCCC",
                        "aws_security_group.DDDDDD",
                        "aws_security_group.AAAAAA",
                        "aws_subnet.blah-public",
                        "data.template_file.ganglia-cloudinit"
                    ],
                    "primary": {
                        "id": "i-INSTANCE-ID-HERE",
                        "attributes": {
                            "ami": "ami-AMI-ID",
                            "associate_public_ip_address": "false",
                            "availability_zone": "us-west-2a",
                            "disable_api_termination": "false",
                            "ebs_block_device.#": "0",
                            "ebs_optimized": "false",
                            "ephemeral_block_device.#": "0",
                            "iam_instance_profile": "",
                            "id": "i-INSTANCE-ID-HERE",
                            "instance_state": "running",
                            "instance_type": "m4.large",
                            "key_name": "dom",
                            "monitoring": "false",
                            "network_interface_id": "eni-xxxxxx",
                            "private_dns": "blah.compute.internal",
                            "private_ip": "SOME_IP",
                            "public_dns": "blah.compute.amazonaws.com",
                            "public_ip": "SOME_IP",
                            "root_block_device.#": "1",
                            "root_block_device.0.delete_on_termination": "true",
                            "root_block_device.0.iops": "192",
                            "root_block_device.0.volume_size": "64",
                            "root_block_device.0.volume_type": "gp2",
                            "security_groups.#": "0",
                            "source_dest_check": "true",
                            "subnet_id": "subnet-ID",
                            "tags.%": "2",
                            "tags.Cost Center": "Platform",
                            "tags.Name": "ganglia",
                            "tenancy": "default",
                            "user_data": "SOME_HASH_HERE",
                            "vpc_security_group_ids.#": "4",
                            "vpc_security_group_ids.1389827218": "sg-AAAAAA",
                            "vpc_security_group_ids.258711328": "sg-DDDDDD",
                            "vpc_security_group_ids.4034046821": "sg-CCCCCCC",
                            "vpc_security_group_ids.4249215419": "sg-BBBBBB"
                        },
                        "meta": {
                            "schema_version": "1"
                        },
                        "tainted": false
                    },
                    "deposed": [],
                    "provider": ""
                }

Existing Terraform file:

# ganglia
resource "aws_instance" "ganglia-a" {
  ami = "${var.ubuntu_ami_id}"
  availability_zone = "${var.vpc_region}a"
  instance_type = "m4.large"
  key_name = "dom"
  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.BBBBBB.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}" ]
  subnet_id = "${aws_subnet.blah-public.0.id}"
  user_data = "${data.template_file.SOMEDATA.rendered}"

  lifecycle {
    ignore_changes = ["user_data"]
  }

  root_block_device {
    volume_type = "gp2"
    volume_size = "${var.default_volume_size}"
  }

  provisioner "local-exec" {
    command = "../../../scripts/autosign ${var.foreman_proxy_url} ganglia-${replace(aws_instance.ganglia-a.id, "i-", "")}.${var.vpc_domain}"
  }

  tags {
    Name = "ganglia"
    "Cost Center" = "Platform"
  }
}

Actual change we're trying to apply (removal of BBBBBB)

-  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.BBBBBB.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}"]
+  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}"]

Terraform Plan

~ aws_instance.ganglia-a
    vpc_security_group_ids.#:          "4" => "3"
    vpc_security_group_ids.4249215419: "sg-BBBBBB" => ""

Debug Output

kwilczynski commented 7 years ago

@scottgeary hi there! I am very sorry that you are affected by this!

We are having a look, as this is a very hard to catch issue, as I personally could not reproduce it successfully. I will keep you posted once we find out what might be creeping there!

rafaelmagu commented 7 years ago

I walked @stack72 through this on my machine, but we still couldn't figure out why. Does not appear to be an API-limit-related issue.

br0ch0n commented 7 years ago

@scottgeary After two days of head banging I just realized that I'm having this exact same problem. Mine manifested when I tried to add 1 SG to an existing array of 4. Repeated applies flip-flop between just the new 1 and just the original 4. When doing a delete it's the same behavior except aws won't let you send it 0 SGs. Otherwise yours would flip-flop too between setting your instance to 4 then 0 then 4 then 0.

I believe it's yet another ignore_changes bug. I had just recently added ignore_changes['user_data'] to my resource just like above. I thought it was working now (to be fair, it does ignore the user_data now) :(

One can reliably reproduce the issue like below. First, create an instance:

resource "aws_instance" "mystery_test" {
  provider = "aws.mystery"
  instance_type = "t2.nano"
  user_data = "foo"
  lifecycle {
    ignore_changes = [ "user_data" ]
  }
  vpc_security_group_ids = ["sg-AAAAAAAA", "sg-BBBBBBBB"]
  ami       = "ami-a1288ec2"
  subnet_id = "${var.subnetid}"
}

Then, change user_data "foo" to "bar" and also add a new SG to the array.

user_data = "bar"
  lifecycle {
    ignore_changes = [ "user_data" ]
  }
  vpc_security_group_ids = ["sg-AAAAAAAA", "sg-BBBBBBBB", "sg-CCCCCCCC]

a plan will give you strange output:

~ module.mysterymod.aws_instance.mystery_test
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.3333333333: "" => "sg-CCCCCCCC"

Note how the plan doesn't show the expected

vpc_security_group_ids.1111111111: "sg-AAAAAAAA" => "sg-AAAAAAAA"
vpc_security_group_ids.222222222: "sg-BBBBBBBB" => "sg-BBBBBBBB"

entries. This is bad :) In the trace log of the apply, you'll see it POST just

Action=ModifyInstanceAttribute&GroupId.1=sg-CCCCCCCC&InstanceId=i123455678&Version=2016-09-15

The action should be all 3 groups (GroupId.1 and GroupId.2 and GroupId.3) What seems to be happening is that ignore_changes catches the user_data change (foo => bar) which then makes it ignore other diffs (or at least mess things up)

2016/10/26 14:35:37 [DEBUG] root.mystery: eval: *terraform.EvalDiff
2016/10/26 14:35:37 [DEBUG] Removing 'id' diff and setting Destroy to false because after ignore_changes, this diff no longer requires replacement
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: user_data
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: vpc_security_group_ids.11111111
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: security_groups.#
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: root_block_device.#
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: ebs_block_device.#
...

If you then remove the ignore_changes line and run a plan at this point, you'll get a plan that of course wants to recreate the instance, but the sg change will at least look correct:

-/+ module.mystery.aws_instance.mystery
    ami:                               "ami-a1288ec2" => "ami-a1288ec2"
...
    user_data:                         "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33" => "62cdb7020ff920e5aa642c3d4066950dd1f01f4d" (forces new resource)
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.3333333333: "" => "sg-CCCCCCCC"
    vpc_security_group_ids.11111111111: "sg-AAAAAAAA" => "sg-AAAAAAAA"
    vpc_security_group_ids.2222222222: "sg-BBBBBBBB" => "sg-BBBBBBBB"

I of course added ignore_changes recently because I needed to update my user_data without reprovisioning lots of nodes, thus I'm now stuck in a state where I get to pick between adding a SG and reprovisioning lots of nodes :(

I will add that as a workaround, you can manually attach the new SG to your instance via console/cli and Terraform will not complain about it. Probably works for removing one also.

Anyway, seems ignore_changes needs to be examined further. Hope this helps! @phinze

br0ch0n commented 7 years ago

Just to add a bit more proof, I played a bit last night with eval_diff.go and tried (around line 240) :

   // If we didn't hit any of our early exit conditions, we can filter the diff.
    for k := range ignorableAttrKeys {
        if !(strings.Contains(k, "vpc_security_group_ids")) {
            log.Printf("[DEBUG] [EvalIgnoreChanges] %s - Ignoring diff attribute: %s",
                n.Resource.Id(), k)
            diff.DelAttribute(k)
        }
    }

running this against my earlier example resource results in the correct plan (didn't try applying). This suggests that one approach might be to make sure we're not deleting diff attributes that we care about in this particular run. Note that if I run a plan using the above hacked TF against other resources that have a pending recreate due to user data, are blocked from doing so by ignore_changes, but have no other changes (vpc_security_group_ids or otherwise), I get a useless diff in my plan for all of them, e.g.

"A" => "A"
"B" => "B"
"C" => <computed>

Looking further it seems the problem is a bit earlier where we do:

   if changeType == DiffDestroyCreate {
        for k, v := range diff.CopyAttributes() {
            if v.Empty() || v.NewComputed {
                ignorableAttrKeys[k] = true
            }
        }

problem being that this ends up deleting part of our diff, e.g. A => A, since that's matched by .Empty(). This is likely a problem for all list types of attributes. Seems that we should look for the existence of a parent counter key, e.g. "vpc_security_group_ids.#" and if it's changing, we should exclude all child entries from the ignorableAttrKeys map.

br0ch0n commented 7 years ago

What we actually need to do is check if any siblings are changing. Potential fix in #9791

ekristen commented 7 years ago

I'm seeing this bug too, I'm adding a new security group to the instance and the original 2 are disappearing and it everytime I run plan it flips flops back and forth.

~ aws_instance.management.0
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.1111564906: "" => "sg-aaaa4197"

then

~ aws_instance.management.0
    vpc_security_group_ids.#:          "1" => "3"
    vpc_security_group_ids.111142371:  "" => "sg-aaaa417a"
    vpc_security_group_ids.1111450837: "" => "sg-aaaa4184"
Gary-Armstrong commented 7 years ago

I'm seeing this as well. Earlier in the week it was that a vpc_security_group_ids entry was not being removed, and so I removed it via console. The plan output looked much like ekristen's comment. This was relatively benign and so I didn't bother to enter an issue on it.

Today however, this same issue had some destructive behavior. It removed all SG except the one I was adding; luckily I was doing this on a dev machine. But I must proceed with modification of SG across many more instances.

Early warning from plan:

~ aws_instance.compute_100_dev
    vpc_security_group_ids.#:         "3" => "4"
    vpc_security_group_ids.423628628: "" => "sg-asdfasdf"

Count goes to 4 so I'm sure everything will be OK and so I apply.

aws_instance.compute_100_dev: Modifying...
  vpc_security_group_ids.#:         "3" => "4"
  vpc_security_group_ids.423628628: "" => "sg-asdfasdf"
aws_instance.compute_100_dev: Modifications complete

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Close enough, right? Probably just some display error.

Then I validate in console and the only SG attached to the instance is sg-asdfasdf, so I begin to cry again. I wipe my face and then plan:

~ aws_instance.compute_100_dev
    vpc_security_group_ids.#:          "1" => "4"
    vpc_security_group_ids.1967715036: "" => "sg-12341234"
    vpc_security_group_ids.2402598217: "" => "sg-qwerqwer"
    vpc_security_group_ids.257549500:  "" => "sg-zxcvzxcv"

Plan: 0 to add, 1 to change, 0 to destroy.

Hooray. This looks more correct. I mean, it isn't showing sg-asdfasdf but surely this is just a display bug. Apply:

aws_instance.compute_100_dev: Modifying...
  vpc_security_group_ids.#:          "1" => "4"
  vpc_security_group_ids.1967715036: "" => "sg-12341234"
  vpc_security_group_ids.2402598217: "" => "sg-qwerqwer"
  vpc_security_group_ids.257549500:  "" => "sg-zxcvzxcv"
aws_instance.compute_100_dev: Modifications complete

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Console reveals that I again have only 3 SG and the new one isn't being put into place. It is too early in the day to start drinking.

Maybe I can make a workaround. Advice appreciated.

br0ch0n commented 7 years ago

@ekristen @Gary-Armstrong can you guys test my fix from #9791 ? Maybe it'll get more traction if it works for you.

Gary-Armstrong commented 7 years ago

Update:

  vpc_security_group_ids = [
    "${aws_security_group.wxmix_http_sg.id}",
    "${aws_security_group.wxmix_admin_sg.id}"
  ]

Using this HCL where the 2 SG are completely unrelated, when I do a plan/apply cycle it will switch to whichever SG isn't applied to the instance.

] aws ec2 describe-instances --instance-id i-xxxxxxxx|jq .Reservations[].Instances[].SecurityGroups[]
{
  "GroupName": "wxmix_http_sg",
  "GroupId": "sg-aaaaaaaa"
}
...TF APPLY...
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
...
] aws ec2 describe-instances --instance-id i-xxxxxxxx|jq .Reservations[].Instances[].SecurityGroups[]
{
  "GroupName": "wxmix_admin_sg",
  "GroupId": "sg-bbbbbbbb"
}

Just by doing an apply, it will flip back and forth between the two SG.

Gary-Armstrong commented 7 years ago

More experiments reveal the pattern (in my case) is to replace existing list items with new list items, rather than merge the list and replace existing list items with new merged list items.

Gary-Armstrong commented 7 years ago

@br0ch0n If it involves compiling a new terraform, I'll not participate. I have to manually change SG assignments on many instances today.

mpasternacki commented 7 years ago

Still happens in 0.8.6

br0ch0n commented 7 years ago

@stack72 is the prototype you mentioned in #1887 meant to work around this as well (i.e. you're planning on completely refactoring ignore_changes)? Otherwise can we move forward with #9791? Happy to tweak it if it's not an ideal solution.

cswilliams commented 7 years ago

just got bit by this bug as well today. Added a new security group to my vpc_security_group_ids list and it removed all existing security groups from my instance and only added the new security group.

catsby commented 7 years ago

Thanks for the research here. I've reproduced this issue with the following gist and instructions above:

I'm relabeling this as a core bug, the root of it seems to involve ignore_changes as mentioned.

catsby commented 7 years ago

Hey all – looks like this should be patched up with #12897 , will go out in the next release

ghost commented 4 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.