hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.

https://registry.terraform.io/providers/hashicorp/aws

Mozilla Public License 2.0

9.76k stars 9.12k forks source link

[Bug]: Wrong output when Elastic IP was reassigned #37584

Open EugenKon opened 4 months ago

EugenKon commented 4 months ago

Terraform Core Version

v1.8.2

AWS Provider Version

v4.67.0

Affected Resource(s)

aws_eip.manage-cluster
aws_instance.zzz-public

Expected Behavior

Changes to Outputs:
  ~ zzz = "zzz: 44.XX.YY.90" -> "zzz: 54.XX.YY.143"

Actual Behavior

Changes to Outputs:
  ~ zzz = "zzz: 54.XX.YY.143" -> "zzz: 44.XX.YY.90"

If I rerun the plan/apply, then I can see the correct values:

Changes to Outputs:
  ~ zzz = "zzz: 44.XX.YY.90" -> "zzz: 54.XX.YY.143"

Though it should be done only by one run of plan/apply.

Output does not corresponds to the actual value:

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

resource "aws_instance" "zzz-public" {
  ...
}

resource "aws_eip" "manage-cluster" {
  instance = aws_instance.zzz-public.id          # This line was changed to assign Elastic ip to zzz-public instance

  tags = {
    Name = "Public IP for a 'fleetctl' cluster management"
  }
}

output "zzz" {
  value = "zzz: ${aws_instance.zzz-public.public_ip}"
}

...

  # aws_eip.manage-cluster will be updated in-place
  ~ resource "aws_eip" "manage-cluster" {
        id                       = "eipalloc-zzz"
      ~ instance                 = "i-zzz" -> "i-xxx"
        tags                     = {
            "Name" = "Public IP for a 'fleetctl' cluster management"
        }
        # (15 unchanged attributes hidden)
    }

  # aws_instance.zzz will be updated in-place
  ~ resource "aws_instance" "zzz" {
        id                                   = "i-zzz"
      ~ tags                                 = {
          - "Copy"     = "yes" -> null
          ~ "Name"     = "zzz-copy" -> "zzz-public"
            "multires" = "1"
            "tmp"      = "1"
        }
      ~ tags_all                             = {
          - "Copy"     = "yes" -> null
          ~ "Name"     = "zzz-copy" -> "zzz-public"
            # (2 unchanged elements hidden)
        }
        # (37 unchanged attributes hidden)

        # (9 unchanged blocks hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

Steps to Reproduce

create a cluster with an instance and elastic ip
create a second instance there
change configuration to reassign elastic ip to the new instance
terraform plan -target aws_instance.zzz -target aws_eip.manage-cluster -out copy
terraform apply copy
terraform plan -target aws_instance.zzz -target aws_instance.zzz-public -out ip
terraform apply ip
After second run on step 7 I can see actual values

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

github-actions[bot] commented 4 months ago

Community Note

Voting for Prioritization

Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
Please see our prioritization guide for information on how we prioritize.
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

If you are interested in working on this issue, please leave a comment.
If this would be your first contribution, please review the contribution guide.

justinretzolk commented 4 months ago

Hey @EugenKon 👋 Thank you for taking the time to raise this! In this case, this is how I would expect this to behave, and is more related to Terraform Core than the AWS Provider itself.

Looking at your example configuration and notes, the output is using an attribute from the aws_instance, but that resource isn't dependent on the aws_eip. Since there's no dependency in that direction, the aws_instance is likely being updated in the state before the aws_eip has had a chance to be applied and the public IP changed. On the next plan/apply, this gets reconciled and the output gets updated.

Unfortunately, in this case you can't just add an expliict depends_on, the aws_eip is dependant on the aws_instance. With that in mind, could you update your configuration so that the output uses the aws_eip.public_ip attribute work for your configuration instead? This shouldn't need the same two-stage apply to work.

EugenKon commented 4 months ago

We have two instances: zzz and zzz-public. The ElasticIP could be assigned to either host. I do not think we can use aws_eip.public_ip, because here we want to know the IP for each host we have and not the EIP, which could be on one or the other host.

Below I want to provide more details how we change the IP between hosts (Please look at it as separate example, which does not belongs to the cluster in the first post):

resource "aws_instance" "zzz1" {
}

resource "aws_instance" "zzz2" {
  ...
}

resource "aws_eip" "manage-cluster" {
  instance = aws_instance.zzz2.id          # This was chanaged from zzz1 to zzz2
}

output "zzz1" {
  value = "zzz1: ${aws_instance.zzz1.public_ip}"
}

output "zzz2" {
  value = "zzz2: ${aws_instance.zzz2.public_ip}"
}

  # aws_eip.manage-cluster will be updated in-place
  ~ resource "aws_eip" "manage-cluster" {
        id                       = "eipalloc-*"
      ~ instance                 = "i-*1" -> "i-*2"
        # (15 unchanged attributes hidden)
    }

From the plan above it even does not show that IP will be changed for zzz host.

First plan/apply:
zzz1 = "zzz1: x.143"
zzz2 = "zzz2: y"

Second plan:
  # aws_instance.zzz2 has changed
  ~ resource "aws_instance" "zzz2" {
        id                                   = "i-*2"
      ~ public_ip                            = "y" -> "x.143"
        # (38 unchanged attributes hidden)

        # (9 unchanged blocks hidden)
    }

  # aws_instance.zzz1 has changed
  ~ resource "aws_instance" "zzz1" {
        id                                   = "i-1"
      ~ public_ip                            = "x.143" -> "new.IP"
        # (37 unchanged attributes hidden)
...
Changes to Outputs:
  ~ zzz1 = "zzz1: x.143" -> "zzz1: new.IP"
  ~ zzz2 = "zzz2: y" -> "zzz2: x.143"

Second apply:

I understand, that there is some dependencies. But, I suppose, you could agree that users could do not know that dependency and the internals of Terraform. And in this case it leads to unexpected results, because the actual host IP is different in compare to TF state.

In this case, I believe, you should update Terraform internals.

EugenKon commented 4 months ago

Interesting. If I comment out both:

resource "aws_instance" "zzz1" {
}
output "zzz1" {
  value = "zzz1: ${aws_instance.zzz1.public_ip}"
}

I see in my plan that zzz1 and output will be removed. But when I use -target aws_instance.zzz1 it still shows output for zzz1:

Apply complete! Resources: 0 added, 0 changed, 1 destroyed.

Outputs:

zzz1 = "new.IP"
zzz2 = "zzz2: x.143"

If outputs for zzz1 depends on zzz1 host, why it was not removed? I suppose, this issue with output still belongs to the original issue in a someway. Because it is how output works. In this case it completely was not updated, though it directly depends on deleted object. In the other ticket you told me that dependent objects are still updated even if -target was used.

EugenKon commented 4 months ago

If I do vice-versa: adding the host and output back, then I get expected results: The IP address for output is updated.

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

zzz1 = "new.IP.2"
zzz2 = "zzz2: x.143"

Thus updates to output do not work only when instance is destroyed.

justinretzolk commented 4 months ago

Hey @EugenKon 👋 Thanks for the additional context, and for that standalone example -- that's really helpful.

From the plan above it even does not show that IP will be changed for zzz host.

This is a very important callout, and is the root of why you're seeing a need to do a double apply with how you have things configured. This is oversimplified, but the steps of the plan/apply after changing aws_eip.manage-cluster.instance would essentially be:

First plan/apply:

aws_instance.zzz1 and aws_instance.zzz2 are read and then written to state in parallel, since neither of them are dependent on anything else (at least as far as we're concerned for this particular example).
After aws_instance.zzz2 is finished updating, aws_eip.manage-cluster begins to be read and updated (respecting the depedency introduced by the configuration change). This will change the IP for both of the instances.
The outputs given are interpolated from the state and returned.

Because the aws_instance resources were updated in the state prior to being modified by aws_eip.manage-cluster, the output is the outdated values for their public_ip attribute.

Second plan/apply:

aws_instance.zzz1 and aws_instance.zzz2 are read and then written to state in parallel. Both have changed since Terraform was last run (by aws_eip.manage-cluster), so their public_ip attribute is updated in the state.
aws_eip.manage-cluster is refreshed, with no changes detected.
The outputs given are interpolated from the state and returned, with the correct values.

Ultimately, I agree with you that I wouldn't expect the average Terraform user to inherently recognize this. In reviewing the documentation to see how we could address this, however, I noticed that we mention something about this already:

public_ip - Public IP address assigned to the instance, if applicable. NOTE: If you are using an aws_eip with your instance, you should refer to the EIP's address directly and not use public_ip as this field will change after the EIP is attached.

So the documentation would typically recommend what I'd mentioned before -- using the aws_eip resources to get the IP. Since that won't work for you, introducing an aws_instance data source for each of your aws_instance resources might. I wouldn't typically recommend this approach, but by marking these data sources as dependent on the aws_eip.manage-cluster, you can force Terraform to read them after the aws_eip has been applied. Using the output of the data sources in your output interpolations should get you the correct IP addresses in one apply.

As far as applies with -target and how they interact with outputs, I'd have to defer to someone on the Terraform Core team. That's Core functionality that I'm unfortunately a bit hazy on the details of, having not had to touch it in a while.

EugenKon commented 4 months ago

@justinretzolk Do I understand correct that I should add lifecycle.depends_on configuration option to the both instances?

justinretzolk commented 2 months ago

@EugenKon for the data sources, correct.

EugenKon commented 2 months ago

I can check this in a few weeks later.

justinretzolk commented 1 month ago

Hey @EugenKon, just checking in here to see if you were able to test this any further, or whether we should close this one out.

EugenKon commented 1 month ago

Sorry. No, I have not worked on this task. Probably next month.

EugenKon commented 3 weeks ago

It looks like it does not belongs to Elastic IP. We have issues when IP address was changed for some reason. To fix this we need to run plan/apply twice.

For the next change:

@@ -145,7 +145,7 @@ resource "aws_instance" "consul_server" {
   iam_instance_profile        = aws_iam_instance_profile.www.name

   user_data = templatefile("shared/data-scripts/user-data-consul-server.sh.tpl", {
-    ud_hostname     = "consul-server-${count.index}"
+    ud_hostname     = "consul-server-${var.project_name}-${count.index}"
     ud_aws_region   = var.aws_region
     ud_project_name = var.project_name
     ud_server_count = local.server_count
@@ -220,7 +220,7 @@ resource "aws_instance" "www" {
   iam_instance_profile        = aws_iam_instance_profile.www.name

   user_data = templatefile("shared/data-scripts/user-data-consul-client.sh.tpl", {
-    ud_hostname     = "www"
+    ud_hostname     = "www-${var.project_name}-${count.index}"
     ud_aws_region   = var.aws_region
     ud_project_name = var.project_name
     ud_retry_join   = local.retry_join

The plan is:

  # module.private-cloud.aws_instance.consul_server[0] will be updated in-place
  ~ resource "aws_instance" "consul_server" {
        id                                   = "i-0de6eb"
        tags                                 = {
            "ConsulAutoJoin" = "nomad-auto-join"
            "Name"           = "nomad-consul-server-0"
            "NomadType"      = "server"
        }
      ~ user_data                            = "e01f00525cd6dfec0a8b73X" -> "5823a3c6e4ba6e9ad983Y"
        # (39 unchanged attributes hidden)

        # (10 unchanged blocks hidden)
    }

  # module.private-cloud.aws_instance.www[0] will be updated in-place
  ~ resource "aws_instance" "www" {
        id                                   = "i-0db83"
        tags                                 = {
            "ConsulAutoJoin" = "nomad-auto-join"
            "Name"           = "nomad-www"
            "NomadType"      = "client"
        }
      ~ user_data                            = "e8302a79f848df0f7X" -> "480594b59deeae93Y"
        # (39 unchanged attributes hidden)

        # (9 unchanged blocks hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

It was applied successfully:

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

Outputs:

ConnectionInfo = <<EOT

Client EC2 instances: 34.221.137.248
Server EC2 instances: 35.91.90.127
....

If I run plan again I get this output:

Terraform will perform the following actions:

  # module.dns.aws_route53_record.photos-private-cloud[0] will be updated in-place
  ...
  # module.dns.aws_route53_record.tour-private-cloud[0] will be updated in-place
  ~ resource "aws_route53_record" "tour-private-cloud" {
        id                               = "Z0_tour.XXX_A"
        name                             = "tour.XXX"
      ~ records                          = [
          - "34.221.137.248",
          + "35.X.Y.Z",
        ]
        # (7 unchanged attributes hidden)
    }

Plan: 0 to add, 6 to change, 0 to destroy.

Changes to Outputs:
  ~ ConnectionInfo      = <<-EOT
      - Client EC2 instances: 34.221.137.248
      + Client EC2 instances: 35.X.Y.Z
      - Server EC2 instances: 35.91.90.127
      + Server EC2 instances: 54.X.Y.N
...

The parts of configuration files:

locals {
  consul_server_ip = var.vpn_setup == "client-ssh-only" ? module.private-cloud.consul_server_private_ip : module.private-cloud.consul_server_public_ip
  consul_client_ip = var.vpn_setup == "client-ssh-only" ? module.private-cloud.www_private_ip : module.private-cloud.www_public_ip
}
output "ConnectionInfo" {
  value = <<CONFIGURATION

Client EC2 instances: ${join(", ", local.consul_client_ip[*])}
Server EC2 instances: ${join(", ", local.consul_server_ip[*])}

### private-cloud/output.tf
output "www_public_ip" {
  description = "This is public ip address of the application"
  value       = aws_instance.www[*].public_ip
}

output "www_private_ip" {
  description = "This is the private ip address of the application"
  value       = aws_instance.www[*].private_ip
}

output "consul_server_public_ip" {
  description = "This is public ip address of the application"
  value       = aws_instance.consul_server[*].public_ip
}

output "consul_server_private_ip" {
  description = "This is the private ip address of the application"
  value       = aws_instance.consul_server[*].private_ip
}

### private-cloud/ec2.tf
resource "aws_instance" "consul_server" {
  ...
  user_data = templatefile("shared/XX.tpl", {
    ud_hostname     = "consul-server-${var.project_name}-${count.index}"
    ...
resource "aws_instance" "www" {
  user_data = templatefile("shared/YY.tpl", {
    ud_hostname     = "www-${var.project_name}-${count.index}"
  ...