hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0
9.85k stars 9.2k forks source link

[Bug]: allow_overwrite=true for aws_route53_record updates the actual record but not the state #39889

Open dsantanu opened 4 weeks ago

dsantanu commented 4 weeks ago

Terraform Core Version

1.6.4

AWS Provider Version

5.61.0

Affected Resource(s)

aws_route53_zone

Expected Behavior

In the console, it should return these:

> aws_route53_zone.arz_test.name_servers
tolist([
  "ns1.ibx.zzzdevops.co.uk",
  "ns2.ibx.zzzdevops.co.uk",
  "ns3.ibx.zzzdevops.co.uk",
  "ns4.ibx.zzzdevops.co.uk",
])

The records = [...] from applying aws_route53_record.test_sub_ns resource should be:

Terraform will perform the following actions:

  # aws_route53_record.test_sub_ns will be created
  + resource "aws_route53_record" "test_sub_ns" {
      + allow_overwrite = (known after apply)
      + fqdn            = (known after apply)
      + id              = (known after apply)
      + name            = "ibx.zzzdevops.co.uk"
      + records         = [
          + "ns1.ibx.zzzdevops.co.uk",
          + "ns2.ibx.zzzdevops.co.uk",
          + "ns3.ibx.zzzdevops.co.uk",
          + "ns4.ibx.zzzdevops.co.uk",
        ]
      + ttl             = 50
      + type            = "NS"
      + zone_id         = "ZO24E812PQO08DUTIXL0"
    }

Actual Behavior

In terraform console:

> aws_route53_zone.arz_test.name_servers
tolist([
  "ns-1475.awsdns-56.org",
  "ns-1843.awsdns-38.co.uk",
  "ns-461.awsdns-57.com",
  "ns-968.awsdns-57.net",
])

Result from aws_route53_record.test_sub_ns resource apply:

Terraform will perform the following actions:

  # aws_route53_record.test_sub_ns will be created
  + resource "aws_route53_record" "test_sub_ns" {
      + allow_overwrite = (known after apply)
      + fqdn            = (known after apply)
      + id              = (known after apply)
      + name            = "ibx.zzzdevops.co.uk"
      + records         = [
          + "ns-1475.awsdns-56.org",
          + "ns-1843.awsdns-38.co.uk",
          + "ns-461.awsdns-57.com",
          + "ns-968.awsdns-57.net",
        ]
      + ttl             = 50
      + type            = "NS"
      + zone_id         = "ZO24E812PQO08DUTIXL0"
    }

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

resource "aws_route53_zone" "arz_test" {
  name          = "ibx.zzzdevops.co.uk"
  comment       = "TEST Hosted Zone"
  force_destroy = true

  delegation_set_id = "NO53847633UQIJ2PQ47CM"
}

// A-Record(s) for white-lable Name servers
resource "aws_route53_record" "test_wl_a" {
  for_each = toset([for ix in range(1, 5) : tostring(ix)])
  zone_id  = aws_route53_zone.arz_test.zone_id
  name     = "ns${each.value}"
  records  = [local.delegation_set_ips["zzzdevops-secondary"][(each.value - 1)]]
  type     = "A"
  ttl      = "300"
}

// Update NS with white-label name servers
resource "aws_route53_record" "test_wl_ns" {
  allow_overwrite = true
  name            = aws_route53_zone.arz_test.name
  ttl             = 7200
  type            = "NS"
  zone_id         = aws_route53_zone.arz_test.zone_id

  records = [
    for idx in range(1, 5) :
    aws_route53_record.test_wl_a["${idx}"].fqdn
  ]
}

#### Add sub-domain NS-record in the parent-zone ####
resource "aws_route53_record" "test_sub_ns" {
  name       = aws_route53_zone.arz_test.name
  ttl        = "50"
  type       = "NS"
  zone_id    = "ZO24E812PQO08DUTIXL0"
  records    = aws_route53_zone.arz_test.name_servers
  depends_on = [aws_route53_record.test_wl_ns]
}

Steps to Reproduce

terraform init terraform apply

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

github-actions[bot] commented 4 weeks ago

Community Note

Voting for Prioritization

Volunteering to Work on This Issue

justinretzolk commented 3 weeks ago

Hey @dsantanu 👋 Thank you for taking the time to raise this! Reviewing your report, I believe I know what's happening here. First, we'll need to look at the dependencies between the resources:

With those dependencies in mind, Terraform will create the zone first, followed by test_wl_a, then test_wl_ns, and finally test_sub_ns. Since the records are all dependent on the zone, Terraform must wait until the zone is completed (and has thus been saved to the state) before proceeding with the record creation.

What this means with your configuration is that what is saved in the state for aws_route53_zone.arz_test.name_servers is the data prior to aws_route53_record.test_wl_a and aws_route53_record.test_wl_ns being applied. Since aws_route53_record.test_sub_ns references that value for its records argument, the value (now outdated) is read from the state and used. On subsequent applies with no other changes having been made, this drift is rectified.

If you can provide them, debug logging (redacted as needed) may help confirm this theory. If I'm correct, this would be working as expected, as the order of operations is determined by Terraform Core. In that case, you'd need to refactor a bit in order to get around this.

dsantanu commented 3 weeks ago

thanks @justinretzolk for replying. I'll provide debug logging as soon as I can.

I think I understood the order of creation as you describled but in that case, allow_overwrite in aws_route53_record.test_wl_ns resouce doesn't seem to be doing what it supppose to do accotding to the documentation. It says:

allow_overwrite - (Optional) Allow creation of this record in Terraform to overwrite an existing record, if any. This does not affect the ability to update the record in Terraform and does not prevent other resources within Terraform or manual Route 53 changes outside Terraform from overwriting this record. false by default. This configuration is not recommended for most environments.

As it's modifying an existing zone (with existing state data) it should update the state as well - don't you think? But inversely, it manages to ignore the drift as well, which is technically wrong (or misleading) as well, IMO, in that context.

AFAIU, the A-records by aws_route53_record.test_wl_a cannot be created until the hosted-zone is created and the moment hosted-zone is created, it already has the default name-servers attached to it, which is being saved in the state file and not being updated at the later stage, when the NS-record is being updated by a diffrent resouce (aws_route53_record.test_wl_ns) in the same zone. Not sure is there a way to get around that order of creation.

justinretzolk commented 3 weeks ago

As it's modifying an existing zone (with existing state data) it should update the state as well - don't you think?

Since aws_route53_record.test_wl_ns and aws_route53_zone.arz_test are distinct resources, the former won't update the state of the latter. On the next apply, the aws_route53_zone.arz_test will update its state with the changes that were made in the previous run.

But inversely, it manages to ignore the drift as well, which is technically wrong (or misleading) as well, IMO, in that context.

I don't think I follow you on this bit, unfortunately. Would you mind expanding a bit on what you mean by that if the following doesn't help clear it up?

The aws_route53_records documentation has an example of how the allow_overwrite argument can be used. To help visualize it a bit here, I've copied over the example:

resource "aws_route53_zone" "example" {
  name = "test.example.com"
}

resource "aws_route53_record" "example" {
  allow_overwrite = true
  name            = "test.example.com"
  ttl             = 172800
  type            = "NS"
  zone_id         = aws_route53_zone.example.zone_id

  records = [
    aws_route53_zone.example.name_servers[0],
    aws_route53_zone.example.name_servers[1],
    aws_route53_zone.example.name_servers[2],
    aws_route53_zone.example.name_servers[3],
  ]
}

In this example, the aws_route53_zone will be created and the default name servers attached (as is happening in your case). When the aws_route53_record resource is created, the same name servers are used as the value for records. If allow_overwrite was not set to true in this case, errors would be thrown due to attempting to add a record that's already present.

If allow_overwrite was not an option, this configuration would require you to create the aws_route53_zone prior to adding the aws_route53_record to the configuration, then add the aws_route53_record configuration and import it into the state before any further manipulation could occur.

As far as the order of creation and working around this, rather than setting the value of aws_route53_record.test_sub_ns.records to aws_route53_zone.arz_test.name_servers, could you use aws_route53_record.test_wl_ns.records in order to grab those records?

Alternatively, if you're looking to include the default name servers and records from either aws_route53_record.test_wl_a or aws_route53_record.test_wl_ns (or both), you could use the concat() and distinct(), functions to create a full, deduplicated list, something like:

resource "aws_route53_record" "test_sub_ns" {
  name       = aws_route53_zone.arz_test.name
  ttl        = "50"
  type       = "NS"
  zone_id    = "ZO24E812PQO08DUTIXL0"
  records    = concat(distinct(
    aws_route53_zone.arz_test.name_servers,
    aws_route53_record.test_wl_a.records,
    aws_route53_record.test_wl_ns
  ))
  # depends_on is no longer needed, since there's an implicit dependency
}

This would mean that on the first apply, you'd get all of the records you expect, but on subsequent applies, when the drift on aws_route53_zone.arz_test.name_servers is corrected, you wouldn't wind up with duplicates in your list of records being passed to aws_route53_record.test_sub_ns.

I hope this helps some, and that I've understood your requirements properly. Happy to keep chatting if I've misunderstood, or if something still seems buggy to you.

dsantanu commented 3 weeks ago

Okay, looks like I didn't actually able to make the issue very clear. Let me try again............. :)

The aws_route53_records documentation has an example of how the allow_overwrite argument can be used. To help visualize it a bit here, I've copied over the example:

First of all, that example in the doc is technically correct but mostly not useful. aws_route53_zone.example will create the hosted-zone and at the same time, it will also do the NS record-set with the default name server. Now, as far as I understand, the following aws_route53_record.example is pointless here as it's modifying the a record, with the exact same values. It's correct as an example to show how it works but totally redundant in this context. There is no drift here as the values for NS record are not actually changed. In my case, I'm doing white-label name-servers and it'd be something like this:

resource "aws_route53_record" "example" {
  allow_overwrite = true
  name            = aws_route53_zone.example.name
  ttl             = 50
  type            = "NS"
  zone_id         = aws_route53_zone.example.zone_id

  records = [
    ns1.zzzdevops.co.uk,
    ns2.zzzdevops.co.uk,
    ns3.zzzdevops.co.uk,
    ns4.zzzdevops.co.uk,
  ]
}

Here I'm leveraging allow_overwrite = true because I'm modifying an existing record with different values (i.e. aws_route53_zone.example.name_servers[0] vs. ns1.zzzdevops.co.uk). But in the example in the doc, as it's with the same value, a) it's pointless in actual scenario and b) there is no drift in the state file.

I don't think I follow you on this bit, unfortunately. Would you mind expanding a bit on what you mean by that if the following doesn't help clear it up?

Now that's brings up the next thing. As I used allow_overwrite = true to overwrite the default NS records, it successfully updated the actual record. Meaning, if I use AWS CLI or AWS console, I can see now my NS records are saying ns1.zzzdevops.co.uk, ns2.zzzdevops.co.uk (instead of the default ones, like ns-85.awsdns-10.com etc.) but if I query through TF, it still shows the default value, which I believe is coming from the state file, when it was originally created and saved (before modifying the original NS records). Here, TF should have complained and try to overwrite, as the actual values for NS records now are different from the values in state, yet it manages not to complain somehow. It doesn't matter how many times I apply and re-apply, this behavior doesn't change.

Now, this is from the actual environment (deducted some characters just for confidentiality). If I query the zone for NS-record, using AWS CLI, I get this:

$ aws route53 list-resource-record-sets \
      --hosted-zone-id ZO24E812PQO08DUTIXL0 \
      --query "ResourceRecordSets[?Name == 'zzzdevops.co.uk.' && Type == 'NS']"
[
    {
        "Name": "zzzdevops.co.uk.",
        "Type": "NS",
        "TTL": 7200,
        "ResourceRecords": [
            {
                "Value": "ns2.zzzdevops.co.uk"
            },
            {
                "Value": "ns4.zzzdevops.co.uk"
            },
            {
                "Value": "ns1.zzzdevops.co.uk"
            },
            {
                "Value": "ns3.zzzdevops.co.uk"
            }
        ]
    }
]

which is absolutely correct, as the allow_overwrite = true has gracefully overwritten the original NS-record, with my custom ones.

But if I query the very same thing from TF, I get this:

> module.local_hosted_zones["zzzdevops.co.uk"].name_servers
tolist([
  "ns-1483.awsdns-57.org",
  "ns-1969.awsdns-54.co.uk",
  "ns-85.awsdns-10.com",
  "ns-954.awsdns-55.net",
])

which are not correct anymore, as it has been changed by applying aws_route53_record.test_wl_ns resource in my original code snippet.

Did I make it any clearer? Or, am I missing something here?

dsantanu commented 2 weeks ago

Hi there! any further comment/suggestions, @justinretzolk ??

justinretzolk commented 2 weeks ago

Thanks for the follow up here @dsantanu!

First of all, that example in the doc is technically correct but mostly not useful. (omitted for brevity) It's correct as an example to show how it works but totally redundant in this context. There is no drift here as the values for NS record are not actually changed.

I see your point, and acknowledge that the example definitely is not doing exactly what you're doing in this case. My goal in mentioning it was more to discuss the overall utility of allow_overwrite and how it enabled the aws_route53_record resource to be created at the same time as the aws_route53_zone without the need for import, using a more simple configuration that didn't also modify the NS records at the same time. I apologize if it added any undue confusion or felt redundant to mention.

...if I query through TF, it still shows the default value, which I believe is coming from the state file, when it was originally created and saved (before modifying the original NS records).

After the first apply, this is what I would expect for aws_route53_zone.arz_test.name_servers. Since the other resources are dependent on the aws_route53_zone, Terraform will finish creating it and saving it to the state prior to creating any of the other resources. Once Terraform has moved on from the aws_route53_zone resource, it won't revisit it and modify the state with the new values for name_servers until the next time that terraform apply is run -- even if other resources in the configuration modify the zone in some way.

Here, TF should have complained and try to overwrite, as the actual values for NS records now are different from the values in state, yet it manages not to complain somehow. It doesn't matter how many times I apply and re-apply, this behavior doesn't change.

This, on the other hand, is interesting to me. Are you saying that no matter how many times you've run terraform apply, the value of aws_route53_zone.arz_test.name_servers in the state remains the same (the default, outdated name servers)?

I also noticed that your terraform console output seems to reference something slightly different (module.local_hosted_zones["zzzdevops.co.uk"].name_servers). Can you give me an idea of what that output definition looks like, to make sure there's no confusion?

I re-read all of our discussion thus far to make sure I wasn't getting anything confused, and wanted to touch on something I mentioned before, but was a bit incorrect on.

Looking at your initial sample configuration and considering the module call in your last reply (that I asked about just above), it seems to me that the root of the issue comes from wanting to ensure you have the correct values for the zone's name servers in order to use them in the configuration for aws_route53_record.test_sub_ns and potentially a module output. Since you're overwriting the NS record using aws_route53_record.test_wl_ns, is there a reason why you wouldn't use the records from that resource instead? E.g.:

#### Add sub-domain NS-record in the parent-zone ####
resource "aws_route53_record" "test_sub_ns" {
  name       = aws_route53_zone.arz_test.name
  ttl        = "50"
  type       = "NS"
  zone_id    = "ZO24E812PQO08DUTIXL0"
  records    = aws_route53_record.test_wl_ns.records
}

output "name_servers" {
  value = aws_route53_record.test_wl_ns.records
}
dsantanu commented 1 week ago

@justinretzolk - thanks for you reply.

This, on the other hand, is interesting to me. Are you saying that no matter how many times you've run terraform apply, the value of aws_route53_zone.arz_test.name_servers in the state remains the same (the default, outdated name servers)?

this is exactly what happening and that's where I think it's a bug in the provider. The result I pasted abaove was from a system, deployed over a week ago with at least another 50+ apply since then and the result is still the default, outdated name servers.

I got bust with som eother stuff lately, hence couldn't reply any sooner. I can try to make a lot simple code (without the use of child-modules etc.) if that helps. But I can guarantee it won't change anything at all. I'm sure, with a little modification, you can reproduce the same, using the code from my original post but let me know if you want me to provide any other outout.

-S

justinretzolk commented 1 week ago

Hey @dsantanu 👋 I appreciate the additional follow up there. I see where I was missing you now, and apologize if it felt like we were going in circles a bit -- I try to resolve anything I can whilst triaging, since it can take a bit for things to get prioritized.

Based on your follow up, I don't think we need any additional information at this time, so I'm going to remove the needs-triage label so that this can now await prioritization.

dsantanu commented 1 week ago

thanks @justinretzolk, no need for apology; it was probably me on the first place not to make it very clear. In the mean time I'm using output from aws_route53_record.test_wl_a in stead of taking directly from aws_route53_zone.arz_test as the work around. ty!