Fail to apply changes because of wrong resource dependency order

mm-chia commented 1 month ago

Terraform Version

Terraform v1.9.3
on darwin_arm64
+ provider registry.terraform.io/hashicorp/aws v5.64.0

Terraform Configuration Files

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

locals {
  vpc_cidr = "10.0.0.0/16"
  cloudmap_namespace_name = "app.local"
  unique_suffix = "9d25709fbcea"

  canaries = {
    "v5" = {
      route_weight = 100
    },
//    "v6" = {
//     route_weight = 0
//    }
  }
}

resource "aws_vpc" "main" {
  cidr_block           = local.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_service_discovery_private_dns_namespace" "namespace" {
  name = local.cloudmap_namespace_name
  vpc  = aws_vpc.main.id
}

resource "aws_appmesh_mesh" "mesh" {
  name = "${terraform.workspace}-mesh-${local.unique_suffix}"
}

resource "aws_appmesh_virtual_node" "canary" {
  for_each = local.canaries

  name      = "VirtualNode${each.key}-${local.unique_suffix}"
  mesh_name = aws_appmesh_mesh.mesh.name

  spec {
    service_discovery {
      aws_cloud_map {
        namespace_name = aws_service_discovery_private_dns_namespace.namespace.name
        service_name   = aws_service_discovery_service.canary[each.key].name
      }
    }

    listener {
      port_mapping {
        port     = 80
        protocol = "http"
      }
    }
  }
}

resource "aws_service_discovery_service" "canary" {
  for_each = local.canaries

  name        = "app-${each.key}-discovery-${local.unique_suffix}"
  description = "Service based on a private DNS namespace"
  dns_config {
    dns_records {
      type = "A"
      ttl  = 60
    }
    routing_policy = "MULTIVALUE"
    namespace_id   = aws_service_discovery_private_dns_namespace.namespace.id
  }
}

resource "aws_appmesh_virtual_router" "router" {
  name      = "VirtualRouter-${local.unique_suffix}"
  mesh_name = aws_appmesh_mesh.mesh.name

  spec {
    listener {
      port_mapping {
        port     = 80
        protocol = "http"
      }
    }
  }
}

resource "aws_appmesh_route" "app_route" {
  name                = "AppMeshRoute-${local.unique_suffix}"
  mesh_name           = aws_appmesh_mesh.mesh.name
  virtual_router_name = aws_appmesh_virtual_router.router.name

  spec {
    http_route {
      action {
        # Use a dynamic block to create weighted_target entries for each canary
        dynamic "weighted_target" {
          for_each = local.canaries

          content {
            virtual_node = aws_appmesh_virtual_node.canary[weighted_target.key].name
            weight       = weighted_target.value.route_weight
          }
        }
      }
      match {
        prefix = "/"
      }
    }
  }
}

aws_appmesh_virtual_node.canary["v6"]: Destroying... [id=1f9a30c5-d5cb-429d-a480-31d337738c7e]
╷
│ Error: deleting App Mesh Virtual Node (1f9a30c5-d5cb-429d-a480-31d337738c7e): operation error App Mesh: DeleteVirtualNode, https response error StatusCode: 409, RequestID: 13dfdbbd-5a76-4170-898e-f95e43759894, ResourceInUseException: VirtualNode with name VirtualNodev6-9d25709fbcea cannot be deleted because it is the target of one or more routes.

Debug Output

https://gist.github.com/mm-chia/ac7c178acbf4cca39377bab987191c52

Expected Behavior

Terraform should update the aws_appmesh_route.app_route resource first, then delete the aws_appmesh_virtual_node.canary["v6"] resource, so it can complete the operation successfully.

Actual Behavior

Terraform tries to delete aws_appmesh_virtual_node.canary["v6"] before updating the aws_appmesh_route.app_route resource. This results in failure, since the virtual_node cannot be deleted while it is still referenced in the appmesh route.

Steps to Reproduce

Run terraform init
Run terraform apply with the above code with both v5 and v6 properties enabled.
Then comment out the v6 property and re-run terraform apply again.

Additional Context

You can run the code from this repository: https://github.com/test-org-tofu2/app-mesh-test

References

No response

princejha95 commented 1 month ago

@mm-chia Can i pick this up ?

mm-chia commented 1 month ago

@mm-chia Can i pick this up ?

@princejha95 sure, thanks!

jbardin commented 1 month ago

Hi @mm-chia,

Terraform is applying the changes correctly according the configuration you are using. The default ordering is to complete destroy operations before create or update operations. If you want the inverse ordering for dependent operations, you need to use the create_before_destroy lifecycle option.

We use GitHub issues for tracking bugs and enhancements, rather than for questions. While we can sometimes help with certain simple problems here, it's better to use the community forum where there are more people ready to help.

Thanks!

mm-chia commented 1 month ago

Hi @mm-chia,

Terraform is applying the changes correctly according the configuration you are using. The default ordering is to complete destroy operations before create or update operations. If you want the inverse ordering for dependent operations, you need to use the create_before_destroy lifecycle option.

We use GitHub issues for tracking bugs and enhancements, rather than for questions. While we can sometimes help with certain simple problems here, it's better to use the community forum where there are more people ready to help.

Thanks!

Hi @jbardin ,

I did try to use the lifecycle option create_before_destroy, but it doesn't change anything. That's only in the context of creating that particular resource, but doesn't have to do with dependencies.

There are 2 scenarios here:

The resource aws_appmesh_route should be updated after a new aws_appmesh_virtual_node resource is created or updated - this scenario works fine
When a resource aws_appmesh_virtual_node is deleted, the resource aws_appmesh_route should be updated FIRST, and then the virtual_node should be deleted -> this scenario does NOT work correctly

If you followed the steps that I gave to reproduce the issue, you'd know what I am talking about.. but clearly, you did not. Could you please try that, and let me know how it goes for you?

mm-chia commented 1 month ago

I think this issue should stay open until one of this happens:

someone provides some steps for this scenario to work with current Terraform code/options
or it is fixed in the Terraform code (the dependencies should be calculated correctly)

mm-chia commented 1 month ago

BTW, working_as_designed is not acceptable. If the behaviour is buggy, it should be fixed..

jbardin commented 1 month ago

@mm-chia,

I don't have AWS credentials to test the given resources, but a minimal example using built-in resources works as expected:

variable "canaries" {
  type = set(string)
}

resource "terraform_data" "canary" {
  for_each = var.canaries
  lifecycle {
    create_before_destroy = true
  }
}

resource "terraform_data" "app_route" {
  input = terraform_data.canary
}

# create the resources with a single "canary"
%  terraform apply -var 'canaries=["v5"]' -auto-approve
...
# remove the "canary" resource
% terraform apply -var 'canaries=[]' -auto-approve
...
terraform_data.app_route: Modifying... [id=12f60287-2700-778c-008c-d4b17931b593]
terraform_data.app_route: Modifications complete after 0s [id=12f60287-2700-778c-008c-d4b17931b593]
terraform_data.canary["v5"]: Destroying... [id=8b0b40fe-d1f2-4de6-7b49-fc7b70179e32]
terraform_data.canary["v5"]: Destruction complete after 0s

Which modifies the app_route instance before destroying the canary instance. The order of operations in your original example are correct, Terraform is following exactly what is laid out in the configuration. Terraform does not know that the app_route has stored an internal reference to the canary in some way and must be updated before the canary instance can be removed.

mm-chia commented 1 month ago

Hi @jbardin

I'm sorry, but this is not a valid example, you will actually need to use some valid AWS credentials to run that scenario. The reason is that you used terrafrom_data resource, which doesn't have any constraints in how it is created/deleted.

There are some constraints regarding aws_appmesh_virtual_node and aws_appmesh_route resources:

the aws_appmesh_virtual_node should be created/updated first, then aws_appmesh_route can be created/updated
when deleting a aws_appmesh_virtual_node, it should be removed from aws_appmesh_route FIRST -> and this is the issue we are having

Could you please try the steps I provided on a real AWS account, and reproduce the issue by yourself? I can't explain any better than that, only if you reproduce it yourself you could understand what's happening.

best regards,

mm-chia commented 1 month ago

BTW, I already tried the lifecycle option create_before_destroy yesterday, I spent a few hours trying to figure this out, with no success. So that route is a dead end, for sure. It is unrelated to the issue we are having. We have a dependency between 2 different resources..

jbardin commented 1 month ago

Terraform has no knowledge of what resources are, all resource types from all providers are handled with the same resource lifecycle, so the replacement resources here don't change anything. This follows the exact same ordering as described, the proxy for the aws_appmesh_virtual_node.canary would be created first, then aws_appmesh_route.app_route would be created after with the referenced data from the canary. When deleting the aws_appmesh_virtual_node, the aws_appmesh_route will be updated before the delete operation.

Nothing here looked too onerous, so I used my personal AWS account, and successfully applied using your instructions (with the addition of create_before_destroy of course):

aws_appmesh_route.app_route: Modifying... [id=362164c4-2791-4bd4-8cda-0a61aa74a366]
aws_appmesh_route.app_route: Modifications complete after 0s [id=362164c4-2791-4bd4-8cda-0a61aa74a366]
aws_appmesh_virtual_node.canary["v6"]: Destroying... [id=9d39a38a-dbc0-4309-af5c-ba68685f7c95]
aws_appmesh_virtual_node.canary["v6"]: Destruction complete after 0s
aws_service_discovery_service.canary["v6"]: Destroying... [id=srv-s6pqsu3glrv2aeha]
aws_service_discovery_service.canary["v6"]: Destruction complete after 1s

mm-chia commented 1 month ago

Could you please share the exact code you used? maybe create a PR on that repo I shared above? I'd really like to see how you solved this.. since I have tried the create_before_destroy option already, and it didn't do anything different

mm-chia commented 1 month ago

@jbardin I added you as collaborator on https://github.com/test-org-tofu2/app-mesh-test If you could please push your changes to a branch, I would greatly appreciate it! Or post here the code change you did, to make that scenario work.

Keep in mind, that first you need to run the code with:

canaries = {
    "v5" = {
      route_weight = 100
    },
    "v6" = {
     route_weight = 0
    }
  }

and then use:

  canaries = {
    "v5" = {
      route_weight = 100
    }
  }

Only this way you can reproduce the problem

jbardin commented 1 month ago

yes, that is exactly what I did. The only initial change made was to add create_before_destroy

diff --git a/main.tf b/main.tf
index 219ce8a..1aad9b4 100644
--- a/main.tf
+++ b/main.tf
@@ -59,6 +59,9 @@ resource "aws_appmesh_virtual_node" "canary" {
     }
   }

+  lifecycle {
+    create_before_destroy = true
+  }
 }

mm-chia commented 1 month ago

Apparently, the lifecycle create_before_destroy option DOES make a difference. But it has to be set from the beginning, when the resources are created, not afterwards - from what I could tell. I did have one issue VirtualNode with name VirtualNodev4-9d25709fbcea cannot be deleted because it is the target of one or more routes. but after I re-ran the apply, it worked (I think it was a timing issue). So we can keep this issue closed :) Thanks a lot for your time!

hashicorp / terraform