hashicorp / terraform-provider-consul

Terraform Consul provider
https://www.terraform.io/docs/providers/consul/
Mozilla Public License 2.0

resource.consul_service "async" changes after creation #347

Open marceloboeira opened 1 year ago

marceloboeira commented 1 year ago

I'm not 100% sure whether this is the TF provider's fault or simply "the way Consul works", but almost every time I create a Consul service (with checks), the terraform plan that follows the terraform apply includes a change with the service check information, even though it was already "published" to Consul in the first plan/apply run.

Terraform Version

Terraform v1.5.1 (but it's also problematic on 1.3.x, 1.4.x)
on darwin_amd64

Affected Resource(s)

Terraform Configuration Files

resource "consul_service" "service" {
  service_id = "example"
  name       = "cache-foo"
  node       = "cache-foo-node"
  address    = "10.9.9.99"
  port       = 6379

  check {
    check_id = "service:${var.name}"
    name     = "cache-foo"
    notes    = "Service check for service:cache-foo"
    ...
  }
}

Expected Behavior

The second plan should show no changes, since the service and its check should already have been created by the code above.

Actual Behavior

After the first plan/apply (possibly due to some async process on consul's side?) the next terraform plan shows:

 resource "consul_service" "service" {
       ....
+        check {
+           check_id                          = "service:cache-foo"
+           deregister_critical_service_after = "30s"
+           interval                          = "30s"
...
        }
    }

Steps to Reproduce


  1. terraform plan
  2. terraform apply
  3. wait a few minutes to be sure
  4. terraform plan (without any .TF code change)
  5. See weird "already applied" changes

Important Factoids

What I'm unsure of is if this:

Checking the code for the create part, I don't see any major issues:

https://github.com/hashicorp/terraform-provider-consul/blob/9c5772f607ad26325c6bab96917fb41f875dd621/consul/resource_consul_service.go#L234-L253

Then checking how it is read also, nothing big other than it relies on those values being there in the first place:

https://github.com/hashicorp/terraform-provider-consul/blob/9c5772f607ad26325c6bab96917fb41f875dd621/consul/resource_consul_service.go#L271C1-L344

My money would be on service.Checks being empty in the first "read" during the apply but populated later on further reads:

https://github.com/hashicorp/terraform-provider-consul/blob/9c5772f607ad26325c6bab96917fb41f875dd621/consul/resource_consul_service.go#L302

Finally, what leads me to believe it is a Consul "problem" is that the tests do not have this issue. Possibly a slight replication delay, combined with different nodes receiving the "write" and "read" requests, could cause it. The weird part is why the service itself would be replicated but not the service check...

If that is the case, is there anything specific that can be done to perhaps reduce the likelihood of that happening?

remilapeyre commented 11 months ago

Hello @marceloboeira, thanks for the detailed write-up. As you mentioned, this situation does not happen in the tests or in a single-node cluster. The tests are also a bit peculiar here, as none of them test an actual running service.

It is possible that the diff occurs after the Consul agent on the node running the service updates the check in Consul for the first time, which would be an async operation happening after the service is registered in the Consul catalog.

The diff is probably benign but we may be able to use a diff suppress function to hide the changes when this happens, if we can detect it reliably (we wouldn't want to hide actual changes by mistake).
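To make the "detect it reliably" concern concrete, here is a rough Go sketch of the comparison such a suppress function might wrap. It assumes, without confirmation from a full diff, that only agent-maintained fields such as the check status change asynchronously. The Check type and the onlyVolatileChanged function are hypothetical simplifications, not the provider's schema.

```go
package main

import "fmt"

// Check is a hypothetical, simplified view of a service check.
type Check struct {
	ID       string
	Name     string
	Interval string
	Status   string // assumed to be updated asynchronously by the agent
}

// onlyVolatileChanged reports whether the two check lists differ at most in
// fields assumed to be volatile (only Status in this sketch). Any added,
// removed, or otherwise modified check returns false, so a real change
// would never be hidden.
func onlyVolatileChanged(oldChecks, newChecks []Check) bool {
	if len(oldChecks) != len(newChecks) {
		return false
	}
	byID := make(map[string]Check, len(oldChecks))
	for _, c := range oldChecks {
		byID[c.ID] = c
	}
	for _, n := range newChecks {
		o, ok := byID[n.ID]
		if !ok {
			return false
		}
		// Blank out the volatile field on both copies before comparing.
		o.Status, n.Status = "", ""
		if o != n {
			return false
		}
	}
	return true
}

func main() {
	before := []Check{{ID: "service:cache-foo", Name: "cache-foo", Interval: "30s", Status: "critical"}}
	after := []Check{{ID: "service:cache-foo", Name: "cache-foo", Interval: "30s", Status: "passing"}}
	fmt.Println(onlyVolatileChanged(before, after))
}
```

Whether this is safe depends on which attributes actually change, which is why the complete diff matters: if the whole check block is missing from state after the first read, a field-level comparison like this would not apply.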

I will run additional tests on my end. Can you please post the complete diff if it happens to you again? It would help to understand which attributes are changing.