Provider produced inconsistent result after apply when interacting with remote secondary datacenters

danieleva commented 3 years ago

Terraform Version

Terraform v0.14.7
registry.terraform.io/hashicorp/consul v2.11.0
consul 1.9.4

Affected Resource(s)

consul_acl_policy

Reproducing the issue requires some setup. I have two Consul datacenters, WAN federated with ACL replication enabled. The primary is in the US, the secondary in Asia/Pacific. There is ~200ms of latency on the WAN connection used for federation. If Terraform is configured to connect to the Consul API in the remote (secondary) datacenter, consul_acl_policy creation fails with:

consul_acl_policy.test: Creating...

Error: Provider produced inconsistent result after apply

When applying changes to consul_acl_policy.test, provider
"registry.terraform.io/hashicorp/consul" produced an unexpected new value:
Root resource was present, but now absent.

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

This fails:

provider "consul" {
  address    = "secondary-dc:8500"
  datacenter = "secondary"
  token      = "...."
}

resource "consul_acl_policy" "test" {
  name        = "my_policy"
  datacenters = ["secondary"]
  rules       = <<-RULE
    node_prefix "" {
      policy = "read"
    }
    RULE
}

If I force the provider to use the primary datacenter, the resource is created correctly:

provider "consul" {
  address    = "secondary-dc:8500"
  datacenter = "primary"
  token      = "...."
}

resource "consul_acl_policy" "test" {
  name        = "my_policy"
  datacenters = ["secondary"]
  rules       = <<-RULE
    node_prefix "" {
      policy = "read"
    }
    RULE
}

consul_acl_policy.test: Creating...
consul_acl_policy.test: Creation complete after 1s [id=d9e929b5-28e3-95ac-615f-abf1453b52a2]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Debug logs on Consul show the issue. In both cases the provider is connected to a server in the secondary datacenter.

When the provider is configured with datacenter=secondary:

consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=4
consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.server.replication.acl.policy: acl replication: local=3 remote=4
consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
consul[17468]: 2021-03-26T14:39:15.852Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/acl/policy?dc=secondary from=127.0.0.1:41096 latency=211.389717ms  <--- Terraform request to create policy, forwarded to primary
consul[17468]: 2021-03-26T14:39:16.037Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/policy/62cf9c85-4d1b-e1b8-87b0-b37e0559a6bf?dc=secondary from=127.0.0.1:41096 error="ACL not found"
consul[17468]: 2021-03-26T14:39:16.038Z [DEBUG] agent.http: Request finished: method=GET url=/v1/acl/policy/62cf9c85-4d1b-e1b8-87b0-b37e0559a6bf?dc=secondary from=127.0.0.1:41096 latency=264.415µs  <--- Terraform request to read the policy back, response from local agent
consul[17468]: 2021-03-26T14:39:16.059Z [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
consul[17468]: 2021-03-26T14:39:16.059Z [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
consul[17468]: 2021-03-26T14:39:16.063Z [DEBUG] agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=111
consul[17468]: 2021-03-26T14:39:16.063Z [DEBUG] agent.server.replication.acl.policy: acl replication - finished updates
consul[17468]: 2021-03-26T14:39:16.063Z [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=198

When the provider is configured with datacenter=primary:

consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=5
consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.server.replication.acl.policy: acl replication: local=4 remote=5
consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
consul[17468]: 2021-03-26T14:40:07.664Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/acl/policy?dc=primary from=127.0.0.1:41114 latency=208.652671ms  <--- Terraform request to create policy, forwarded to primary
consul[17468]: 2021-03-26T14:40:07.870Z [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
consul[17468]: 2021-03-26T14:40:07.870Z [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
consul[17468]: 2021-03-26T14:40:07.872Z [DEBUG] agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=111
consul[17468]: 2021-03-26T14:40:07.872Z [DEBUG] agent.server.replication.acl.policy: acl replication - finished updates
consul[17468]: 2021-03-26T14:40:07.872Z [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=202
consul[17468]: 2021-03-26T14:40:08.059Z [DEBUG] agent.http: Request finished: method=GET url=/v1/acl/policy/76661435-a972-ffa4-eeed-3c8658b89f09?dc=primary from=127.0.0.1:41114 latency=206.995291ms  <--- Terraform request to read the policy back, forwarded to primary

In both cases the first part of the flow is identical; the behaviour changes when reading the policy back from Consul:

1. The Terraform provider sends a PUT to /v1/acl/policy.
2. The local Consul server forwards the PUT to the primary datacenter.
3. The primary datacenter creates the policy and triggers a sync to the secondary.
4. The Terraform provider sends a GET to /v1/acl/policy/<policy_id>:
   1. if datacenter=secondary, the local agent replies, and since replication has not completed yet, the provider gets an "ACL not found" error and breaks.
   2. if datacenter=primary, the request is forwarded to the primary and the provider completes correctly.

A naive workaround, adding time.Sleep(10 * time.Second) before the return in resourceConsulACLPolicyCreate to allow for acl replication to complete fixes the problem, but I don't think that's the proper way to address this.

The provider documentation is not clear on how the provider should be configured when dealing with federated datacenters. If the datacenter parameter in the provider must point at the primary, that should be explicit in the documentation, in addition to ensuring all resources specify the datacenter they refer to if it's not the primary. IMHO a better option would be to add some retry logic in the resources, to account for the delay and the eventually consistent nature of ACL federation. In my tests the replication is still very fast, usually under 1s, so a configurable retry with exponential backoff would handle it nicely. If you agree on the retry solution, I'm happy to provide a PR for it.

[GH-167] partially addressed this, but didn't add any retry logic.

Thanks :)

remilapeyre commented 3 years ago

Hi @danieleva, thanks for reporting this issue. I did not test much with federated datacenters; the provider certainly behaves weirdly in these cases and is probably not consistent across resources. I will have a look in the coming days to find the best way to proceed. The retry solution looks appropriate for ACLs, but I would like to make sure it is.

rrijkse commented 3 years ago

@remilapeyre Any updates on this? It is still an issue with the latest version of Terraform/Consul provider.

remilapeyre commented 3 years ago

Hi @rrijkse, I made some tests and found the way I wanted to implement this. I will come back to this issue and try to fix all the resources that currently show this behaviour

erisnar commented 2 years ago

We experienced the same issue and solved it by configuring the provider to the primary datacenter.

next-jesusmanuelnavarro commented 1 year ago

I also found what seems to be related behaviour when creating intentions in a federated secondary datacenter. Terraform successfully creates intentions when pointing to the primary but fails when pointing to the secondary.

Note the intention is in fact created.

2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 17
2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate.d/[redacted]
2023-06-28T15:42:37.315+0200 [ERROR] vertex "consul_config_entry.intentions[\"[redacted]\"]" error: failed to read config entry after setting it.

terraform apply against the secondary fails the first time (although the intention is in fact created) and succeeds when run a second time.

Given the error, "failed to read config entry after setting it.", a workaround may be to catch that error and retry a few times with an increasing wait time (e.g. 1 sec, then 2, then 4, then 8) before finally giving up.

jmnavarrol commented 1 year ago

Hi @rrijkse, I made some tests and found the way I wanted to implement this. I will come back to this issue and try to fix all the resources that currently show this behaviour

Hi @remilapeyre: did you manage to advance on this issue?

At least you might apply @danieleva 's suggested workaround _"A naive workaround, adding time.Sleep(10 * time.Second) before the return in resourceConsulACLPolicyCreate to allow for acl replication to complete fixes the problem"_ (quite possibly a lower wait time would do the trick as I also saw time in the 1~3 seconds range for replication) till you find the time/inspiration for a better solution.

TIA

7fELF commented 11 months ago

I opened a PR to fix this. @remilapeyre can you take a look? https://github.com/hashicorp/terraform-provider-consul/pull/385

7fELF commented 11 months ago

I opened a PR to fix this. @remilapeyre can you take a look? #385

I published the patched version to the registry to make it easier to validate/test: https://registry.terraform.io/providers/7fELF/consul/latest

next-jesusmanuelnavarro commented 11 months ago

I opened a PR to fix this. @remilapeyre can you take a look? #385

I published the patched version to the registry to make it easier to validate/test: https://registry.terraform.io/providers/7fELF/consul/latest

I was able to test it today with the following version constraints:

terraform {
  required_version = "= 1.4.6"

  required_providers {
    consul = {
      source  = "7fELF/consul"
      version = "= 2.20.1"
    }
    null   = "= 3.2.1"
  }
}

I can still reproduce the bug: upon first terraform apply I get the following error:

consul_config_entry.intentions["REDACTED"]: Creating...
╷
│ Error: failed to read config entry after setting it.
│ This may happen when some attributes have an unexpected value.
│ Read the documentation at https://www.consul.io/docs/agent/config-entries/service-intentions.html
│ to see what values are expected
│ 
│   with consul_config_entry.intentions["REDACTED"],
│   on main.tf line 30, in resource "consul_config_entry" "intentions":
│   30: resource "consul_config_entry" "intentions" {
│ 
╵

The intention is nevertheless properly created and I can see it on Consul webui. A second terraform apply finishes successfully and terraform destroy works as expected on first run.

This is exactly the same behaviour I got with consul = "= 2.17.0".

Here is the relevant code in main.tf (Consul access variables come from the shell environment and point to a remote secondary datacenter):

# Loops through intentions
resource "consul_config_entry" "intentions" {
  for_each = {
    for intention in local.intentions:
      intention.name => intention
  }

  name = each.value.name
  kind = "service-intentions"

  config_json = jsonencode({
    Sources = [
      for source in each.value.sources: {
        Name       = source
        Type       = "consul"
        Action     = "allow"
        Namespace  = "default"
        Partition  = "default"
        Precedence = 9
      }
    ]
  })
}

7fELF commented 11 months ago

Thanks for testing my patch, @next-jesusmanuelnavarro. My patch currently fixes the following resources:

auth_method
binding_rule
policy
role
role_policy_attachment
token
token_policy_attachment
token_role_attachment

I'm not a service mesh user, but according to the docs, setting a replication token also enables service mesh data replication.

So to also fix it, I need to figure out:

Which resources (referred to as "service mesh data") are replicated
Which replication index each of those resources increases:

(redacted)@(redacted):~$ curl http://localhost:8500/v1/acl/replication | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   353  100   353    0     0   199k      0 --:--:-- --:--:-- --:--:--  344k
{
  "Enabled": true,
  "Running": true,
  "SourceDatacenter": "(redacted)-preprod",
  "ReplicationType": "tokens",
  "ReplicatedIndex": 133549002,
  "ReplicatedRoleIndex": 133549003,
  "ReplicatedTokenIndex": 133547011,
  "LastSuccess": "2023-12-11T12:51:53Z",
  "LastError": "2023-12-07T12:52:24Z",
  "LastErrorMessage": "failed to retrieve remote ACL tokens: rpc error making call: ACL not found"
}
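
For the ACL resources, one way the status output above could be used is to compare the relevant replicated index against the index of the write that was just made. The Go sketch below does only that comparison. It rests on assumptions: `replicationStatus` and `caughtUp` are hypothetical names, the mapping of resource kind to index field (policies to ReplicatedIndex, roles to ReplicatedRoleIndex, tokens to ReplicatedTokenIndex) is inferred from the field names, and it says nothing about which index, if any, tracks service mesh data.

```go
package main

import (
	"encoding/json"
)

// replicationStatus holds the subset of /v1/acl/replication fields used here.
// Field names match the JSON keys shown in the output above.
type replicationStatus struct {
	Enabled              bool
	Running              bool
	ReplicatedIndex      uint64
	ReplicatedRoleIndex  uint64
	ReplicatedTokenIndex uint64
}

// caughtUp reports whether the replication status in body has reached
// wantIndex for the given kind ("policy", "role" or "token").
// The kind-to-field mapping is an assumption, see the text above.
func caughtUp(body []byte, kind string, wantIndex uint64) (bool, error) {
	var st replicationStatus
	if err := json.Unmarshal(body, &st); err != nil {
		return false, err
	}
	if !st.Enabled || !st.Running {
		return false, nil // replication is not active, so it cannot catch up
	}
	got := st.ReplicatedIndex // default: policies
	switch kind {
	case "role":
		got = st.ReplicatedRoleIndex
	case "token":
		got = st.ReplicatedTokenIndex
	}
	return got >= wantIndex, nil
}
```

A caller would poll the endpoint and stop once `caughtUp` returns true for the index returned by the write, or give up after a timeout.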

next-jesusmanuelnavarro commented 11 months ago

Thanks for testing my patch @next-jesusmanuelnavarro My patch currently fixes the following resources: So to also fix it, I need to figure out:

Which resources (referred to as "service mesh data") are replicated

Which replication index each of those resources increases:

On this, I can be of little help as I don't admin my Consul cluster, I'm just a user of it (in fact, I can't even list policies with my credentials).

All I can say, if that's what you mean, is my use case is for service-intentions, service-defaults, service-resolver and service-splitters. https://developer.hashicorp.com/consul/docs/connect/config-entries/service-intentions

remilapeyre commented 11 months ago

Hi, this is a long-standing issue and the patch from @7fELF looks like the right way forward to fix it. I wish this could be handled automatically by the Consul Go client, but we should move forward with the current approach first and improve the situation for all users of the Go client later. Regarding the inconsistency with the config entry, I'm not sure the same fix is applicable, but I will look into that as well.

danihuerta commented 8 months ago

Any update on this? I'm experiencing the same issue in my federated clusters when pointing to the secondary DC. Will the fix be included in the new release?

hashicorp / terraform-provider-consul

Provider produced inconsistent result after apply when interacting with remote secondary datacenters #249
