danieleva opened this issue 3 years ago
Hi @danieleva, thanks for reporting this issue. I did not test much with federated datacenters; the provider certainly behaves weirdly in these cases and is probably not coherent across resources. I will have a look in the coming days to find the best way to proceed. The retry solution looks appropriate for ACLs, but I would like to make sure it is.
@remilapeyre Any updates on this? It is still an issue with the latest versions of Terraform and the Consul provider.
Hi @rrijkse, I made some tests and found the way I want to implement this. I will come back to this issue and try to fix all the resources that currently show this behaviour.
We experienced the same issue and solved it by configuring the provider to the primary datacenter.
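For reference, a minimal sketch of that workaround; the address and datacenter name below are placeholders, not the commenter's actual values:

```hcl
# Point the provider at the primary datacenter so that ACL writes and
# the provider's immediate read-back both hit the authoritative cluster.
provider "consul" {
  address    = "consul-primary.example.com:8500"
  datacenter = "primary-dc"
}
```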
I also found what seems to be a related behaviour when creating intentions in a federated secondary datacenter: Terraform successfully creates intentions when pointing to the primary but fails when pointing to the secondary.
Note the intention is in fact created.
```
2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 17
2023-06-28T15:42:37.308+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate.d/[redacted]
2023-06-28T15:42:37.315+0200 [ERROR] vertex "consul_config_entry.intentions[\"[redacted]\"]" error: failed to read config entry after setting it.
```
Running `terraform apply` against the secondary fails the first time (while the intention is in fact created) and succeeds when applied a second time.
Given the error provided, "failed to read config entry after setting it.", it seems a workaround may be to catch that error and retry a few times with increasing wait times (e.g., 1 second, then 2, then 4, then 8) before finally giving up.
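A rough Go sketch of that catch-and-retry idea; this is not the provider's actual code, and the function name and error matching are illustrative only:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// readBackWithBackoff retries a read callback with exponential backoff
// (1s, 2s, 4s, 8s) and gives up after the fourth failed attempt.
func readBackWithBackoff(read func() error) error {
	wait := time.Second
	var err error
	for attempt := 0; attempt < 4; attempt++ {
		if err = read(); err == nil {
			return nil
		}
		// Only retry the specific read-back failure seen in this issue.
		if !strings.Contains(err.Error(), "failed to read config entry") {
			return err
		}
		time.Sleep(wait)
		wait *= 2 // 1s -> 2s -> 4s -> 8s
	}
	return fmt.Errorf("giving up after retries: %w", err)
}

func main() {
	calls := 0
	err := readBackWithBackoff(func() error {
		calls++
		if calls < 3 { // simulate replication lag on the first two reads
			return errors.New("failed to read config entry after setting it.")
		}
		return nil
	})
	fmt.Println(calls, err) // prints: 3 <nil>
}
```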
Hi @remilapeyre: did you manage to make progress on this issue?
At least you might apply @danieleva's suggested workaround _"A naive workaround, adding time.Sleep(10 * time.Second) before the return in resourceConsulACLPolicyCreate to allow for acl replication to complete fixes the problem"_ (quite possibly a lower wait time would do the trick, as I also saw replication times in the 1-3 second range) until you find the time/inspiration for a better solution.
TIA
I opened a PR to fix this. @remilapeyre can you take a look? https://github.com/hashicorp/terraform-provider-consul/pull/385
I published the patched version to the registry to make it easier to validate/test: https://registry.terraform.io/providers/7fELF/consul/latest
I could test it today with the following version definitions:
```hcl
terraform {
  required_version = "= 1.4.6"

  required_providers {
    consul = {
      source  = "7fELF/consul"
      version = "= 2.20.1"
    }
    null = "= 3.2.1"
  }
}
```
I can still reproduce the bug: upon the first `terraform apply` I get the following error:
```
consul_config_entry.intentions["REDACTED"]: Creating...
╷
│ Error: failed to read config entry after setting it.
│ This may happen when some attributes have an unexpected value.
│ Read the documentation at https://www.consul.io/docs/agent/config-entries/service-intentions.html
│ to see what values are expected
│
│ with consul_config_entry.intentions["REDACTED"],
│   on main.tf line 30, in resource "consul_config_entry" "intentions":
│   30: resource "consul_config_entry" "intentions" {
│
╵
```
The intention is nevertheless properly created and I can see it in the Consul web UI. A second `terraform apply` finishes successfully, and `terraform destroy` works as expected on the first run. This is exactly the same behaviour I got with `consul = "= 2.17.0"`.
Also, here is the relevant code in main.tf (the Consul access variables come from the shell environment and point to a remote secondary datacenter):
```hcl
# Loops through intentions
resource "consul_config_entry" "intentions" {
  for_each = {
    for intention in local.intentions :
    intention.name => intention
  }

  name = each.value.name
  kind = "service-intentions"

  config_json = jsonencode({
    Sources = [
      for source in each.value.sources : {
        Name       = source
        Type       = "consul"
        Action     = "allow"
        Namespace  = "default"
        Partition  = "default"
        Precedence = 9
      }
    ]
  })
}
```
Thanks for testing my patch @next-jesusmanuelnavarro. My patch currently fixes the following resources:

I'm not a service mesh user, but according to the docs, setting a replication token also enables service mesh data replication. So to also fix it, I need to figure out:

- Which resources (referred to as "service mesh data") are replicated
- Which replication index each of those resources increases
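For context on the replication token mentioned above, it is set in the agent configuration of the servers in the secondary datacenter; a minimal sketch with placeholder values:

```hcl
# Consul agent configuration on servers in the secondary datacenter.
# Setting the replication token enables ACL token replication and,
# per the docs, also enables replication of service mesh data.
primary_datacenter = "dc1"

acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_replication = true

  tokens {
    # A token created in the primary with sufficient ACL permissions.
    replication = "00000000-0000-0000-0000-000000000000"
  }
}
```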
```console
(redacted)@(redacted):~$ curl http://localhost:8500/v1/acl/replication | jq
{
  "Enabled": true,
  "Running": true,
  "SourceDatacenter": "(redacted)-preprod",
  "ReplicationType": "tokens",
  "ReplicatedIndex": 133549002,
  "ReplicatedRoleIndex": 133549003,
  "ReplicatedTokenIndex": 133547011,
  "LastSuccess": "2023-12-11T12:51:53Z",
  "LastError": "2023-12-07T12:52:24Z",
  "LastErrorMessage": "failed to retrieve remote ACL tokens: rpc error making call: ACL not found"
}
```
> Thanks for testing my patch @next-jesusmanuelnavarro. My patch currently fixes the following resources: […] So to also fix it, I need to figure out:
>
> - Which resources (referred to as "service mesh data") are replicated
> - Which replication index each of those resources increases
On this I can be of little help, as I don't administer my Consul cluster; I'm just a user of it (in fact, I can't even list policies with my credentials).
All I can say, if that's what you mean, is that my use case covers service-intentions, service-defaults, service-resolver and service-splitter config entries. https://developer.hashicorp.com/consul/docs/connect/config-entries/service-intentions
Hi, this is a long-standing issue and the patch from @7fELF looks like the right way forward to fix this. I wish this could be handled automatically by the Consul Go client, but we should move forward with the current approach first and improve the situation for all users of the Go client later. Regarding the inconsistency with the config entry, I'm not sure the same fix is applicable, but I will look into that as well.
Any update on this? I'm experiencing the same issue in my federated clusters when pointing to the secondary DC. Will the fix be included in the next release?
Terraform Version
```
Terraform v0.14.7
registry.terraform.io/hashicorp/consul v2.11.0
consul 1.9.4
```
Affected Resource(s)
- `consul_acl_policy`
Reproducing the issue requires some setup. I have two Consul datacenters, WAN federated, with ACL replication enabled. The primary is in the US, the secondary in Asia/Pacific, with ~200 ms of latency on the WAN connection used for federation. If Terraform is configured to connect to the Consul API on the remote (secondary) datacenter, `consul_acl_policy` creation fails with an `ACL not found` error.
Connecting to the secondary fails, while forcing the provider to use the primary datacenter creates the resource correctly; see the sketch below.
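A hypothetical reconstruction of the two cases; the addresses, names, and rules are placeholders, not the reporter's actual configuration:

```hcl
# Provider pointed at the secondary datacenter: the create succeeds but
# the provider's immediate read-back races ACL replication and fails.
provider "consul" {
  address    = "consul.apac.example.com:8500"
  datacenter = "secondary"
}

# Pointing the provider at the primary instead works:
#   address    = "consul.us.example.com:8500"
#   datacenter = "primary"

resource "consul_acl_policy" "test" {
  name  = "test-policy"
  rules = <<-RULE
    node_prefix "" {
      policy = "read"
    }
  RULE
}
```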
Debug logs on Consul show the issue; in both cases the provider is connected to a server in the secondary datacenter, once configured with `datacenter=secondary` and once with `datacenter=primary`. The first part of the flow is identical in both cases: the policy is created through `/v1/acl/policy`. The behaviour changes when reading the policy back: with `datacenter=secondary`, the follow-up read of `/v1/acl/policy/<policy_id>` hits the secondary before replication has completed, returns an `ACL not found` error and breaks.

A naive workaround, adding `time.Sleep(10 * time.Second)` before the return in `resourceConsulACLPolicyCreate` to allow ACL replication to complete, fixes the problem, but I don't think that's the proper way to address this.

The provider documentation is not clear on what the configuration should be when dealing with federated datacenters. If the `datacenter` parameter in the provider must point at the primary, that should be explicit in the documentation, in addition to ensuring all resources specify the datacenter they refer to when it is not the primary. IMHO a better option would be to add some retry logic to the resources, to account for the delay and the eventually consistent nature of ACL federation. In my tests replication is still very fast, usually under 1s, so a configurable retry with exponential backoff would handle it nicely. If you agree on the retry solution, I'm happy to provide a PR for it.
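If the retry route is taken, the terraform-plugin-sdk already ships a polling helper that fits this pattern. A sketch of how the read-back in a create function could be wrapped; the helper name and error matching here are hypothetical, and only `resource.RetryContext`, `resource.RetryableError`, and `resource.NonRetryableError` are the SDK's actual API:

```go
package consul

import (
	"context"
	"strings"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// waitForACLReplication polls the given read function until it stops
// returning "ACL not found", giving the secondary datacenter time to
// catch up on ACL replication, or until the timeout expires.
func waitForACLReplication(ctx context.Context, read func() error) error {
	return resource.RetryContext(ctx, 30*time.Second, func() *resource.RetryError {
		err := read()
		if err == nil {
			return nil
		}
		if strings.Contains(err.Error(), "ACL not found") {
			// Likely replication lag rather than a real failure: retry.
			return resource.RetryableError(err)
		}
		return resource.NonRetryableError(err)
	})
}
```

Note that `RetryContext` manages the polling cadence internally rather than implementing the exact exponential backoff proposed above; since replication usually completes in under a second, that should be adequate in practice.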
[GH-167] partially addressed this, but didn't add any retry logic.
Thanks :)