Issue resolved. I missed `enable_token_replication = true` in the Consul ACL configuration.
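For reference, a minimal sketch of where this setting lives in the secondary datacenter's Consul server configuration; the replication token value is a placeholder, and the surrounding values should be adjusted to your setup:

```hcl
# Consul server config on the secondary datacenter (sketch).
# The replication token must be created in the primary with ACL write
# permission; its value here is a placeholder.
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_replication = true

  tokens {
    replication = "<replication-token>"
  }
}
```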
Solved, see above.
Glad to hear you got it figured out @daniel-at-matt3r! And thank you for reporting how you fixed your issue -- that's very helpful when other people search issues.
Hi @daniel-at-matt3r, I have a similar setup. Did you set `enable_token_replication = true` in your primary as well as your secondary Consul cluster? And which version of Consul are you running, anyway?
Hi @neuroserve, I added it to both the primary and secondary DCs. But I think there's no need to have it in the primary; logically, the primary designation should allow local token creation (I could be wrong, though). @tgross, correct me if I'm headed in the wrong direction. Just a side note: we're getting a very robust Nomad cluster, and we're looking toward Consul mesh gateways, very interesting tech.
In the tutorial https://developer.hashicorp.com/consul/docs/security/acl/acl-federated-datacenters?productSlug=consul&tutorialSlug=security-operations&tutorialSlug=access-control-replication-multiple-datacenters, `enable_token_replication = true` is only set in the primary Consul cluster.
I cannot set it in the secondaries after the datacenters have been joined. And I think I wasn't able to join as long as it was set in the secondaries. Can you tell me what Consul version you are using?
I have four Consul datacenters federated with two mesh gateways each (with four Nomad clusters "on top"). I noticed that change in behaviour when I tried to configure `auto_config` in one of the secondaries (see https://discuss.hashicorp.com/t/auto-config-vs-acl-token-replication/67612).
@daniel-at-matt3r Could you please still tell us the Consul version you are using?
@neuroserve, sorry, I missed your message. We are on Consul 1.19.1 and Nomad 1.8.0.
I'll try to set that up too, with "cross replication" enabled.
I cannot set up federated Consul clusters with version 1.19.1 with `enable_token_replication = true` enabled in both the primary and the secondary cluster. Neither setting it from the start nor setting one after the other works. Only setting it in the primary (and setting the replication token in the secondary) gives me stable replication.
Tokens with global scope are replicated and can be created on the primary as well as the secondary.
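For what it's worth, a quick way to see the difference from a secondary; a sketch assuming a policy named `consul-server-policy` exists, as in the reproduction further down:

```shell
# Global token: no -local flag; the secondary forwards the request to the
# primary, and the resulting token is replicated back.
consul acl token create -description "global test" -policy-name "consul-server-policy"

# Local token: on a secondary this only works if its servers run with
# enable_token_replication = true; otherwise it fails with
# "Local tokens are disabled".
consul acl token create -local -description "local test" -policy-name "consul-server-policy"
```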
I get no token error when I try to run the above job. The tokens are created in Consul, but with local scope. Now I need a test job file that deploys services in both datacenters of the region and has them communicate via Consul Connect.
I can run the countdash job
job "countdash" {
datacenters = ["prod1"]
node_pool = "static-clients"
group "api" {
network {
mode = "bridge"
}
service {
name = "count-api"
port = "9001"
connect {
sidecar_service {}
}
}
task "web" {
driver = "docker"
config {
image = "hashicorpdev/counter-api:v3"
}
}
}
group "dashboard" {
network {
mode = "bridge"
port "http" {
static = 9002
to = 9002
}
}
service {
name = "count-dashboard"
port = "http"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "count-api"
local_bind_port = 8080
}
}
}
}
}
task "dashboard" {
driver = "docker"
env {
COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
}
config {
image = "hashicorpdev/counter-dashboard:v3"
}
}
}
}
in the Nomad cluster "above" the primary Consul datacenter without problems. As soon as I change the datacenter to one of the secondaries, I can no longer start it (the Envoy proxy cannot be started). The error message is:

```
Task hook failed: envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error
```
Envoy logs this:
```
failed fetch proxy config from local agent: Unexpected response code: 403 (Permission denied: token with AccessorID 'primary-dc-down' lacks permission 'service:read' on "count-api-sidecar-proxy")
```
I guess that is due to the generated SI tokens being local rather than global, and therefore not replicated by Consul's ACL replication.
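A sketch for checking this, assuming the Nomad-generated SI tokens can be identified by their description (the filter is a guess; adjust it to match your tokens):

```shell
# List ACL tokens on the secondary and show whether each matching
# token is local or global (Local: true/false).
consul acl token list -format=json \
  | jq '.[] | select(.Description | test("nomad"; "i")) | {AccessorID, Local, Description}'
```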
Am I missing something, or is it no longer possible to run service-mesh-protected services within Nomad across separate cloud providers? Can someone explain how https://github.com/hashicorp/consul/issues/7381 was resolved? Do I have to change my config somewhere?
Nomad version
1.8.0
Operating system and Environment details
All nodes are Debian 12. Two federated Consul clusters running in different cloud providers; ACLs enabled, replication enabled and working fine. Two federated Nomad clusters running in different cloud providers.
Issue
When submitting a Nomad job with a Consul Connect service using the command `nomad job run -region=gcp-us-west1 clearml-redis.hcl` on the Nomad cluster connected to the non-primary Consul cluster, the job fails with the following error:

```
Task hook failed: consul_si_token: Unexpected response code: 500 (rpc error making call: Local tokens are disabled)
```
Reproduction steps
The issue can be reproduced using the Consul CLI:

```
root@nomad-generic-clients-5lsz:~# consul acl token create -local -policy-name "consul-server-policy"
Failed to create new token: Unexpected response code: 500 (rpc error making call: Local tokens are disabled)
```
Expected Result
Nomad job is running
Actual Result
The Nomad job fails to start because Nomad tries to obtain a local token on the non-primary federated Consul cluster.
Job file (if appropriate)
variable "nomad-namespace" { type = string description = "Nomad namespace to deploy envoy proxy to" default = "default" }
job "clearml-redis" { datacenters = ["*"] namespace = "${var.nomad-namespace}"
constraint { attribute = "${node.class}" operator = "regexp" value = "generic" }
group "clearml-redis" {
} }
Based on my research, Nomad has requested local tokens since version 1.3.0 (see https://support.hashicorp.com/hc/en-us/articles/17767098217235-How-Nomad-Manages-ACL-Tokens-Polices-for-Consul-Service-Mesh).
But how should it obtain a token, which is local, on a non-primary cluster?
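(As the resolution above turned out, setting `enable_token_replication = true` on the secondary's servers is what allows local token creation there. To confirm token replication is actually running on a secondary, the ACL replication status endpoint can be queried; a sketch assuming the default local HTTP address:)

```shell
# Query ACL replication status on a secondary server.
# "ReplicationType" should be "tokens" and "Running" should be true
# once enable_token_replication is in effect.
curl -s http://127.0.0.1:8500/v1/acl/replication | jq
```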