crizstian opened this issue 4 years ago
https://github.com/hashicorp/consul/issues/7906
That was what we found - we don't have TLS enabled, but the behavior you describe is the same. It's a bit strange, as Consul "looks" like it's working, but it isn't without the replication token set in the agent.
Does this help you resolve the issue @Crizstian ?
@idrennanvmware it didn't work for me, or maybe I am missing something. I am still getting errors after federation with ACLs enabled:
2020-05-20T16:38:54.209Z [WARN] agent: Coordinate update blocked by ACLs: accessorID=
2020-05-20T16:38:54.230Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=172.20.20.21:43596 error="ACL not found"
2020-05-20T16:38:54.230Z [DEBUG] agent.http: Request finished: method=GET url=/v1/kv/vault/core/lock from=172.20.20.21:43596 latency=399.873µs
2020-05-20T16:38:54.234Z [ERROR] agent.http: Request error: method=GET url=/v1/catalog/service/vault?stale= from=172.20.20.21:43596 error="ACL not found"
2020-05-20T16:38:54.235Z [DEBUG] agent.http: Request finished: method=GET url=/v1/catalog/service/vault?stale= from=172.20.20.21:43596 latency=2.821513ms
2020-05-20T16:38:54.236Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/register from=172.20.20.21:43596 error="ACL not found"
2020-05-20T16:38:54.237Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/agent/service/register from=172.20.20.21:43596 latency=205.153µs
Steps I followed are:
1. spin up two datacenters
2. bootstrap the ACL system in both DCs
3. update the secondary DC config, setting the primary_datacenter value
4. create the ACL replication policy, following the documentation
5. set the replication agent token in the secondary DC
Previously I had done
1.- spin up 2 datacenters
2.- automatically federate clusters with retry_join_wan
3.- bootstrap dc1
4.- set dc1 root token as agent token for dc2
but I am getting the same failures. Probably I am missing something silly, or Consul is just not working with federation with ACLs enabled.
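For reference, the replication policy and token from step 4 above can be created along these lines. This is only a sketch: the policy rules follow the Consul ACL replication documentation, but the policy name and file name are my own.

```shell
# Sketch: create the ACL replication policy and token in the primary DC.
# Rules are from the Consul ACL replication docs; names here are illustrative.
cat > replication-policy.hcl <<'EOF'
acl      = "write"
operator = "write"
service_prefix "" {
  policy     = "read"
  intentions = "read"
}
EOF

consul acl policy create -name replication -rules @replication-policy.hcl
consul acl token create -description "replication token" -policy-name replication
```

The SecretID of the resulting token is what goes into the `replication` token slot on the secondary DC's servers.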
Hi @Crizstian - let me give you our steps (we did NOT have success with your approach of bootstrapping Consul in two places).
1. Set up Consul (and bootstrap ACLs on the primary cluster). At the end of this you should have ACL tokens, and the replication token set to the value of your token with the right permissions. Let's call this "primary-cluster". You will need to have ensured that the following are set: "enable_central_service_config": true "tokens": {
2. Set up Consul on cluster 2, BUT this time make sure that your gossip key AND your ACL token match the output from step 1, AND make sure that you set "primary_datacenter" in your config to the cluster you set up in step 1. So in our case the cluster name is "gateway-cluster", and the primary datacenter is "primary-cluster".
"enable_central_service_config": true "tokens": {
That's the way we have been able to reliably set ours up.
The only difference between what you are doing and what we are doing is (as I understand it) that we don't have TLS set up on the agents yet in these scenarios.
EDIT: Don't forget to set your intentions in Consul as well (I do star->star for testing)
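The star->star intention mentioned above can be set from the CLI as well (standard Consul command, shown here for completeness; this allows all traffic and is only suitable for testing):

```shell
# Allow-all intention: wildcard source and destination (testing only)
consul intention create -allow '*' '*'
```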
@Crizstian The above is as far as we've gotten. We've had ZERO success having a service in one DC talk to another DC (for example, the CountDash dashboard in DC1 talking to the CountDash API in DC2).
@tgross - do you have any example jobs for mesh gateway testing? To rule out PEBKAC on this end :)
edit: I've been working on this all day and gotten nowhere. I still suspect something isn't quite right either on the federation side or the mesh side, but I ended up in a scenario where I couldn't even unregister the service in the secondary cluster.
I have made some progress. I had to do the following:
1. spin up my cluster 1 (called sfo)
2. bootstrap ACLs in cluster 1
3. set the default and replication tokens; for simplicity I just set the Consul root token for both
This is my config template for any Consul server:
data_dir = "/var/consul/config/"
log_level = "DEBUG"
datacenter = "{{ env "DATACENTER" }}"
primary_datacenter = "{{ env "PRIMARY_DATACENTER" }}"
ui = true
server = true
bootstrap_expect = {{ env "CONSUL_SERVERS" }}
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
ports {
grpc = 8502
https = {{ if eq (env "CONSUL_SSL") "true" }}{{ env "CONSUL_PORT" }}{{ else }}-1{{end}}
http = {{ if eq (env "CONSUL_SSL") "true" }}-1{{ else }}{{ env "CONSUL_PORT" }}{{end}}
}
advertise_addr = "{{ env "HOST_IP" }}"
advertise_addr_wan = "{{ env "HOST_IP" }}"
{{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
retry_join_wan = {{ env "HOST_LIST" }}
{{end}}
enable_central_service_config = true
connect {
enabled = true
}
acl = {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
# enable_token_persistence = true
{{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
enable_token_replication = true
{{end}}
tokens = {
default = "{{ env "CONSUL_HTTP_TOKEN" }}"
replication = "{{ env "CONSUL_HTTP_TOKEN" }}"
}
}
verify_incoming = false
verify_incoming_rpc = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_outgoing = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_server_hostname = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
auto_encrypt = {
allow_tls = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
}
{{ if eq (env "CONSUL_SSL") "true" }}
ca_file = "{{ env "CONSUL_CACERT" }}"
cert_file = "{{ env "CONSUL_CLIENT_CERT" }}"
key_file = "{{ env "CONSUL_CLIENT_KEY" }}"
{{end}}
encrypt = "{{ env "CONSUL_ENCRYPT_KEY" }}"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true
telemetry = {
dogstatsd_addr = "10.0.2.15:8125"
disable_hostname = true
}
and in my case, once the config file is rendered, it looks like this:
root@sfo-consul-server:/vagrant/provision/terraform/tf_cluster/primary# cat /var/consul/config/consul.hcl
data_dir = "/var/consul/config/"
log_level = "DEBUG"
datacenter = "sfo"
primary_datacenter = "sfo"
ui = true
server = true
bootstrap_expect = 1
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
ports {
grpc = 8502
https = 8500
http = -1
}
advertise_addr = "172.20.20.11"
advertise_addr_wan = "172.20.20.11"
#
enable_central_service_config = true
connect {
enabled = true
}
acl = {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
# enable_token_persistence = true
tokens = {
default = "daca8d74-b0de-05bf-cc23-e095244e514e"
replication = "daca8d74-b0de-05bf-cc23-e095244e514e"
}
}
verify_incoming = false
verify_incoming_rpc = true
verify_outgoing = true
verify_server_hostname = true
auto_encrypt = {
allow_tls = true
}
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
encrypt = "apEfb4TxRk3zGtrxxAjIkwUOgnVkaD88uFyMGHqKjIw="
encrypt_verify_incoming = true
encrypt_verify_outgoing = true
telemetry = {
dogstatsd_addr = "10.0.2.15:8125"
disable_hostname = true
}
Then I spun up my second DC (called nyc) using the same template file mentioned above, setting the default and replication tokens to the same value from cluster 1 (sfo),
and this allowed me to federate the clusters correctly.
After that I could deploy my services in cluster 1 (sfo), and no service asked for the Consul token.
Then I deployed the same services in cluster 2 (nyc); the sidecar services didn't ask for the Consul token either, and they got registered correctly.
Look at the following example:
root@nyc-consul-server:/home/vagrant# nomad logs -stderr 97e14028 connect-proxy-payment-api
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:238] initializing epoch 0 (hot restart version=disabled)
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:240] statically linked extensions:
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:242] access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:245] filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:248] filters.listener: envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:251] filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:253] stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:255] tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.zipkin
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:258] transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:261] transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:267] buffer implementation: old (libevent)
[2020-05-21 03:19:17.955][1][warning][misc] [source/common/protobuf/utility.cc:199] Using deprecated option 'envoy.api.v2.Cluster.hosts' from file cds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2020-05-21 03:19:17.960][1][info][main] [source/server/server.cc:322] admin address: 127.0.0.1:19001
[2020-05-21 03:19:17.960][1][info][main] [source/server/server.cc:432] runtime: layers:
- name: static_layer
static_layer:
envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1: true
envoy.deprecated_features:envoy.api.v2.Cluster.tls_context: true
envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name: true
[2020-05-21 03:19:17.960][1][warning][runtime] [source/common/runtime/runtime_impl.cc:497] Skipping unsupported runtime layer: name: "static_layer"
static_layer {
fields {
key: "envoy.deprecated_features:envoy.api.v2.Cluster.tls_context"
value {
bool_value: true
}
}
fields {
key: "envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name"
value {
bool_value: true
}
}
fields {
key: "envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1"
value {
bool_value: true
}
}
}
[2020-05-21 03:19:17.960][1][info][config] [source/server/configuration_impl.cc:61] loading 0 static secret(s)
[2020-05-21 03:19:17.960][1][info][config] [source/server/configuration_impl.cc:67] loading 1 cluster(s)
[2020-05-21 03:19:17.963][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:144] cm init: initializing cds
[2020-05-21 03:19:17.965][1][info][config] [source/server/configuration_impl.cc:71] loading 0 listener(s)
[2020-05-21 03:19:17.965][1][info][config] [source/server/configuration_impl.cc:96] loading tracing configuration
[2020-05-21 03:19:17.965][1][info][config] [source/server/configuration_impl.cc:116] loading stats sink configuration
[2020-05-21 03:19:17.965][1][info][main] [source/server/server.cc:516] starting main dispatch loop
[2020-05-21 03:19:17.974][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:489] add/update cluster local_app during init
[2020-05-21 03:19:17.974][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:148] cm init: all clusters initialized
[2020-05-21 03:19:17.974][1][info][main] [source/server/server.cc:500] all clusters initialized. initializing init manager
[2020-05-21 03:19:17.977][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 03:19:17.977][1][info][config] [source/server/listener_manager_impl.cc:761] all dependencies initialized. starting workers
[2020-05-21 03:22:56.703][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 03:27:38.116][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 03:34:17.981][1][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain
[2020-05-21 03:36:09.007][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 04:14:04.376][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 04:15:07.144][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
But I had another issue: my health checks couldn't be registered, they were showing an error, so in order to continue I had to comment them out so I could have my services and sidecar services up and running.
Only the mesh gateway, which is deployed in both clusters, behaved differently: in cluster 1 (sfo) it didn't ask me for a token, but in cluster 2 it asked for the token; without it, it couldn't get deployed and registered in Consul.
So my Nomad job looks like the following in cluster 2 (nyc):
job "cinemas" {
datacenters = ["nyc-ncv"]
region = "nyc-region"
type = "service"
group "payment-api" {
count = 1
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "payment-api"
port = "3000"
// check {
// name = "payment-api-health"
// port = "healthcheck"
// type = "http"
// protocol = "http"
// path = "/ping"
// interval = "5s"
// timeout = "2s"
// expose = true
// }
connect {
sidecar_service {}
}
}
task "payment-api" {
driver = "docker"
config {
image = "crizstian/payment-service-go:v0.4"
}
env {
DB_SERVERS = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
SERVICE_PORT = "3000"
CONSUL_IP = "consul.service.consul"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
}
resources {
cpu = 50
memory = 50
}
}
}
group "notification-api" {
count = 1
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "notification-api"
port = "3001"
// check {
// name = "notification-api-health"
// port = "healthcheck"
// type = "http"
// protocol = "http"
// path = "/ping"
// interval = "5s"
// timeout = "2s"
// expose = true
// }
connect {
sidecar_service {}
}
}
task "notification-api" {
driver = "docker"
config {
image = "crizstian/notification-service-go:v0.4"
}
env {
SERVICE_PORT = "3001"
CONSUL_IP = "consul.service.consul"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
}
resources {
cpu = 50
memory = 50
}
}
}
group "mesh-gateway" {
count = 1
task "mesh-gateway" {
driver = "raw_exec"
config {
command = "consul"
args = [
"connect", "envoy",
"-mesh-gateway",
"-register",
"-http-addr", "172.20.20.21:8500",
"-grpc-addr", "172.20.20.21:8502",
"-wan-address", "172.20.20.21:${NOMAD_PORT_proxy}",
"-address", "172.20.20.21:${NOMAD_PORT_proxy}",
"-bind-address", "default=172.20.20.21:${NOMAD_PORT_proxy}",
"-token", "daca8d74-b0de-05bf-cc23-e095244e514e",
"--",
"-l", "debug"
]
}
resources {
cpu = 100
memory = 100
network {
port "proxy" {
static = 8433
}
}
}
}
}
}
Continuing with my mesh gateway configuration and testing, I just got hit by another error: I couldn't register my Consul intentions, since my services don't appear in the dropdown menu. In cluster 1 (sfo) it only displays the services running in cluster 1 (sfo) as source and destination for creating an intention; same behavior in cluster 2 (nyc), where I couldn't see any service running in cluster 1 when creating an intention.
So now I am blocked on this, and I need to create these intentions in order for my services to communicate over the service mesh. I have tried creating an allow-all to allow-all intention, but this didn't work.
In previous Consul versions I was able to create intentions and see the services in both clusters; something has changed here, so that now I am not allowed.
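One way to check what the secondary datacenter actually sees, independent of the UI dropdown, is to query the catalog for the other DC directly (standard Consul commands, not from this thread):

```shell
# From the nyc cluster, list services registered in the sfo datacenter
consul catalog services -datacenter sfo

# Or via the HTTP API (pass the token if ACLs deny anonymous reads)
curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  'http://127.0.0.1:8500/v1/catalog/services?dc=sfo'
```

If these return the sfo services but the UI doesn't show them, the problem is in the UI; if they return an ACL error, token replication or the token's catalog read permissions are the likelier culprit.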
Looks like we're both stuck at the same place now. We are using Nicholas Jackson's consul-envoy proxy to register and unregister gateways via Nomad, because we don't have access to the envoy binary on our images. BUT the gateways do not unregister from Consul when we use this, AND once we have registered a gateway we can't unregister it (manually) from Consul in our secondary cluster.
Regarding intentions @Crizstian - we set our intentions before doing this step: https://learn.hashicorp.com/consul/developer-mesh/connect-gateways#configure-sidecar-proxies-to-use-gateways It seems that if we do the step where we set the proxy-defaults, then after that we start seeing token ACL issues when navigating around the Consul UI between datacenters (note this is just anecdotal, as we're still trying to get to the bottom of why it suddenly starts giving errors in the UI).
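For anyone following along, the proxy-defaults step in that guide writes a config entry roughly like the following (paraphrased from the linked tutorial; whether "local" or "remote" gateway mode is right depends on your topology, so treat this as a sketch):

```shell
# Tell all sidecar proxies to route cross-DC upstream traffic
# through the mesh gateway (config entry from the Learn guide).
cat > proxy-defaults.hcl <<'EOF'
Kind = "proxy-defaults"
Name = "global"
MeshGateway {
  Mode = "local"
}
EOF

consul config write proxy-defaults.hcl
```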
We really are stuck at this stage, and it doesn't seem like we're doing anything unique or funky, so I'm at a loss on how to move forward. I'll keep working on this for another day or two, but we may have to look at alternative ways to achieve what we need if we can't get past this. I'll share what we find here. @Crizstian thank you for sharing your steps in detail - it seems we're on similar paths.
@Crizstian - have you tried setting your intentions to star -> star and "Allow"? It seems that if you set this in one DC it will show in the other too. I see the same issue you do, where services in cluster A don't show as available in cluster B for intentions. I've just done star -> star for now.
EDIT: I just used the command line and was able to make an intention between services in each datacenter, BUT it still doesn't work. See step 7 for the output from the API call.
So, a summary of the steps:
1. Create a primary-cluster. Bootstrap ACLs. For simplicity the master key is used for everything. This cluster also has Nomad (not federated, but using Consul).
2. Create a gateway-cluster that uses the gossip key from primary-cluster, and the ACL (master) token from primary-cluster. It has its own datacenter name, and the primary_datacenter name of the primary-cluster. This cluster also has Nomad (not federated, but using Consul).
3. Verify that federation is happening per "consul members -wan", AND that replication is working, by running the following curl on the gateway-cluster: curl --request GET http://127.0.0.1:8500/v1/acl/replication
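For reference, a healthy response from that replication endpoint reports the replication as enabled and running (field names are from the Consul ACL replication HTTP API; the values below are illustrative):

```shell
curl -s http://127.0.0.1:8500/v1/acl/replication
# A healthy secondary reports something like:
# {
#   "Enabled": true,
#   "Running": true,
#   "SourceDatacenter": "primary-cluster",
#   ...
# }
```

If "Running" is false or "LastError" is recent, the replication token on the secondary's servers is the first thing to check.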
4. Run the mesh gateway jobs. In primary-cluster: consul connect envoy -mesh-gateway -register -service "gateway-primary" -address ":23000" -wan-address "HOSTIP:23000" -admin-bind 127.0.0.1:19005 -token=MASTER_TOKEN
In gateway-cluster: consul connect envoy -mesh-gateway -register -service "gateway-secondary" -address ":23000" -wan-address "HOSTIP:23000" -admin-bind 127.0.0.1:19005 -token=MASTER_TOKEN
5. Ensure intentions are set to ALLOW star-star (note: for this example I also explicitly made another one across DCs with consul intention create api database).
6. Run the database job in primary-cluster:
job "database-backend" {
  datacenters = ["primary-cluster"]
  type = "service"

  group "database" {
    count = 1
network {
mode = "bridge"
}
service {
name = "database"
port = "25432"
connect {
sidecar_service {}
}
}
task "database" {
driver = "docker"
config {
image = "nicholasjackson/fake-service:v0.9.0"
}
env {
NAME = "database"
MESSAGE = "ok"
LISTEN_ADDR = "0.0.0.0:25432"
TIMING_VARIANCE = "25"
HTTP_CLIENT_KEEP_ALIVES = "true"
}
resources {
cpu = 100
memory = 256
}
}
  }
}
7. Run the api job in gateway-cluster:
job "api-frontend" {
  datacenters = ["gateway-cluster"]
  type = "service"

  group "api" {
    count = 1

    network {
      mode = "bridge"
      port "http" {
        to     = 9090
        static = 20345
      }
    }

    service {
      name = "api"
      port = "9090"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "database"
              local_bind_port  = 25432
            }
          }
        }
      }
    }

    task "api" {
      driver = "docker"

      config {
        image = "nicholasjackson/fake-service:v0.9.0"
      }

      env {
        NAME                    = "api"
        MESSAGE                 = "ok"
        LISTEN_ADDR             = "0.0.0.0:9090"
        UPSTREAM_URIS           = "http://localhost:25432"
        TIMING_VARIANCE         = "25"
        HTTP_CLIENT_KEEP_ALIVES = "true"
      }

      resources {
        cpu    = 100
        memory = 256
      }
    }
  }
}
The output from the API call in step 7:

{
  "name": "api",
  "uri": "/",
  "type": "HTTP",
  "ip_addresses": ["172.26.64.3"],
  "start_time": "2020-05-21T14:37:17.297829",
  "end_time": "2020-05-21T14:37:17.299712",
  "duration": "1.883618ms",
  "Headers": null,
  "upstream_calls": [
    {
      "uri": "http://localhost:25432",
      "Headers": null,
      "code": -1,
      "error": "Error communicating with upstream service: Get http://localhost:25432/: EOF"
    }
  ],
  "code": 500
}
Just a small addition: the countdash app works on both the gateway-cluster and the primary-cluster (as long as the call doesn't try to get proxied over the mesh gateway). So basically, within the context of each DC the mesh is working fine, but DC to DC just doesn't work.
I have found another issue. I have deployed my frontend services in cluster 1 (sfo) and my backend services in cluster 2 (nyc). In cluster 1 (sfo) my containers can ping any service.consul address,
but in cluster 2 (nyc) the behavior is different: inside the containers I am no longer able to ping / dig / make any kind of request to services under the .consul
domain, which means my service discovery doesn't work.
Here is the output of the test made in cluster 2 (nyc):
root@nyc-consul-server:/home/vagrant# nomad plan /vagrant/deployment-files/mesh-gw/cinemas.dc2.hcl
+ Job: "cinemas"
+ Task Group: "mesh-gateway" (1 create)
+ Task: "mesh-gateway" (forces create)
+ Task Group: "notification-api" (1 create/destroy update)
+ Task: "connect-proxy-notification-api" (forces create)
+ Task: "notification-api" (forces create)
+ Task Group: "payment-api" (1 create/destroy update)
+ Task: "connect-proxy-payment-api" (forces create)
+ Task: "payment-api" (forces create)
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 0
To submit the job with version verification run:
nomad job run -check-index 0 /vagrant/deployment-files/mesh-gw/cinemas.dc2.hcl
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
root@nyc-consul-server:/home/vagrant# nomad job run -check-index 0 /vagrant/deployment-files/mesh-gw/cinemas.dc2.hcl
==> Monitoring evaluation "d51f4331"
Evaluation triggered by job "cinemas"
Evaluation within deployment: "bbcb6b10"
Allocation "c40cfdbc" created: node "3e5954a5", group "mesh-gateway"
Allocation "d30bf9af" created: node "3e5954a5", group "payment-api"
Allocation "e2716db5" created: node "3e5954a5", group "notification-api"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d51f4331" finished with status "complete"
root@nyc-consul-server:/home/vagrant# nomad status cinemas
ID = cinemas
Name = cinemas
Submit Date = 2020-05-21T18:19:30Z
Type = service
Priority = 50
Datacenters = nyc-ncv
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
mesh-gateway 0 0 1 0 0 0
notification-api 0 0 1 0 0 0
payment-api 0 0 1 0 0 0
Latest Deployment
ID = bbcb6b10
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
mesh-gateway 1 1 0 0 2020-05-21T18:29:30Z
notification-api 1 1 0 0 2020-05-21T18:29:30Z
payment-api 1 1 0 0 2020-05-21T18:29:30Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
c40cfdbc 3e5954a5 mesh-gateway 0 run running 5s ago 5s ago
d30bf9af 3e5954a5 payment-api 0 run running 5s ago 3s ago
e2716db5 3e5954a5 notification-api 0 run running 5s ago 4s ago
root@nyc-consul-server:/home/vagrant# nomad alloc exec -task notification-api e2716db5 sh
/tmp # ping consul.service.consul
ping: bad address 'consul.service.consul'
/tmp # ping mongodb1.query.consul
ping: bad address 'mongodb1.query.consul'
/tmp # exit
root@nyc-consul-server:/home/vagrant# ping consul.service.consul
PING consul.service.consul (172.20.20.21) 56(84) bytes of data.
64 bytes from 172.20.20.21: icmp_seq=1 ttl=64 time=0.275 ms
64 bytes from 172.20.20.21: icmp_seq=2 ttl=64 time=0.073 ms
^C
--- consul.service.consul ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.073/0.174/0.275/0.101 ms
root@nyc-consul-server:/home/vagrant# ping mongodb.query.consul
ping: unknown host mongodb.query.consul
root@nyc-consul-server:/home/vagrant# ping mongodb1.query.consul
ping: unknown host mongodb1.query.consul
root@nyc-consul-server:/home/vagrant# nomad logs e2716db5 notification-api
CONSUL_SCHEME=https
CONSUL_HTTP_SSL=true
CONSUL_IP=consul.service.consul
CONSUL_PORT=8500
SSL IS ENABLED
Setting consul token
Setting custom command as the startup command
Setting secrets for notification-service if exists
Fetching role and secrets...
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://consul.service.consul:8500/v1/kv/cluster/apps/notification-service/auth/role_id
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://consul.service.consul:8500/v1/kv/cluster/apps/notification-service/auth/secret_id
role_id =
secret_id =
Generating secrets template file...
consul {
address = "consul.service.consul:8500"
ssl {
enabled = true
ca_cert = "/tmp/ca.crt.pem"
}
}
vault {
address = "https://vault.service.consul:8200"
token = ""
renew_token = false
ssl {
ca_cert = "/tmp/ca.crt.pem"
}
}
secret {
no_prefix = true
path = "secret/notification-service"
}
Continuing with envconsul based startup...
root@nyc-consul-server:/home/vagrant# nomad logs -stderr e2716db5 notification-api
2020/05/21 18:19:41.092884 [WARN] (view) vault.read(secret/notification-service): vault.read(secret/notification-service): Get https://vault.service.consul:8200/v1/secret/notification-service: dial tcp: lookup vault.service.consul on 8.8.8.8:53: no such host (retry attempt 1 after "250ms")
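That log line shows the container resolving vault.service.consul against 8.8.8.8, which knows nothing about the .consul zone. A common fix (a guess on my part, not something confirmed in this thread) is to forward the .consul zone to the local Consul agent's DNS port from the resolver the containers actually use, e.g. with dnsmasq on the host:

```shell
# Sketch: forward .consul lookups to the Consul agent's DNS port (8600).
# Assumes dnsmasq is the host resolver; adjust the agent IP to your setup.
echo 'server=/consul/172.20.20.21#8600' | sudo tee /etc/dnsmasq.d/10-consul
sudo systemctl restart dnsmasq
```

This would explain why the same lookups succeed from the host (which presumably has such forwarding configured) but fail inside bridge-networked containers.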
Now I know why my health checks were failing the first time; so indeed they were working. Now I will try hardcoding the Consul / Vault IPs and see if this workaround gets things working.
And this is the result: my services started working.
root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad logs 2e8a5ebf notification-api
CONSUL_SCHEME=https
CONSUL_HTTP_SSL=true
CONSUL_IP=172.20.20.21
CONSUL_PORT=8500
SSL IS ENABLED
Setting consul token
Setting custom command as the startup command
Setting secrets for notification-service if exists
Fetching role and secrets...
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://172.20.20.21:8500/v1/kv/cluster/apps/notification-service/auth/role_id
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://172.20.20.21:8500/v1/kv/cluster/apps/notification-service/auth/secret_id
role_id = 1b745c14-e624-36de-1c51-1117d2a40174
secret_id = 2f90e3c9-1ec1-864c-2f54-e16b9b462ed6
Exporting VAULT_TOKEN
Vault request
curl --cacert /tmp/ca.crt.pem -s -X POST -d '{role_id:$role_id,secret_id:$secret_id}' https://172.20.20.11:8200/v1/auth/approle/login
Generating secrets template file...
consul {
address = "172.20.20.21:8500"
ssl {
enabled = true
ca_cert = "/tmp/ca.crt.pem"
}
}
vault {
address = "https://172.20.20.11:8200"
token = "s.9nj8zdSknTd3Hum59UYmGIAy"
renew_token = false
ssl {
ca_cert = "/tmp/ca.crt.pem"
}
}
secret {
no_prefix = true
path = "secret/notification-service"
}
Continuing with envconsul based startup...
is CONSUL_HTTP_SSL = true
Setting consul token
Checking cluster state - active or standby
........................Cluster is in active state
Proceeding with startup...
time="2020-05-21T18:45:16Z" level=info msg="--- Notification Service ---"
time="2020-05-21T18:45:16Z" level=info msg="Connecting to notification repository..."
time="2020-05-21T18:45:16Z" level=info msg="Connected to Notification Repository"
time="2020-05-21T18:45:16Z" level=info msg="Starting Notification Service now ..."
____ __
/ __/___/ / ___
/ _// __/ _ \/ _ \
/___/\__/_//_/\___/ v3.3.10-dev
High performance, minimalist Go web framework
https://echo.labstack.com
____________________________________O/_______
O\
⇨ http server started on [::]:3001
Unfortunately, my service connecting to the database in cluster 1 (sfo) through prepared queries couldn't make it:
Checking cluster state - active or standby
Cluster is in active state
Proceeding with startup...
time="2020-05-21T18:47:28Z" level=info msg="--- Payment Service ---"
time="2020-05-21T18:47:28Z" level=info msg="connecting to db ...."
and once again I had to hardcode the address in order for it to work:
time="2020-05-21T18:47:37Z" level=info msg="--- Payment Service ---"
time="2020-05-21T18:47:37Z" level=info msg="connecting to db ...."
time="2020-05-21T18:47:37Z" level=info msg="Connected to Payment Service DB"
time="2020-05-21T18:47:37Z" level=info msg="Connecting to payment repository..."
time="2020-05-21T18:47:37Z" level=info msg="Connected to Payment Repository"
time="2020-05-21T18:47:37Z" level=info msg="Starting Payment Service now ..."
____ __
/ __/___/ / ___
/ _// __/ _ \/ _ \
/___/\__/_//_/\___/ v3.3.10-dev
High performance, minimalist Go web framework
https://echo.labstack.com
____________________________________O/_______
O\
⇨ http server started on [::]:3000
So there is another problem discovered here: when there is a federated cluster with ACLs enabled, service discovery via the .consul domain doesn't work in my secondary datacenter.
So I will give it one last try with ACLs in allow mode to see whether the behavior is the same; if it is, I will be very disappointed not to be able to use the ACL system properly with Connect on a federated cluster.
any thoughts @shoenig ??
And voilà, I made Consul service mesh work with gateways and ACLs enabled, but with too much tweaking, and with service discovery that doesn't work in my second cluster (nyc).
These are my services in cluster 1 (sfo):
and these are my services in cluster 2 (nyc):
This is my job file deployed in cluster 1 (sfo):
job "cinemas" {
datacenters = ["sfo-ncv"]
region = "sfo-region"
type = "service"
group "booking-api" {
count = 1
network {
mode = "bridge"
port "http" {
static = 3002
to = 3002
}
port "healthcheck" {
to = -1
}
}
service {
name = "booking-api"
port = "http"
tags = ["cinemas-project"]
check {
name = "booking-api-health"
port = "healthcheck"
type = "http"
protocol = "http"
path = "/ping"
interval = "10s"
timeout = "3s"
expose = true
}
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "payment-api"
local_bind_port = 8080
}
upstreams {
destination_name = "notification-api"
local_bind_port = 8081
}
}
}
}
}
task "booking-api" {
driver = "docker"
config {
image = "crizstian/booking-service-go:v0.4"
}
env {
SERVICE_PORT = "3002"
DB_SERVERS = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
CONSUL_IP = "consul.service.consul"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
PAYMENT_URL = "http://${NOMAD_UPSTREAM_ADDR_payment_api}"
NOTIFICATION_URL = "http://${NOMAD_UPSTREAM_ADDR_notification_api}"
}
resources {
cpu = 50
memory = 50
}
}
}
group "mesh-gateway" {
count = 1
task "mesh-gateway" {
driver = "raw_exec"
config {
command = "consul"
args = [
"connect", "envoy",
"-mesh-gateway",
"-register",
"-service", "gateway-primary",
"-address", ":${NOMAD_PORT_proxy}",
"-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
"-admin-bind", "127.0.0.1:19005",
"-token", "daca8d74-b0de-05bf-cc23-e095244e514e",
"-deregister-after-critical", "5s",
"--",
"-l", "debug"
]
}
resources {
cpu = 100
memory = 100
network {
port "proxy" {
static = 8433
}
}
}
}
}
}
and this is my job file deployed in cluster 2 (nyc)
job "cinemas" {
datacenters = ["nyc-ncv"]
region = "nyc-region"
type = "service"
group "payment-api" {
count = 1
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "payment-api"
port = "3000"
check {
name = "payment-api-health"
port = "healthcheck"
type = "http"
protocol = "http"
path = "/ping"
interval = "5s"
timeout = "2s"
expose = true
}
connect {
sidecar_service {}
}
}
task "payment-api" {
driver = "docker"
config {
image = "crizstian/payment-service-go:v0.4"
}
env {
DB_SERVERS = "192.168.15.6:27017,192.168.15.6:27018,192.168.15.6:27019"
SERVICE_PORT = "3000"
CONSUL_IP = "172.20.20.21"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
VAULT_ADDR = "https://172.20.20.11:8200"
}
resources {
cpu = 50
memory = 50
}
}
}
group "notification-api" {
count = 1
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "notification-api"
port = "3001"
check {
name = "notification-api-health"
port = "healthcheck"
type = "http"
protocol = "http"
path = "/ping"
interval = "5s"
timeout = "2s"
expose = true
}
connect {
sidecar_service {}
}
}
task "notification-api" {
driver = "docker"
config {
image = "crizstian/notification-service-go:v0.4"
}
env {
SERVICE_PORT = "3001"
CONSUL_IP = "172.20.20.21"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
VAULT_ADDR = "https://172.20.20.11:8200"
}
resources {
cpu = 50
memory = 50
}
}
}
group "mesh-gateway" {
count = 1
task "mesh-gateway" {
driver = "raw_exec"
config {
command = "consul"
args = [
"connect", "envoy",
"-mesh-gateway",
"-register",
"-service", "gateway-secondary",
"-address", ":${NOMAD_PORT_proxy}",
"-wan-address", "172.20.20.21:${NOMAD_PORT_proxy}",
"-admin-bind", "127.0.0.1:19005",
"-token", "daca8d74-b0de-05bf-cc23-e095244e514e",
"-deregister-after-critical", "5s",
]
}
resources {
cpu = 100
memory = 100
network {
port "proxy" {
static = 8433
}
}
}
}
}
}
I have also deployed the following configurations:
root@sfo-consul-server:/vagrant/provision/terraform/tf_cluster/primary# consul config list -kind service-defaults
booking-service
notification-api
payment-api
root@sfo-consul-server:/vagrant/provision/terraform/tf_cluster/primary# consul config list -kind proxy-defaults
global
They pretty much share the same Consul central config; I deployed this with the Terraform Consul provider
variable "service_defaults_apps" {
default = [
{
name = "booking-service"
mesh_resolver = "local"
},
{
name = "notification-api"
mesh_resolver = "local"
service_resolver = {
DefaultSubset = "v1"
Subsets = {
"v1" = {
Filter = "Service.Meta.version == v1"
}
"v2" = {
Filter = "Service.Meta.version == v2"
}
}
Failover = {
"*" = {
Datacenters = ["nyc"]
}
}
}
},
{
name = "payment-api"
mesh_resolver = "local"
service_resolver = {
Failover = {
"*" = {
Datacenters = ["nyc"]
}
}
}
},
{
name = "count-dashboard"
mesh_resolver = "local"
},
{
name = "count-api"
mesh_resolver = "local"
}
]
}
variable "proxy_defaults" {
default = {
MeshGateway = {
Mode = "local"
}
}
}
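For reference, the `proxy_defaults` variable above should correspond to a Consul config entry roughly like the following when written directly with the Consul CLI (a sketch; the file name is hypothetical):

```hcl
# proxy-defaults.hcl (hypothetical file name)
# Apply with: consul config write proxy-defaults.hcl
Kind = "proxy-defaults"
Name = "global"
MeshGateway {
  Mode = "local"
}
```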
variable "enable_service_defaults" {
default = false
}
variable "app_config_services" {
default = []
}
resource "consul_config_entry" "service-defaults" {
count = var.enable_service_defaults && length(var.app_config_services) > 0 ? length(var.app_config_services) : 0
name = var.app_config_services[count.index].name
kind = "service-defaults"
config_json = jsonencode({
Protocol = "http"
MeshGateway = {
Mode = var.app_config_services[count.index].mesh_resolver
}
})
}
The services still don't appear in the dropdown menu in either datacenter, so I had to enable traffic with a star -> star = allow intention.
So I think we are almost there in getting Consul working properly; now my problem is that in cluster 2 (nyc) service discovery is not working, which is a huge problem, since that is the principal thing Consul offers us.
Thanks for sharing that @Crizstian! Looks like the big catch was "service-defaults" ?
Can someone from Hashicorp comment on all of the above? This is a huge feature that's being highlighted but seems like it's extremely easy to get set up incorrectly and also that there are gaps in the documentation. Would be good to get some sort of official response on what should be happening here.
As I read the docs, the proxy-defaults entry should be enough.
Perhaps @nickethier or @lkysow from HashiCorp have any comments about this?
@Crizstian - I decided to take Nomad out of the mix and run all of this natively on Consul - no dice, even following this tutorial:
https://learn.hashicorp.com/consul/developer-mesh/connect-gateways
If I add a service-resolver to the config, I can get this to work too.
I verified if I shut the gateway down in the primary cluster, things stop working downstream (so I know it's actually using the gateway), and when I start the gateway up again then it all works.
WHEW!
kind = "service-resolver"
name = "database-proxy"
redirect {
  service    = "database"
  datacenter = "primary-cluster"
}
I have tried everything from scratch, doing everything step by step; since I had done this before, I had almost everything automated, so these are my steps and my findings.
1.- create two Consul/Nomad clusters
2.- bootstrap the primary datacenter
3.- set the default and replication tokens to the primary Consul root token
4.- configure the secondary cluster; federate the second cluster with retry_join_wan and set the default and replication tokens to the primary root token
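As an aside on step 3: the Consul docs suggest creating a dedicated replication token in the primary datacenter rather than reusing the bootstrap (root) token. A hedged sketch of the documented replication policy for Connect-enabled clusters (the file name is hypothetical):

```hcl
# replication-policy.hcl (hypothetical file name)
# Create with:
#   consul acl policy create -name replication -rules @replication-policy.hcl
#   consul acl token create -description "replication token" -policy-name replication
acl      = "write"
operator = "write"
service_prefix "" {
  policy     = "read"
  intentions = "read"
}
```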
Up to here everything looked OK.
5.- deployed my frontend services in the primary cluster
6.- deployed, in the same job, the primary mesh gateway
First problem found here: I deploy my mesh gateway using Nomad as another service with the raw_exec
driver like the following:
group "mesh-gateway" {
count = 1
task "mesh-gateway" {
driver = "raw_exec"
config {
command = "consul"
args = [
"connect", "envoy",
"-mesh-gateway",
"-register",
"-service", "gateway-primary",
"-address", ":${NOMAD_PORT_proxy}",
"-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
"-admin-bind", "127.0.0.1:19005",
"-token", "c6759f14-1005-675c-1db6-18132ada0a39",
"-deregister-after-critical", "5s",
"--",
"-l", "debug"
]
}
resources {
network {
port "proxy" {
static = 8433
}
}
}
}
}
How can I avoid hardcoding the token value? My token is set as an environment variable; I would like the raw_exec task to
read the token value from the host environment variables.
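One hedged option, assuming the token can be fetched from Vault (the Vault path and policy name below are hypothetical), is a Nomad `template` stanza that injects `CONSUL_HTTP_TOKEN` into the task environment; the `consul` CLI honors that variable, so the `-token` flag can be dropped:

```hcl
task "mesh-gateway" {
  driver = "raw_exec"

  # Sketch only: render the token into the task environment instead of
  # hardcoding it. "secret/data/consul" and "consul-token-read" are
  # hypothetical names, not taken from this setup.
  vault {
    policies = ["consul-token-read"]
  }
  template {
    data        = <<EOH
CONSUL_HTTP_TOKEN={{ with secret "secret/data/consul" }}{{ .Data.data.token }}{{ end }}
EOH
    destination = "secrets/consul.env"
    env         = true
  }

  config {
    command = "consul"
    args = [
      "connect", "envoy",
      "-mesh-gateway",
      "-register",
      "-service", "gateway-primary",
      "-address", ":${NOMAD_PORT_proxy}",
      # -token flag omitted; the CLI reads CONSUL_HTTP_TOKEN from the env
    ]
  }
}
```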
7.- I deployed my backend services in secondary cluster and the secondary mesh gateway with same process from primary.
Then, to configure the Consul central config entries that enable Consul Connect with federation, I have Terraform code that creates all the required Consul configs, so I can do this in an automated, version-controlled way.
Because I created all these configs with Terraform and saw that they were apparently created properly, I didn't question the Terraform Consul provider's behavior. After one week of struggling to make the service mesh with mesh gateways work, I discovered that the Terraform Consul provider doesn't work for Consul intentions or Consul central configs. First I discovered that Consul intentions weren't working, so I created an issue in the Terraform Consul provider repo: https://github.com/terraform-providers/terraform-provider-consul/issues/194
and today I discovered that Consul central configs are also not applied when created with Terraform, so I will report this in that issue as well.
8.- I had to create the Consul central configs using the Consul CLI
- consul config write /vagrant/provision/consul/central_config/mesh-gateway/notification-api-defaults.hcl
- consul config write /vagrant/provision/consul/central_config/mesh-gateway/booking-defaults.hcl
- consul config write /vagrant/provision/consul/central_config/mesh-gateway/payment-api-defaults.hcl
- consul config write /vagrant/provision/consul/central_config/failover/notification-api-resolver.hcl
- consul config write /vagrant/provision/consul/central_config/failover/payment-api-resolver.hcl
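The contents of those files aren't shown in this thread, but based on the Terraform resource above, a minimal service-defaults file would presumably look something like this (a sketch, not the actual file from this setup):

```hcl
# payment-api-defaults.hcl (contents are an assumption based on the
# Terraform "consul_config_entry" resource shown earlier)
Kind     = "service-defaults"
Name     = "payment-api"
Protocol = "http"
MeshGateway {
  Mode = "local"
}
```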
9.- create Consul intentions with the Consul CLI/UI, since this doesn't work with Terraform code either
But here I encountered another problem: I can't see all the services running in both datacenters in the Consul intention dropdown menu, so I had to create a star -> star intention with action allow.
After doing this I was finally able to set up all the Consul Connect features and mesh gateways, and my frontend services were able to communicate with the backend services deployed in the other datacenter.
Apparently there is only one issue left with Nomad, which is how my raw_exec
task for the Consul mesh gateway process can read my Consul token from the host environment variables.
There is one Consul issue as well, on the intentions side: Consul is not showing all my services, so I am not able to create intentions properly by specifying which service is allowed to talk to which service.
And finally there are a lot of problems with the Terraform Consul provider, which is not creating the configurations specified in the Terraform code; I am a bit disappointed because I am not able to automate these Consul configurations.
I hope this can get fixed soon, or perhaps I am missing some silly step that is leading me to these errors.
I hope I can continue with my testing for the new gateways for consul connect.
Hi @shoenig, here are my config files, as you requested in the Nomad office community hours on YouTube.
Consul Files consul.hcl.tmpl
data_dir = "/var/consul/config/"
log_level = "DEBUG"
datacenter = "{{ env "DATACENTER" }}"
primary_datacenter = "{{ env "PRIMARY_DATACENTER" }}"
ui = true
server = true
bootstrap_expect = {{ env "CONSUL_SERVERS" }}
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
ports {
grpc = 8502
https = {{ if eq (env "CONSUL_SSL") "true" }}{{ env "CONSUL_PORT" }}{{ else }}-1{{end}}
http = {{ if eq (env "CONSUL_SSL") "true" }}-1{{ else }}{{ env "CONSUL_PORT" }}{{end}}
}
advertise_addr = "{{ env "HOST_IP" }}"
advertise_addr_wan = "{{ env "HOST_IP" }}"
{{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
retry_join_wan = {{ env "HOST_LIST" }}
{{end}}
enable_central_service_config = true
connect {
enabled = true
}
acl = {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
# enable_token_persistence = true
{{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
enable_token_replication = true
{{end}}
tokens = {
default = "{{ env "CONSUL_HTTP_TOKEN" }}"
replication = "{{ env "CONSUL_HTTP_TOKEN" }}"
}
}
verify_incoming = false
verify_incoming_rpc = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_outgoing = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_server_hostname = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
auto_encrypt = {
allow_tls = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
}
{{ if eq (env "CONSUL_SSL") "true" }}
ca_file = "{{ env "CONSUL_CACERT" }}"
cert_file = "{{ env "CONSUL_CLIENT_CERT" }}"
key_file = "{{ env "CONSUL_CLIENT_KEY" }}"
{{end}}
encrypt = "{{ env "CONSUL_ENCRYPT_KEY" }}"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true
telemetry = {
dogstatsd_addr = "10.0.2.15:8125"
disable_hostname = true
}
This is the base file for both datacenters; for datacenter 1, consul-template renders it into the following
consul.hcl
data_dir = "/var/consul/config/"
log_level = "DEBUG"
datacenter = "sfo"
primary_datacenter = "sfo"
ui = true
server = true
bootstrap_expect = 1
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
ports {
grpc = 8502
https = 8500
http = -1
}
advertise_addr = "172.20.20.11"
advertise_addr_wan = "172.20.20.11"
enable_central_service_config = true
connect {
enabled = true
}
acl = {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
# enable_token_persistence = true
tokens = {
default = "45777651-66a1-4042-9479-cbcce7c775ac"
replication = "45777651-66a1-4042-9479-cbcce7c775ac"
}
}
verify_incoming = false
verify_incoming_rpc = true
verify_outgoing = true
verify_server_hostname = true
auto_encrypt = {
allow_tls = true
}
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
encrypt = "apEfb4TxRk3zGtrxxAjIkwUOgnVkaD88uFyMGHqKjIw="
encrypt_verify_incoming = true
encrypt_verify_outgoing = true
telemetry = {
dogstatsd_addr = "10.0.2.15:8125"
disable_hostname = true
}
This is the final render once I have done the Consul ACL bootstrap; before that, it is the same render but without the Consul token.
And for datacenter 2 this is the final render as well: consul.hcl
data_dir = "/var/consul/config/"
log_level = "DEBUG"
datacenter = "nyc"
primary_datacenter = "sfo"
ui = true
server = true
bootstrap_expect = 1
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
ports {
grpc = 8502
https = 8500
http = -1
}
advertise_addr = "172.20.20.21"
advertise_addr_wan = "172.20.20.21"
retry_join_wan = ["172.20.20.11","172.20.20.21"]
enable_central_service_config = true
connect {
enabled = true
}
acl = {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
# enable_token_persistence = true
enable_token_replication = true
tokens = {
default = "45777651-66a1-4042-9479-cbcce7c775ac"
replication = "45777651-66a1-4042-9479-cbcce7c775ac"
}
}
verify_incoming = false
verify_incoming_rpc = true
verify_outgoing = true
verify_server_hostname = true
auto_encrypt = {
allow_tls = true
}
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
encrypt = "apEfb4TxRk3zGtrxxAjIkwUOgnVkaD88uFyMGHqKjIw="
encrypt_verify_incoming = true
encrypt_verify_outgoing = true
telemetry = {
dogstatsd_addr = "10.0.2.15:8125"
disable_hostname = true
}
My nomad files are the following nomad.hcl.tmpl
bind_addr = "{{ env "HOST_IP" }}"
datacenter = "{{ env "DATACENTER" }}-ncv"
region = "{{ env "DATACENTER" }}-region"
data_dir = "/var/nomad/data"
log_level = "DEBUG"
leave_on_terminate = true
leave_on_interrupt = true
disable_update_check = true
client {
enabled = true
host_volume "ca-certificates" {
path = "/var/vault/config"
read_only = true
}
}
addresses {
rpc = "{{ env "HOST_IP" }}"
http = "{{ env "HOST_IP" }}"
serf = "{{ env "HOST_IP" }}"
}
advertise {
http = "{{ env "HOST_IP" }}:4646"
rpc = "{{ env "HOST_IP" }}:4647"
serf = "{{ env "HOST_IP" }}:4648"
}
consul {
address = "{{ env "HOST_IP" }}:8500"
client_service_name = "nomad-{{ env "DATACENTER" }}-client"
server_service_name = "nomad-{{ env "DATACENTER" }}-server"
auto_advertise = true
server_auto_join = true
client_auto_join = true
ca_file = "{{ env "CONSUL_CACERT" }}"
cert_file = "{{ env "CONSUL_CLIENT_CERT" }}"
key_file = "{{ env "CONSUL_CLIENT_KEY" }}"
ssl = {{ env "CONSUL_SSL" }}
verify_ssl = {{ env "CONSUL_SSL" }}
token = "{{ env "CONSUL_HTTP_TOKEN" }}"
}
server {
enabled = true
bootstrap_expect = {{ env "NOMAD_SERVERS" }}
}
tls {
http = true
rpc = true
ca_file = "{{ env "NOMAD_CACERT" }}"
cert_file = "{{ env "NOMAD_CLIENT_CERT" }}"
key_file = "{{ env "NOMAD_CLIENT_KEY" }}"
verify_https_client = false
verify_server_hostname = true
}
plugin "raw_exec" {
config {
enabled = true
}
}
and the rendered file for datacenter 1 is the following nomad.hcl
bind_addr = "172.20.20.11"
datacenter = "sfo-ncv"
region = "sfo-region"
data_dir = "/var/nomad/data"
log_level = "DEBUG"
leave_on_terminate = true
leave_on_interrupt = true
disable_update_check = true
client {
enabled = true
host_volume "ca-certificates" {
path = "/var/vault/config"
read_only = true
}
}
addresses {
rpc = "172.20.20.11"
http = "172.20.20.11"
serf = "172.20.20.11"
}
advertise {
http = "172.20.20.11:4646"
rpc = "172.20.20.11:4647"
serf = "172.20.20.11:4648"
}
consul {
address = "172.20.20.11:8500"
client_service_name = "nomad-sfo-client"
server_service_name = "nomad-sfo-server"
auto_advertise = true
server_auto_join = true
client_auto_join = true
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
ssl = true
verify_ssl = true
token = "45777651-66a1-4042-9479-cbcce7c775ac"
}
server {
enabled = true
bootstrap_expect = 1
}
tls {
http = true
rpc = true
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
verify_https_client = false
verify_server_hostname = true
}
plugin "raw_exec" {
config {
enabled = true
}
}
This is rendered after the Consul setup is done, and the second datacenter's Nomad file is nomad.hcl
bind_addr = "172.20.20.21"
datacenter = "nyc-ncv"
region = "nyc-region"
data_dir = "/var/nomad/data"
log_level = "DEBUG"
leave_on_terminate = true
leave_on_interrupt = true
disable_update_check = true
client {
enabled = true
host_volume "ca-certificates" {
path = "/var/vault/config"
read_only = true
}
}
addresses {
rpc = "172.20.20.21"
http = "172.20.20.21"
serf = "172.20.20.21"
}
advertise {
http = "172.20.20.21:4646"
rpc = "172.20.20.21:4647"
serf = "172.20.20.21:4648"
}
consul {
address = "172.20.20.21:8500"
client_service_name = "nomad-nyc-client"
server_service_name = "nomad-nyc-server"
auto_advertise = true
server_auto_join = true
client_auto_join = true
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
ssl = true
verify_ssl = true
token = "45777651-66a1-4042-9479-cbcce7c775ac"
}
server {
enabled = true
bootstrap_expect = 1
}
tls {
http = true
rpc = true
ca_file = "/var/vault/config/ca.crt.pem"
cert_file = "/var/vault/config/server.crt.pem"
key_file = "/var/vault/config/server.key.pem"
verify_https_client = false
verify_server_hostname = true
}
plugin "raw_exec" {
config {
enabled = true
}
}
Consul federation is done by the retry_join_wan
attribute in the second datacenter's Consul config file.
What I am testing is the Consul Connect mesh gateway, and I have 3 different services:
- 1 frontend service = booking service
- 2 backend services = payment and notification services
- 1 mesh gateway for each datacenter
my nomad jobs are cinemas.dc1.hcl
job "cinemas" {
datacenters = ["sfo-ncv"]
region = "sfo-region"
type = "service"
group "booking-api" {
count = 1
network {
mode = "bridge"
port "http" {
static = 3002
to = 3002
}
port "healthcheck" {
to = -1
}
}
service {
name = "booking-api"
port = "http"
tags = ["cinemas-project"]
check {
name = "booking-api-health"
port = "healthcheck"
type = "http"
protocol = "http"
path = "/ping"
interval = "10s"
timeout = "3s"
expose = true
}
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "payment-api"
local_bind_port = 8080
}
upstreams {
destination_name = "notification-api"
local_bind_port = 8081
}
}
}
}
}
task "booking-api" {
driver = "docker"
config {
image = "crizstian/booking-service-go:v0.4"
}
env {
SERVICE_PORT = "3002"
DB_SERVERS = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
CONSUL_IP = "consul.service.consul"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
PAYMENT_URL = "http://${NOMAD_UPSTREAM_ADDR_payment_api}"
NOTIFICATION_URL = "http://${NOMAD_UPSTREAM_ADDR_notification_api}"
}
resources {
cpu = 50
memory = 50
}
}
}
group "mesh-gateway" {
count = 1
task "mesh-gateway" {
driver = "raw_exec"
config {
command = "consul"
args = [
"connect", "envoy",
"-mesh-gateway",
"-register",
"-service", "gateway-primary",
"-address", ":${NOMAD_PORT_proxy}",
"-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
"-admin-bind", "127.0.0.1:19005",
"-token", "c6759f14-1005-675c-1db6-18132ada0a39",
"-deregister-after-critical", "5s",
"--",
"-l", "debug"
]
}
resources {
cpu = 100
memory = 100
network {
port "proxy" {
static = 8433
}
}
}
}
}
}
and cinemas.dc2.hcl
job "cinemas" {
datacenters = ["nyc-ncv"]
region = "nyc-region"
type = "service"
group "payment-api" {
count = 1
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "payment-api"
port = "3000"
check {
name = "payment-api-health"
port = "healthcheck"
type = "http"
protocol = "http"
path = "/ping"
interval = "5s"
timeout = "2s"
expose = true
}
connect {
sidecar_service {}
}
}
task "payment-api" {
driver = "docker"
config {
image = "crizstian/payment-service-go:v0.4"
}
env {
DB_SERVERS = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
SERVICE_PORT = "3000"
CONSUL_IP = "consul.service.consul"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
}
resources {
cpu = 50
memory = 50
}
}
}
group "notification-api" {
count = 1
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "notification-api"
port = "3001"
check {
name = "notification-api-health"
port = "healthcheck"
type = "http"
protocol = "http"
path = "/ping"
interval = "5s"
timeout = "2s"
expose = true
}
connect {
sidecar_service {}
}
}
task "notification-api" {
driver = "docker"
config {
image = "crizstian/notification-service-go:v0.4"
}
env {
SERVICE_PORT = "3001"
CONSUL_IP = "consul.service.consul"
CONSUL_SCHEME = "https"
CONSUL_HTTP_SSL = "true"
}
resources {
cpu = 50
memory = 50
}
}
}
group "mesh-gateway" {
count = 1
task "mesh-gateway" {
driver = "raw_exec"
config {
command = "consul"
args = [
"connect", "envoy",
"-mesh-gateway",
"-register",
"-service", "gateway-secondary",
"-address", ":${NOMAD_PORT_proxy}",
"-wan-address", "172.20.20.21:${NOMAD_PORT_proxy}",
"-admin-bind", "127.0.0.1:19005",
"-token", "c6759f14-1005-675c-1db6-18132ada0a39",
"-deregister-after-critical", "5s",
]
}
resources {
cpu = 100
memory = 100
network {
port "proxy" {
static = 8433
}
}
}
}
}
}
As you can see, in both datacenters I need to set my Consul token for the mesh gateway task, and I would like not to hardcode it; I would like to read it from my host env variables, or to understand why Nomad doesn't use the token set in its own configuration to register this service.
@Crizstian re
So this is one Consul issue as well, which is on the consul intentions part, consul is not showing all my services and I am not able to create consul intentions properly, by specifying which service to which service allowed to talk.
This is a known issue: https://github.com/hashicorp/consul/issues/7390 however you can just type in the name of the service in the other DCs into the dropdown and the intention will still work.
@lkysow I tried what you said, but I used the Terraform code and it worked, so I believe it is not a huge problem if we are using Terraform to create the intentions.
The only thing I see is that if my services are not yet registered in Consul and I create my intentions with Terraform first and then deploy my services, it doesn't work; but if I deploy my services first and then create my intentions, it works. Is this the expected behavior?
The only thing I see is that if my services are not yet registered in Consul and I create my intentions with Terraform first and then deploy my services, it doesn't work; but if I deploy my services first and then create my intentions, it works. Is this the expected behavior?
You can create intentions without a service of that name existing, so I don't think this is expected behaviour.
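For completeness, an intention created with the Terraform provider, as discussed above, would look roughly like this (the resource label is hypothetical; service names are taken from the jobs earlier in the thread):

```hcl
# Hypothetical sketch of a Terraform-managed intention
resource "consul_intention" "booking_to_payment" {
  source_name      = "booking-api"
  destination_name = "payment-api"
  action           = "allow"
}
```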
Nomad version
Output from
nomad version
Operating system and Environment details
Linux / ubuntu
Issue
Have a consul federated cluster with TLS and ACLs enabled
deploy a consul connect job // count dash can work
deploy one service in dc1 and the second service in dc2
dc1 can deploy everything as expected
dc2 needs a replication ACL token, and needs a token from Consul dc1 in order to work
and from here everything in dc2 asks for a token; Nomad no longer picks up the Consul token environment variable set on the host, and it doesn't register the services correctly
Job file (if appropriate)
Nomad Client logs (if appropriate)
Even the Consul health check I defined doesn't work either.
I also tried setting the Consul token at the job level, and I also tried setting the token like the following:
It looks as if Consul federation works properly, but the ACL system isn't working, or something is missing in my ACL config for a federated cluster.
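For comparison, the ACL stanza the Consul docs describe for a secondary datacenter separates the agent token from the replication token, rather than reusing the primary's bootstrap token for both (a sketch; the placeholder values are hypothetical):

```hcl
# Sketch of a secondary-DC ACL stanza per the Consul federation docs;
# token values are placeholders, not from this setup.
acl = {
  enabled                  = true
  default_policy           = "deny"
  down_policy              = "extend-cache"
  enable_token_replication = true
  tokens = {
    agent       = "<agent token created in the primary DC>"
    replication = "<token carrying the replication policy>"
  }
}
```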