hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Consul Connect with Consul ACLs enabled in a federated cluster: Nomad task doesn't get registered #8019

Open crizstian opened 4 years ago

crizstian commented 4 years ago

Nomad version

Output from nomad version

root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad version
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# consul version
Consul v1.7.3
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Operating system and Environment details

Linux / ubuntu

Issue

I have a federated Consul cluster with TLS and ACLs enabled.

Deploy a Consul Connect job (the count-dash demo works).

Deploy one service in dc1 and the second service in dc2.

dc1 can deploy everything as expected.

dc2 needs an ACL replication token, and needs the dc1 Consul token in order to work.

From here on, everything in dc2 asks for a token: Nomad no longer gets/reads the Consul token environment variable that is set on the host, and it doesn't register the services correctly.

Job file (if appropriate)

job "cinemas" {

  datacenters = ["nyc-ncv"]
  region      = "nyc-region"
  type        = "service"

  group "payment-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "payment-api"
      port = "3000"

      check {
        name     = "payment-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "5s"
        timeout  = "2s"
        expose   = true
      }

      connect {
        sidecar_service {}
      }
    }

    task "payment-api" {
      driver = "docker"

      config {
        image = "crizstian/payment-service-go:v0.4"
      }

      env {
        DB_SERVERS      = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
        SERVICE_PORT    = "3000"
        CONSUL_IP       = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }
}

Nomad Client logs (if appropriate)


root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad status cinemas
ID            = cinemas
Name          = cinemas
Submit Date   = 2020-05-19T19:26:35Z
Type          = service
Priority      = 50
Datacenters   = nyc-ncv
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group   Queued  Starting  Running  Failed  Complete  Lost
payment-api  0       0         1        0       0         0

Latest Deployment
ID          = 97bc5bdc
Status      = running
Description = Deployment is running

Deployed
Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline
payment-api  1        1       0        0          2020-05-19T19:36:35Z

Allocations
ID        Node ID   Task Group   Version  Desired  Status   Created  Modified
788f0165  5fd6be7a  payment-api  0        run      running  12s ago  9s ago
root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad logs 788f0165
Allocation "788f0165" is running the following tasks:
  * payment-api
  * connect-proxy-payment-api

Please specify the task.
root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad logs 788f0165 connect-proxy-payment-api
root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad logs -stderr 788f0165 connect-proxy-payment-api
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:238] initializing epoch 0 (hot restart version=disabled)
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:240] statically linked extensions:
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:242]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:245]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:248]   filters.listener: envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:251]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:253]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:255]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.zipkin
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:258]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:261]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2020-05-19 19:26:38.228][1][info][main] [source/server/server.cc:267] buffer implementation: old (libevent)
[2020-05-19 19:26:38.233][1][warning][misc] [source/common/protobuf/utility.cc:199] Using deprecated option 'envoy.api.v2.Cluster.hosts' from file cds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2020-05-19 19:26:38.236][1][info][main] [source/server/server.cc:322] admin address: 127.0.0.1:19001
[2020-05-19 19:26:38.236][1][info][main] [source/server/server.cc:432] runtime: layers:
  - name: static_layer
    static_layer:
      envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1: true
      envoy.deprecated_features:envoy.api.v2.Cluster.tls_context: true
      envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name: true
[2020-05-19 19:26:38.237][1][warning][runtime] [source/common/runtime/runtime_impl.cc:497] Skipping unsupported runtime layer: name: "static_layer"
static_layer {
  fields {
    key: "envoy.deprecated_features:envoy.api.v2.Cluster.tls_context"
    value {
      bool_value: true
    }
  }
  fields {
    key: "envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name"
    value {
      bool_value: true
    }
  }
  fields {
    key: "envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1"
    value {
      bool_value: true
    }
  }
}

[2020-05-19 19:26:38.237][1][info][config] [source/server/configuration_impl.cc:61] loading 0 static secret(s)
[2020-05-19 19:26:38.237][1][info][config] [source/server/configuration_impl.cc:67] loading 1 cluster(s)
[2020-05-19 19:26:38.244][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:144] cm init: initializing cds
[2020-05-19 19:26:38.245][1][info][config] [source/server/configuration_impl.cc:71] loading 0 listener(s)
[2020-05-19 19:26:38.246][1][info][config] [source/server/configuration_impl.cc:96] loading tracing configuration
[2020-05-19 19:26:38.246][1][info][config] [source/server/configuration_impl.cc:116] loading stats sink configuration
[2020-05-19 19:26:38.246][1][info][main] [source/server/server.cc:516] starting main dispatch loop
[2020-05-19 19:26:38.252][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 7, permission denied
[2020-05-19 19:26:38.252][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:148] cm init: all clusters initialized
[2020-05-19 19:26:38.252][1][info][main] [source/server/server.cc:500] all clusters initialized. initializing init manager
[2020-05-19 19:26:38.638][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 7, permission denied
[2020-05-19 19:26:38.638][1][info][config] [source/server/listener_manager_impl.cc:761] all dependencies initialized. starting workers
[2020-05-19 19:26:39.002][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 7, permission denied
[2020-05-19 19:26:41.349][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 7, permission denied
[2020-05-19 19:26:43.117][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 7, permission denied
[2020-05-19 19:26:57.793][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 7, permission denied
[Screenshot 2020-05-19 14:34:41]

Even the Consul health check that is defined doesn't work:

[Screenshot 2020-05-19 14:38:16]

I also tried setting the Consul token at the job level, and I also tried to set the token like the following:

      connect {
        sidecar_service {
          proxy {
            config {
              token = "........"
            }
          }
        }
      }

It looks as if Consul federation works properly, but the ACL system isn't working properly, or something is missing in my ACL config for a federated cluster.
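For reference, the usual place for Nomad's own Consul ACL token is the Nomad agent configuration rather than the job file. A minimal sketch (the token value and paths are placeholders, not from this issue):

# Nomad agent (client and server) configuration
consul {
  address = "127.0.0.1:8500"
  ssl     = true
  ca_file = "/var/vault/config/ca.crt.pem"
  token   = "TOKEN_HERE"   # a Consul token with service:write for the services being registered
}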

idrennanvmware commented 4 years ago

https://github.com/hashicorp/consul/issues/7906

That was what we found - we don't have TLS enabled, but the behavior you describe is the same. It's a bit strange, as Consul "looks" like it's working, but it isn't without the replication token set in the agent.

Does this help you resolve the issue @Crizstian ?

crizstian commented 4 years ago

@idrennanvmware that didn't work for me, or maybe I am missing something; I am still getting errors after federation with ACLs enabled:

    2020-05-20T16:38:54.209Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=
    2020-05-20T16:38:54.230Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/vault/core/lock from=172.20.20.21:43596 error="ACL not found"
    2020-05-20T16:38:54.230Z [DEBUG] agent.http: Request finished: method=GET url=/v1/kv/vault/core/lock from=172.20.20.21:43596 latency=399.873µs
    2020-05-20T16:38:54.234Z [ERROR] agent.http: Request error: method=GET url=/v1/catalog/service/vault?stale= from=172.20.20.21:43596 error="ACL not found"
    2020-05-20T16:38:54.235Z [DEBUG] agent.http: Request finished: method=GET url=/v1/catalog/service/vault?stale= from=172.20.20.21:43596 latency=2.821513ms
    2020-05-20T16:38:54.236Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/register from=172.20.20.21:43596 error="ACL not found"
    2020-05-20T16:38:54.237Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/agent/service/register from=172.20.20.21:43596 latency=205.153µs

Steps I followed are:

1.- Spin up two datacenters
2.- Bootstrap the ACL system in both DCs
3.- Update the secondary DC config, setting the primary_datacenter value
4.- Create the ACL replication policy, following the documentation
5.- Set the replication agent token in the secondary DC (a sketch of steps 4-5 follows below)
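For steps 4 and 5, a rough sketch of the Consul CLI commands (assuming the replication policy rules from the Consul ACL docs and a bootstrap token already exported as CONSUL_HTTP_TOKEN; file and token names are illustrative):

# replication-policy.hcl
acl = "write"
operator = "write"
service_prefix "" {
  policy     = "read"
  intentions = "read"
}

# create the policy and token against the primary DC
consul acl policy create -name replication -rules @replication-policy.hcl
consul acl token create -description "replication token" -policy-name replication

# set the resulting SecretID as the replication token on each secondary server
consul acl set-agent-token replication <replication-token-secret-id>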

Previously I had done

1.- Spin up 2 datacenters
2.- Automatically federate the clusters with retry_join_wan
3.- Bootstrap dc1
4.- Set the dc1 root token as the agent token for dc2

But I am getting the same failures; probably I am missing something silly, or it is Consul that is not working with federation with ACLs enabled.

idrennanvmware commented 4 years ago

Hi @Crizstian - let me give you our steps (we did NOT have success with your approach of bootstrapping Consul in 2 places).

  1. Set up Consul (and bootstrap on the primary cluster). At the end of this you should have ACL tokens, and the replication token set to the value of your token with the right permissions. Let's call this "primary-cluster". You will need to have ensured that the following are set:

     "enable_central_service_config": true
     "tokens": {
       "replication": "TOKEN_HERE"
     }
     "datacenter": "primary-cluster"
     "primary_datacenter": "primary-cluster"

  2. Set up Consul on cluster 2, BUT this time make sure that your gossip key AND your ACL token match the output from step 1, AND make sure that you set "primary_datacenter" in your config to the cluster you set up in step 1. So in our case "cluster_name" is gateway-cluster, and the primary datacenter is "primary-cluster":

     "enable_central_service_config": true
     "tokens": {
       "replication": "TOKEN_HERE"
     }
     "datacenter": "gateway-cluster"
     "primary_datacenter": "primary-cluster"
     "retry_join_wan": ["primary-cluster-csl1", "primary-cluster-csl2"]

That's the way we have been able to reliably set ours up.

The only difference between what you are doing and what we are doing is (as I understand it) that we don't have TLS set up on the agents yet in these scenarios.

EDIT: Don't forget to set your intentions in Consul as well (I do star->star for testing)
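A star->star allow intention can also be created from the CLI instead of the UI; a one-liner sketch (assumes a token with intention write permission is set in the environment):

consul intention create -allow '*' '*'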

idrennanvmware commented 4 years ago

@Crizstian The above is as far as we've gotten. We've had ZERO success having a service in one DC talk to another DC (for example, the CountDash dashboard in DC1 talking to the CountDash API in DC2).

@tgross - do you have any example jobs for Mesh Gateway testing? To rule out PEBKAC on this end :)

edit: I've been working on this all day and gotten nowhere. I still suspect something isn't quite right either on the federation side or the mesh side, but I ended up in a scenario where I couldn't even unregister the service in the secondary cluster.

crizstian commented 4 years ago

I have made some progress. I had to do the following:

1.- Spin up my cluster 1 (called sfo)
2.- Bootstrap ACLs in cluster 1
3.- Set the default and replication token; for simplicity I just set the Consul root token for both

This is my config file for any Consul server:

data_dir = "/var/consul/config/"
log_level = "DEBUG"

datacenter         = "{{ env "DATACENTER" }}"
primary_datacenter = "{{ env "PRIMARY_DATACENTER" }}"

ui     = true
server = true
bootstrap_expect = {{ env "CONSUL_SERVERS" }}

bind_addr   = "0.0.0.0"
client_addr = "0.0.0.0"

ports {
  grpc  = 8502
  https = {{ if eq (env "CONSUL_SSL") "true" }}{{ env "CONSUL_PORT" }}{{ else }}-1{{end}}
  http  = {{ if eq (env "CONSUL_SSL") "true" }}-1{{ else }}{{ env "CONSUL_PORT" }}{{end}}
}

advertise_addr     = "{{ env "HOST_IP" }}"
advertise_addr_wan = "{{ env "HOST_IP" }}"

{{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
retry_join_wan = {{ env "HOST_LIST" }}
{{end}}

enable_central_service_config = true

connect {
  enabled = true
}

acl = {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  # enable_token_persistence = true
  {{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
  enable_token_replication = true
  {{end}}
  tokens = {
    default     = "{{ env "CONSUL_HTTP_TOKEN" }}"
    replication = "{{ env "CONSUL_HTTP_TOKEN" }}"
  }
}

verify_incoming        = false
verify_incoming_rpc    = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_outgoing        = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_server_hostname = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}

auto_encrypt = {
  allow_tls = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
}

{{ if eq (env "CONSUL_SSL") "true" }}
ca_file    = "{{ env "CONSUL_CACERT" }}"
cert_file  = "{{ env "CONSUL_CLIENT_CERT" }}"
key_file   = "{{ env "CONSUL_CLIENT_KEY" }}"
{{end}}

encrypt = "{{ env "CONSUL_ENCRYPT_KEY" }}"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

telemetry = {
  dogstatsd_addr   = "10.0.2.15:8125"
  disable_hostname = true
}

In my case, once the config file is rendered it looks like this:

root@sfo-consul-server:/vagrant/provision/terraform/tf_cluster/primary# cat /var/consul/config/consul.hcl
data_dir = "/var/consul/config/"
log_level = "DEBUG"

datacenter         = "sfo"
primary_datacenter = "sfo"

ui     = true
server = true
bootstrap_expect = 1

bind_addr   = "0.0.0.0"
client_addr = "0.0.0.0"

ports {
  grpc  = 8502
  https = 8500
  http  = -1
}

advertise_addr     = "172.20.20.11"
advertise_addr_wan = "172.20.20.11"

#

enable_central_service_config = true

connect {
  enabled = true
}

acl = {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  # enable_token_persistence = true

  tokens = {
    default     = "daca8d74-b0de-05bf-cc23-e095244e514e"
    replication = "daca8d74-b0de-05bf-cc23-e095244e514e"
  }
}

verify_incoming        = false
verify_incoming_rpc    = true
verify_outgoing        = true
verify_server_hostname = true

auto_encrypt = {
  allow_tls = true
}

ca_file    = "/var/vault/config/ca.crt.pem"
cert_file  = "/var/vault/config/server.crt.pem"
key_file   = "/var/vault/config/server.key.pem"

encrypt = "apEfb4TxRk3zGtrxxAjIkwUOgnVkaD88uFyMGHqKjIw="
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

telemetry = {
  dogstatsd_addr   = "10.0.2.15:8125"
  disable_hostname = true
}

Then I spun up my second DC (called nyc) using the same template file mentioned above, setting the default and replication token to the same value from cluster 1 (sfo).

This allowed me to federate the clusters correctly.

After that I could deploy my services in cluster 1 (sfo), and no service asked for the Consul token.

Then I deployed the same services in cluster 2 (nyc); no sidecar service asked for a Consul token, and they got registered correctly.

Look at the following example:

root@nyc-consul-server:/home/vagrant# nomad logs -stderr 97e14028 connect-proxy-payment-api
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:238] initializing epoch 0 (hot restart version=disabled)
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:240] statically linked extensions:
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:242]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:245]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:248]   filters.listener: envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:251]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:253]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:255]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.zipkin
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:258]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:261]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2020-05-21 03:19:17.954][1][info][main] [source/server/server.cc:267] buffer implementation: old (libevent)
[2020-05-21 03:19:17.955][1][warning][misc] [source/common/protobuf/utility.cc:199] Using deprecated option 'envoy.api.v2.Cluster.hosts' from file cds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2020-05-21 03:19:17.960][1][info][main] [source/server/server.cc:322] admin address: 127.0.0.1:19001
[2020-05-21 03:19:17.960][1][info][main] [source/server/server.cc:432] runtime: layers:
  - name: static_layer
    static_layer:
      envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1: true
      envoy.deprecated_features:envoy.api.v2.Cluster.tls_context: true
      envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name: true
[2020-05-21 03:19:17.960][1][warning][runtime] [source/common/runtime/runtime_impl.cc:497] Skipping unsupported runtime layer: name: "static_layer"
static_layer {
  fields {
    key: "envoy.deprecated_features:envoy.api.v2.Cluster.tls_context"
    value {
      bool_value: true
    }
  }
  fields {
    key: "envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name"
    value {
      bool_value: true
    }
  }
  fields {
    key: "envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1"
    value {
      bool_value: true
    }
  }
}

[2020-05-21 03:19:17.960][1][info][config] [source/server/configuration_impl.cc:61] loading 0 static secret(s)
[2020-05-21 03:19:17.960][1][info][config] [source/server/configuration_impl.cc:67] loading 1 cluster(s)
[2020-05-21 03:19:17.963][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:144] cm init: initializing cds
[2020-05-21 03:19:17.965][1][info][config] [source/server/configuration_impl.cc:71] loading 0 listener(s)
[2020-05-21 03:19:17.965][1][info][config] [source/server/configuration_impl.cc:96] loading tracing configuration
[2020-05-21 03:19:17.965][1][info][config] [source/server/configuration_impl.cc:116] loading stats sink configuration
[2020-05-21 03:19:17.965][1][info][main] [source/server/server.cc:516] starting main dispatch loop
[2020-05-21 03:19:17.974][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:489] add/update cluster local_app during init
[2020-05-21 03:19:17.974][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:148] cm init: all clusters initialized
[2020-05-21 03:19:17.974][1][info][main] [source/server/server.cc:500] all clusters initialized. initializing init manager
[2020-05-21 03:19:17.977][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 03:19:17.977][1][info][config] [source/server/listener_manager_impl.cc:761] all dependencies initialized. starting workers
[2020-05-21 03:22:56.703][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 03:27:38.116][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 03:34:17.981][1][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain
[2020-05-21 03:36:09.007][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 04:14:04.376][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'
[2020-05-21 04:15:07.144][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:30206'

But I had another issue: my health checks were showing an error, so in order to continue I had to comment them out so I could have my services and sidecar services up and running.

Only the mesh gateway, which is deployed in both clusters, behaved differently: in cluster 1 (sfo) it didn't ask me for a token, but in cluster 2 it asked for the token; without it, it couldn't get deployed and registered in Consul.

So my Nomad job looks like the following in cluster 2 (nyc):

job "cinemas" {

  datacenters = ["nyc-ncv"]
  region      = "nyc-region"
  type        = "service"

  group "payment-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "payment-api"
      port = "3000"

      // check {
      //   name     = "payment-api-health"
      //   port     = "healthcheck"
      //   type     = "http"
      //   protocol = "http"
      //   path     = "/ping"
      //   interval = "5s"
      //   timeout  = "2s"
      //   expose   = true
      // }

      connect {
        sidecar_service {}
      }
    }

    task "payment-api" {
      driver = "docker"

      config {
        image = "crizstian/payment-service-go:v0.4"
      }

      env {
        DB_SERVERS      = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
        SERVICE_PORT    = "3000"
        CONSUL_IP       = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "notification-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "notification-api"
      port = "3001"

      // check {
      //   name     = "notification-api-health"
      //   port     = "healthcheck"
      //   type     = "http"
      //   protocol = "http"
      //   path     = "/ping"
      //   interval = "5s"
      //   timeout  = "2s"
      //   expose   = true
      // }

      connect {
        sidecar_service {}
      }
    }

    task "notification-api" {
      driver = "docker"

      config {
        image   = "crizstian/notification-service-go:v0.4"
      }

      env {
        SERVICE_PORT    = "3001"
        CONSUL_IP       = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "mesh-gateway" {
    count = 1

    task "mesh-gateway" {
      driver = "raw_exec"

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-http-addr", "172.20.20.21:8500",
          "-grpc-addr", "172.20.20.21:8502",
          "-wan-address", "172.20.20.21:${NOMAD_PORT_proxy}",
          "-address", "172.20.20.21:${NOMAD_PORT_proxy}",
          "-bind-address", "default=172.20.20.21:${NOMAD_PORT_proxy}",
          "-token", "daca8d74-b0de-05bf-cc23-e095244e514e",
          "--",
          "-l", "debug"
        ]
      }

      resources {
        cpu    = 100
        memory = 100

        network {
          port "proxy" {
            static = 8433
          }
        }
      }
    }
  }
}

Continuing with my mesh gateway configuration and testing, I hit another error: I couldn't register my Consul intentions, since my services don't appear in the dropdown menu. In cluster 1 (sfo) the UI only displays the services running in cluster 1 (sfo) when choosing an intention's source and destination; the behavior is the same in cluster 2 (nyc), where I can't see any service running in cluster 1 when creating an intention.

So now I am blocked on this, and I need to create these intentions in order for my services to communicate over the service mesh. I have tried creating an allow-all to allow-all intention, but this didn't work.

In previous Consul versions I was able to create intentions and see the services in both clusters; something has changed here, and now I am not able to.

idrennanvmware commented 4 years ago

Looks like we're both stuck at the same place now. We are using the Nicholas Jackson Consul-Envoy Proxy to register and unregister gateways via Nomad because we don't have access to the envoy binary on our images BUT the gateways do not unregister from Consul when we use this AND once we have registered a gateway we can't unregister it (manually) from Consul in our secondary cluster.

Regarding intentions @Crizstian - we set our intentions before doing this step: https://learn.hashicorp.com/consul/developer-mesh/connect-gateways#configure-sidecar-proxies-to-use-gateways

It seems that if we do that step where we set the proxy-defaults, then after that we start seeing token ACL issues when navigating around the Consul UI between datacenters (note this is just anecdotal, as we're still trying to get to the bottom of why it suddenly starts giving errors in the UI).

We really are stuck at this stage and it doesn't seem like we're doing anything that is unique or funky so I'm at a loss on how to move forward. I'll keep working on this another day or 2 but we may have to look at alternative ways to achieve what we need if we can't get past this. I'll share what we find here, and @Crizstian thank you for sharing your steps in detail - seems we're on similar paths

idrennanvmware commented 4 years ago

@Crizstian - have you tried setting your intentions to star -> star and "Allow"? It seems if you set this in one DC it will show in the other too. I see the same issue you do where services in cluster A don't show as available in Cluster B for intentions. I've just done star->star for now

EDIT: I just used the command line and was able to make an intention between services in each datacenter, BUT it still doesn't work. See step 7 below for the output from the API call.

So, a summary of the steps:

  1. Create a primary-cluster. Bootstrap ACLs. For simplicity the master token is used for everything. This cluster also has Nomad (not federated, but using Consul).

  2. Create a gateway-cluster that uses the gossip key from primary-cluster, and the ACL (master) token from primary-cluster. It has its own datacenter name, and the primary_datacenter name of the primary-cluster. This cluster also has Nomad (not federated, but using Consul).

  3. Verify that federation is happening per "consul members -wan" AND that replication is working by running the following curl on the gateway-cluster: curl --request GET http://127.0.0.1:8500/v1/acl/replication

  4. Run the mesh gateway jobs. In primary-cluster: consul connect envoy -mesh-gateway -register -service "gateway-primary" -address ":23000" -wan-address "HOSTIP:23000" -admin-bind 127.0.0.1:19005 -token=MASTER_TOKEN

     In gateway-cluster: consul connect envoy -mesh-gateway -register -service "gateway-secondary" -address ":23000" -wan-address "HOSTIP:23000" -admin-bind 127.0.0.1:19005 -token=MASTER_TOKEN

  5. Ensure intentions are set to ALLOW star-star (note: for this example I also explicitly made another across DCs with consul intention create api database).

  6. Run the database job in primary-cluster:

    job "database-backend" { datacenters = ["primary-cluster"] type = "service"

    group "database" { count = 1

    network {
        mode = "bridge"
    }
    
    service {
        name = "database"
        port = "25432"
    
        connect {
            sidecar_service {}
        }
    }
    
    task "database" {
        driver = "docker"
    
        config {
            image = "nicholasjackson/fake-service:v0.9.0"
        }
    
        env {
            NAME = "database"
            MESSAGE = "ok"
            LISTEN_ADDR = "0.0.0.0:25432"
            TIMING_VARIANCE = "25"
            HTTP_CLIENT_KEEP_ALIVES = "true"
        }
    
        resources {
            cpu    = 100
            memory = 256
        }
    }

    } }

  7. Run the api job in gateway-cluster:

job "api-frontend" { datacenters = ["gateway-cluster"] type = "service" group "api" { count = 1 network { mode = "bridge" port "http" { to = 9090 static = 20345 } } service { name = "api" port = "9090" connect { sidecar_service { proxy { upstreams { destination_name = "database" local_bind_port = 25432 } } } } } task "api" { driver = "docker" config { image = "nicholasjackson/fake-service:v0.9.0" } env { NAME = "api" MESSAGE = "ok" LISTEN_ADDR = "0.0.0.0:9090" UPSTREAM_URIS = "http://localhost:25432" TIMING_VARIANCE = "25" HTTP_CLIENT_KEEP_ALIVES = "true" } resources { cpu = 100 memory = 256 } } } }

The output from the API call in step 7 is as described:

{ "name": "api", "uri": "/", "type": "HTTP", "ip_addresses": [ "172.26.64.3" ], "start_time": "2020-05-21T14:37:17.297829", "end_time": "2020-05-21T14:37:17.299712", "duration": "1.883618ms", "Headers": null, "upstream_calls": [ { "uri": "http://localhost:25432", "Headers": null, "code": -1, "error": "Error communicating with upstream service: Get http://localhost:25432/: EOF" } ], "code": 500 }

idrennanvmware commented 4 years ago

Just a small addition: the countdash app works on both the gateway-cluster and the primary-cluster (as long as the call doesn't try to get proxied over the mesh gateway). So basically, within the context of each DC the mesh is working fine, but DC to DC just doesn't work.

crizstian commented 4 years ago

I have found another issue. I have deployed my frontend services in cluster 1 (sfo) and my backend services in cluster 2 (nyc), and in cluster 1 (sfo) my containers can ping any .consul service address.

But in cluster 2 (nyc) the behavior is different: inside the containers I am no longer able to ping / dig / make any kind of request to services with the .consul domain, which means my service discovery doesn't work.

Here is the output of the test made in cluster 2 (nyc):

root@nyc-consul-server:/home/vagrant# nomad plan /vagrant/deployment-files/mesh-gw/cinemas.dc2.hcl
+ Job: "cinemas"
+ Task Group: "mesh-gateway" (1 create)
  + Task: "mesh-gateway" (forces create)

+ Task Group: "notification-api" (1 create/destroy update)
  + Task: "connect-proxy-notification-api" (forces create)
  + Task: "notification-api" (forces create)

+ Task Group: "payment-api" (1 create/destroy update)
  + Task: "connect-proxy-payment-api" (forces create)
  + Task: "payment-api" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 /vagrant/deployment-files/mesh-gw/cinemas.dc2.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
root@nyc-consul-server:/home/vagrant# nomad job run -check-index 0 /vagrant/deployment-files/mesh-gw/cinemas.dc2.hcl
==> Monitoring evaluation "d51f4331"
    Evaluation triggered by job "cinemas"
    Evaluation within deployment: "bbcb6b10"
    Allocation "c40cfdbc" created: node "3e5954a5", group "mesh-gateway"
    Allocation "d30bf9af" created: node "3e5954a5", group "payment-api"
    Allocation "e2716db5" created: node "3e5954a5", group "notification-api"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "d51f4331" finished with status "complete"
root@nyc-consul-server:/home/vagrant# nomad status cinemas
ID            = cinemas
Name          = cinemas
Submit Date   = 2020-05-21T18:19:30Z
Type          = service
Priority      = 50
Datacenters   = nyc-ncv
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group        Queued  Starting  Running  Failed  Complete  Lost
mesh-gateway      0       0         1        0       0         0
notification-api  0       0         1        0       0         0
payment-api       0       0         1        0       0         0

Latest Deployment
ID          = bbcb6b10
Status      = running
Description = Deployment is running

Deployed
Task Group        Desired  Placed  Healthy  Unhealthy  Progress Deadline
mesh-gateway      1        1       0        0          2020-05-21T18:29:30Z
notification-api  1        1       0        0          2020-05-21T18:29:30Z
payment-api       1        1       0        0          2020-05-21T18:29:30Z

Allocations
ID        Node ID   Task Group        Version  Desired  Status   Created  Modified
c40cfdbc  3e5954a5  mesh-gateway      0        run      running  5s ago   5s ago
d30bf9af  3e5954a5  payment-api       0        run      running  5s ago   3s ago
e2716db5  3e5954a5  notification-api  0        run      running  5s ago   4s ago
root@nyc-consul-server:/home/vagrant# nomad alloc exec -task notification-api e2716db5 sh
/tmp # ping consul.service.consul
ping: bad address 'consul.service.consul'
/tmp # ping mongodb1.query.consul
ping: bad address 'mongodb1.query.consul'
/tmp # exit
root@nyc-consul-server:/home/vagrant# ping consul.service.consul
PING consul.service.consul (172.20.20.21) 56(84) bytes of data.
64 bytes from 172.20.20.21: icmp_seq=1 ttl=64 time=0.275 ms
64 bytes from 172.20.20.21: icmp_seq=2 ttl=64 time=0.073 ms
^C
--- consul.service.consul ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.073/0.174/0.275/0.101 ms
root@nyc-consul-server:/home/vagrant# ping mongodb.query.consul
ping: unknown host mongodb.query.consul
root@nyc-consul-server:/home/vagrant# ping mongodb1.query.consul
ping: unknown host mongodb1.query.consul
root@nyc-consul-server:/home/vagrant# nomad logs e2716db5 notification-api
CONSUL_SCHEME=https
CONSUL_HTTP_SSL=true
CONSUL_IP=consul.service.consul
CONSUL_PORT=8500
SSL IS ENABLED
Setting consul token
Setting custom command as the startup command
Setting secrets for notification-service if exists
Fetching role and secrets...
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://consul.service.consul:8500/v1/kv/cluster/apps/notification-service/auth/role_id
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://consul.service.consul:8500/v1/kv/cluster/apps/notification-service/auth/secret_id
role_id =
secret_id =
Generating secrets template file...
consul {
  address = "consul.service.consul:8500"

  ssl {
    enabled = true
    ca_cert = "/tmp/ca.crt.pem"
  }
}

vault {
  address     = "https://vault.service.consul:8200"
  token       = ""
  renew_token = false

  ssl {
    ca_cert = "/tmp/ca.crt.pem"
  }
}

secret {
  no_prefix = true
  path   = "secret/notification-service"
}Continuing with envconsul based startup...
root@nyc-consul-server:/home/vagrant# nomad logs -stderr e2716db5 notification-api
2020/05/21 18:19:41.092884 [WARN] (view) vault.read(secret/notification-service): vault.read(secret/notification-service): Get https://vault.service.consul:8200/v1/secret/notification-service: dial tcp: lookup vault.service.consul on 8.8.8.8:53: no such host (retry attempt 1 after "250ms")

Now I know why my health checks were failing the first time; so they were indeed working. Now I will try hardcoding the Consul / Vault IPs and see if this workaround gets it working.

And this is the result: my services started working.

root@nyc-consul-server:/vagrant/provision/terraform/tf_cluster/secondary# nomad logs 2e8a5ebf notification-api
CONSUL_SCHEME=https
CONSUL_HTTP_SSL=true
CONSUL_IP=172.20.20.21
CONSUL_PORT=8500
SSL IS ENABLED
Setting consul token
Setting custom command as the startup command
Setting secrets for notification-service if exists
Fetching role and secrets...
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://172.20.20.21:8500/v1/kv/cluster/apps/notification-service/auth/role_id
curl -s --cacert /tmp/ca.crt.pem --header "X-Consul-Token: " https://172.20.20.21:8500/v1/kv/cluster/apps/notification-service/auth/secret_id
role_id = 1b745c14-e624-36de-1c51-1117d2a40174
secret_id = 2f90e3c9-1ec1-864c-2f54-e16b9b462ed6
Exporting VAULT_TOKEN
Vault request
curl --cacert /tmp/ca.crt.pem -s -X POST -d '{role_id:$role_id,secret_id:$secret_id}' https://172.20.20.11:8200/v1/auth/approle/login
Generating secrets template file...
consul {
  address = "172.20.20.21:8500"

  ssl {
    enabled = true
    ca_cert = "/tmp/ca.crt.pem"
  }
}

vault {
  address     = "https://172.20.20.11:8200"
  token       = "s.9nj8zdSknTd3Hum59UYmGIAy"
  renew_token = false

  ssl {
    ca_cert = "/tmp/ca.crt.pem"
  }
}

secret {
  no_prefix = true
  path   = "secret/notification-service"
}Continuing with envconsul based startup...
is CONSUL_HTTP_SSL = true
Setting consul token
Checking cluster state - active or standby
........................Cluster is in active state
Proceeding with startup...
time="2020-05-21T18:45:16Z" level=info msg="--- Notification Service ---"
time="2020-05-21T18:45:16Z" level=info msg="Connecting to notification repository..."
time="2020-05-21T18:45:16Z" level=info msg="Connected to Notification Repository"
time="2020-05-21T18:45:16Z" level=info msg="Starting Notification Service now ..."

   ____    __
  / __/___/ /  ___
 / _// __/ _ \/ _ \
/___/\__/_//_/\___/ v3.3.10-dev
High performance, minimalist Go web framework
https://echo.labstack.com
____________________________________O/_______
                                    O\
⇨ http server started on [::]:3001

Unfortunately, my service connecting to the database in cluster 1 (sfo) through prepared queries couldn't make it:

Checking cluster state - active or standby
Cluster is in active state
Proceeding with startup...
time="2020-05-21T18:47:28Z" level=info msg="--- Payment Service ---"
time="2020-05-21T18:47:28Z" level=info msg="connecting to db ...."

And once again I had to hardcode the address in order for it to work:

time="2020-05-21T18:47:37Z" level=info msg="--- Payment Service ---"
time="2020-05-21T18:47:37Z" level=info msg="connecting to db ...."
time="2020-05-21T18:47:37Z" level=info msg="Connected to Payment Service DB"
time="2020-05-21T18:47:37Z" level=info msg="Connecting to payment repository..."
time="2020-05-21T18:47:37Z" level=info msg="Connected to Payment Repository"
time="2020-05-21T18:47:37Z" level=info msg="Starting Payment Service now ..."

   ____    __
  / __/___/ /  ___
 / _// __/ _ \/ _ \
/___/\__/_//_/\___/ v3.3.10-dev
High performance, minimalist Go web framework
https://echo.labstack.com
____________________________________O/_______
                                    O\
⇨ http server started on [::]:3000

So there is another problem discovered here: service discovery (.consul DNS and prepared queries) doesn't work from inside the containers in the secondary cluster when the cluster is federated with ACLs enabled.

So I will give it one last try with ACLs in allow mode to see what the behavior is; if it is the same, I will be very disappointed not to be able to use the ACL system properly with Connect on a federated cluster.
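Presumably that just means flipping the default policy in the acl stanza shown earlier; a sketch:

acl = {
  enabled        = true
  default_policy = "allow"
  down_policy    = "extend-cache"
}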

any thoughts @shoenig ??

crizstian commented 4 years ago

And voilà, I made it work: Consul service mesh with gateways and ACLs enabled, but with too much tweaking and with service discovery that doesn't work for me in my second cluster (nyc).

These are my services in cluster 1 (sfo):

[Screenshot 2020-05-21 13:58:28]

And these are my services in cluster 2 (nyc):

[Screenshot 2020-05-21 13:59:28]

This is my job file deployed in cluster 1 (sfo):

job "cinemas" {

  datacenters = ["sfo-ncv"]
  region      = "sfo-region"
  type        = "service"

  group "booking-api" {
    count = 1

    network {
      mode = "bridge"

      port "http" {
        static = 3002
        to     = 3002
      }

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "booking-api"
      port = "http"
      tags = ["cinemas-project"]

      check {
        name     = "booking-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "10s"
        timeout  = "3s"
        expose   = true
      }

      connect {
        sidecar_service {
          proxy {
            upstreams {
               destination_name = "payment-api"
               local_bind_port = 8080
            }
            upstreams {
               destination_name = "notification-api"
               local_bind_port = 8081
            }
          }
        }
      }
    }

    task "booking-api" {
      driver = "docker"

      config {
        image   = "crizstian/booking-service-go:v0.4"
      }

      env {
        SERVICE_PORT     = "3002"
        DB_SERVERS       = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"

        CONSUL_IP        = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"

        PAYMENT_URL      = "http://${NOMAD_UPSTREAM_ADDR_payment_api}"
        NOTIFICATION_URL = "http://${NOMAD_UPSTREAM_ADDR_notification_api}"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "mesh-gateway" {
    count = 1

    task "mesh-gateway" {
      driver = "raw_exec"

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-service", "gateway-primary",
          "-address", ":${NOMAD_PORT_proxy}",
          "-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
          "-admin-bind", "127.0.0.1:19005",
          "-token", "daca8d74-b0de-05bf-cc23-e095244e514e",
          "-deregister-after-critical", "5s",
          "--",
          "-l", "debug"
        ]
      }

      resources {
        cpu    = 100
        memory = 100

        network {
          port "proxy" {
            static = 8433
          }
        }
      }
    }
  }
}

And this is my job file deployed in cluster 2 (nyc):

job "cinemas" {

  datacenters = ["nyc-ncv"]
  region      = "nyc-region"
  type        = "service"

  group "payment-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "payment-api"
      port = "3000"

      check {
        name     = "payment-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "5s"
        timeout  = "2s"
        expose   = true
      }

      connect {
        sidecar_service {}
      }
    }

    task "payment-api" {
      driver = "docker"

      config {
        image = "crizstian/payment-service-go:v0.4"
      }

      env {
        DB_SERVERS      = "192.168.15.6:27017,192.168.15.6:27018,192.168.15.6:27019"
        SERVICE_PORT    = "3000"
        CONSUL_IP       = "172.20.20.21"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
        VAULT_ADDR       = "https://172.20.20.11:8200"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "notification-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "notification-api"
      port = "3001"

      check {
        name     = "notification-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "5s"
        timeout  = "2s"
        expose   = true
      }

      connect {
        sidecar_service {}
      }
    }

    task "notification-api" {
      driver = "docker"

      config {
        image   = "crizstian/notification-service-go:v0.4"
      }

      env {
        SERVICE_PORT    = "3001"
        CONSUL_IP       = "172.20.20.21"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
        VAULT_ADDR       = "https://172.20.20.11:8200"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "mesh-gateway" {
    count = 1

    task "mesh-gateway" {
      driver = "raw_exec"

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-service", "gateway-secondary",
          "-address", ":${NOMAD_PORT_proxy}",
          "-wan-address", "172.20.20.21:${NOMAD_PORT_proxy}",
          "-admin-bind", "127.0.0.1:19005",
          "-token", "daca8d74-b0de-05bf-cc23-e095244e514e",
          "-deregister-after-critical", "5s",
        ]
      }

      resources {
        cpu    = 100
        memory = 100

        network {
          port "proxy" {
            static = 8433
          }
        }
      }
    }
  }
}

I have also deployed the following configurations:

root@sfo-consul-server:/vagrant/provision/terraform/tf_cluster/primary# consul config list -kind service-defaults
booking-service
notification-api
payment-api
root@sfo-consul-server:/vagrant/provision/terraform/tf_cluster/primary# consul config list -kind proxy-defaults
global

They pretty much share the same Consul central config; I deployed this with the Terraform Consul provider:

variable "service_defaults_apps" {
  default = [
    {
      name           = "booking-service"
      mesh_resolver  = "local"
    },
    {
      name             = "notification-api"
      mesh_resolver    = "local"
      service_resolver = {
        DefaultSubset = "v1"
        Subsets = {
          "v1" = {
            Filter = "Service.Meta.version == v1"
          }
          "v2" = {
            Filter = "Service.Meta.version == v2"
          }
        }
        Failover = {
          "*" = {
            Datacenters = ["nyc"]
          }
        }
      }
    },
    {
      name             = "payment-api"
      mesh_resolver    = "local"
      service_resolver = {
        Failover = {
          "*" = {
            Datacenters = ["nyc"]
          }
        }
      }
    },
    {
      name             = "count-dashboard"
      mesh_resolver    = "local"
    },
    {
      name             = "count-api"
      mesh_resolver    = "local"
    }
  ]
}

variable "proxy_defaults" {
  default = {
      MeshGateway = {
        Mode = "local"
    }
  }
}

variable "enable_service_defaults" {
  default = false
}
variable "app_config_services" {
  default = []
}

resource "consul_config_entry" "service-defaults" {
  count = var.enable_service_defaults && length(var.app_config_services) > 0 ? length(var.app_config_services) : 0

  name = var.app_config_services[count.index].name
  kind = "service-defaults"

  config_json = jsonencode({
    Protocol    = "http"
    MeshGateway = {
      Mode = var.app_config_services[count.index].mesh_resolver
    }
  })
}

The services still don't appear in the dropdown menu in either datacenter, so I had to enable it with a star -> star = allow intention.

So I think we are almost there in getting Consul working properly; now my problem is that in cluster 2 (nyc) service discovery is not working, which is a huge problem, since that is the principal feature Consul brings us.
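A possible workaround for the .consul resolution inside the bridge-mode containers (a sketch only, not something verified in this thread) is to point the Docker task's DNS at a host resolver that forwards .consul queries to the local Consul agent's DNS interface (port 8600 by default):

      config {
        image = "crizstian/notification-service-go:v0.4"

        # Assumption: 172.20.20.21 answers DNS on port 53 and forwards *.consul
        # queries to Consul (e.g. via dnsmasq or systemd-resolved).
        dns_servers        = ["172.20.20.21"]
        dns_search_domains = ["service.consul"]
      }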

idrennanvmware commented 4 years ago

Thanks for sharing that @Crizstian! Looks like the big catch was "service-defaults" ?

Can someone from Hashicorp comment on all of the above? This is a huge feature that's being highlighted but seems like it's extremely easy to get set up incorrectly and also that there are gaps in the documentation. Would be good to get some sort of official response on what should be happening here.

As I read the docs, the "proxy-defaults" entry should be enough.
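For reference, a minimal proxy-defaults entry matching the Terraform variable above would presumably look like this when written with consul config write:

Kind = "proxy-defaults"
Name = "global"

MeshGateway {
  Mode = "local"
}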

crizstian commented 4 years ago

Perhaps @nickethier or @lkysow from HashiCorp have any comments about this?

idrennanvmware commented 4 years ago

@Crizstian - I decided to take Nomad out of the mix and run all this with native Consul - no dice, even following this tutorial:

https://learn.hashicorp.com/consul/developer-mesh/connect-gateways

idrennanvmware commented 4 years ago

If I add a service-resolver to the config, I can get this to work too.

I verified that if I shut the gateway down in the primary cluster, things stop working downstream (so I know it's actually using the gateway), and when I start the gateway up again then it all works.

WHEW!

kind = "service-resolver" name = "database-proxy" redirect { service = "database" datacenter = "primary-cluster" }

crizstian commented 4 years ago

I have tried everything from scratch, doing everything step by step; since I had done this before, I had almost everything automated. These are my steps and my findings.

1.- Create two Consul/Nomad clusters
2.- Bootstrap the primary datacenter
3.- Set the default and replication tokens to the primary Consul root token
4.- Configure the secondary cluster: federate the second cluster with retry_join_wan and set the default and replication token to the primary root token

Up to here everything looks OK.

5.- Deploy my frontend services in the primary cluster
6.- Deploy the primary mesh-gateway with the same job

First problem found here: I deploy my mesh gateway using Nomad as another service with the raw_exec driver, like the following:

  group "mesh-gateway" {
    count = 1

    task "mesh-gateway" {
      driver = "raw_exec"

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-service", "gateway-primary",
          "-address", ":${NOMAD_PORT_proxy}",
          "-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
          "-admin-bind", "127.0.0.1:19005",
          "-token", "c6759f14-1005-675c-1db6-18132ada0a39",
          "-deregister-after-critical", "5s",
          "--",
          "-l", "debug"
        ]
      }

      resources {       
        network {
          port "proxy" {
            static = 8433
          }
        }
      }
    }
  }

How can I avoid hardcoding the token value? My token is set as an environment variable on the host; I would like the raw_exec task to read the token value from the host environment variables.
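One possible way to keep the token out of the job file (a sketch, assuming the token is also stored in Consul KV or Vault, since Nomad tasks do not inherit the host's environment variables) is a template block rendered into the task environment; the consul CLI then picks up CONSUL_HTTP_TOKEN, so the -token argument can be dropped:

    task "mesh-gateway" {
      driver = "raw_exec"

      # Hypothetical KV path holding the token.
      template {
        destination = "secrets/consul.env"
        env         = true
        data        = <<EOF
CONSUL_HTTP_TOKEN={{ key "cluster/consul/gateway-token" }}
EOF
      }

      # config { command = "consul" ... } as in the job above, minus "-token"
    }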

7.- I deployed my backend services in the secondary cluster and the secondary mesh gateway with the same process as the primary.

Then, for configuring the Consul central config to enable Consul Connect with federation, I have Terraform code to create all the required Consul configs, so I can do this in an automated and version-controlled way.

Because I created all these configs with Terraform and saw that they were apparently created properly, I thought everything was OK, so I ignored the Terraform Consul behavior. After one week struggling to make the service mesh with mesh gateways work, I discovered that the Terraform Consul provider doesn't work for Consul intentions and Consul central config. First I discovered that Consul intentions weren't working, so I created an issue in the Terraform Consul provider repo: https://github.com/terraform-providers/terraform-provider-consul/issues/194

And today I discovered that Consul central config entries are also not working when created with Terraform, so I will report this in that issue as well.

8.- I had to create the Consul central config using the Consul CLI (an example of one of these files is sketched after the list):

 - consul config write /vagrant/provision/consul/central_config/mesh-gateway/notification-api-defaults.hcl
 - consul config write /vagrant/provision/consul/central_config/mesh-gateway/booking-defaults.hcl
 - consul config write /vagrant/provision/consul/central_config/mesh-gateway/payment-api-defaults.hcl
 - consul config write /vagrant/provision/consul/central_config/failover/notification-api-resolver.hcl
 - consul config write /vagrant/provision/consul/central_config/failover/payment-api-resolver.hcl
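For example, based on the Terraform variables shown earlier, payment-api-resolver.hcl would presumably contain something like this (a sketch, not the author's actual file):

Kind = "service-resolver"
Name = "payment-api"

Failover = {
  "*" = {
    Datacenters = ["nyc"]
  }
}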

9.- Create consul intentions in consul cli / ui, since it doesn't work with terraform code as well

but here I had encounter another problem. I can't see all the services running in both datacenters in the consul intention dropdown menu, so I had to create a star -> star, action allow.

after doing this I could finally be able to setup all the consul connect features and mesh gateways as well and my frontend services were able to communicate to the backend services deployed in different datacenters.

On the Nomad side there appears to be only one issue: how can the raw_exec task that runs the Consul mesh gateway read my Consul token from the host environment variables?

There is also a Consul issue, on the intentions side: Consul is not showing all my services, so I am not able to create intentions properly by specifying which service is allowed to talk to which service.

Finally, there are problems with the Terraform Consul provider, which does not create the configurations specified in the Terraform code. This is disappointing, because it means I cannot automate these Consul configurations.

I hope this can be fixed soon, or perhaps I am missing some small step that is leading me to these errors.

I hope I can continue testing the new mesh gateways for Consul Connect.

crizstian commented 4 years ago

Hi @shoenig, here are my config files, as you requested in the Nomad community office hours on YouTube.

Consul files: consul.hcl.tmpl

data_dir = "/var/consul/config/"
log_level = "DEBUG"

datacenter         = "{{ env "DATACENTER" }}"
primary_datacenter = "{{ env "PRIMARY_DATACENTER" }}"

ui     = true
server = true
bootstrap_expect = {{ env "CONSUL_SERVERS" }}

bind_addr   = "0.0.0.0"
client_addr = "0.0.0.0"

ports {
  grpc  = 8502
  https = {{ if eq (env "CONSUL_SSL") "true" }}{{ env "CONSUL_PORT" }}{{ else }}-1{{end}}
  http  = {{ if eq (env "CONSUL_SSL") "true" }}-1{{ else }}{{ env "CONSUL_PORT" }}{{end}}
}

advertise_addr     = "{{ env "HOST_IP" }}"
advertise_addr_wan = "{{ env "HOST_IP" }}"

{{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
retry_join_wan = {{ env "HOST_LIST" }}
{{end}}

enable_central_service_config = true

connect {
  enabled = true
}

acl = {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  # enable_token_persistence = true
  {{ if eq (env "DATACENTER") (env "PRIMARY_DATACENTER") }}{{else}}
  enable_token_replication = true
  {{end}}
  tokens = {
    default     = "{{ env "CONSUL_HTTP_TOKEN" }}"
    replication = "{{ env "CONSUL_HTTP_TOKEN" }}"
  }
}

verify_incoming        = false
verify_incoming_rpc    = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_outgoing        = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
verify_server_hostname = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}

auto_encrypt = {
  allow_tls = {{ if eq (env "CONSUL_SSL") "true" }}true{{ else }}false{{end}}
}

{{ if eq (env "CONSUL_SSL") "true" }}
ca_file    = "{{ env "CONSUL_CACERT" }}"
cert_file  = "{{ env "CONSUL_CLIENT_CERT" }}"
key_file   = "{{ env "CONSUL_CLIENT_KEY" }}"
{{end}}

encrypt = "{{ env "CONSUL_ENCRYPT_KEY" }}"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

telemetry = {
  dogstatsd_addr   = "10.0.2.15:8125"
  disable_hostname = true
}

This is the base template for both datacenters. For datacenter 1, consul-template renders it into the following:

consul.hcl

data_dir = "/var/consul/config/"
log_level = "DEBUG"

datacenter         = "sfo"
primary_datacenter = "sfo"

ui     = true
server = true
bootstrap_expect = 1

bind_addr   = "0.0.0.0"
client_addr = "0.0.0.0"

ports {
  grpc  = 8502
  https = 8500
  http  = -1
}

advertise_addr     = "172.20.20.11"
advertise_addr_wan = "172.20.20.11"

enable_central_service_config = true

connect {
  enabled = true
}

acl = {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  # enable_token_persistence = true

  tokens = {
    default     = "45777651-66a1-4042-9479-cbcce7c775ac"
    replication = "45777651-66a1-4042-9479-cbcce7c775ac"
  }
}

verify_incoming        = false
verify_incoming_rpc    = true
verify_outgoing        = true
verify_server_hostname = true

auto_encrypt = {
  allow_tls = true
}

ca_file    = "/var/vault/config/ca.crt.pem"
cert_file  = "/var/vault/config/server.crt.pem"
key_file   = "/var/vault/config/server.key.pem"

encrypt = "apEfb4TxRk3zGtrxxAjIkwUOgnVkaD88uFyMGHqKjIw="
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

telemetry = {
  dogstatsd_addr   = "10.0.2.15:8125"
  disable_hostname = true
}

This is the final render after I have bootstrapped the Consul ACL system; before that point it is the same render, just without the Consul token.
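
The bootstrap itself is the standard flow (a sketch of the manual step; the exported variable is what the env function in the template above picks up):

    # on the primary datacenter, once the Consul servers are up
    consul acl bootstrap                  # prints the initial management token (SecretID)
    export CONSUL_HTTP_TOKEN=<SecretID>   # consumed by consul-template when re-rendering consul.hcl
    # re-render the config and restart the agents so the default/replication tokens take effect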

For datacenter 2, this is the final render (consul.hcl):

data_dir = "/var/consul/config/"
log_level = "DEBUG"

datacenter         = "nyc"
primary_datacenter = "sfo"

ui     = true
server = true
bootstrap_expect = 1

bind_addr   = "0.0.0.0"
client_addr = "0.0.0.0"

ports {
  grpc  = 8502
  https = 8500
  http  = -1
}

advertise_addr     = "172.20.20.21"
advertise_addr_wan = "172.20.20.21"

retry_join_wan = ["172.20.20.11","172.20.20.21"]

enable_central_service_config = true

connect {
  enabled = true
}

acl = {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  # enable_token_persistence = true

  enable_token_replication = true

  tokens = {
    default     = "45777651-66a1-4042-9479-cbcce7c775ac"
    replication = "45777651-66a1-4042-9479-cbcce7c775ac"
  }
}

verify_incoming        = false
verify_incoming_rpc    = true
verify_outgoing        = true
verify_server_hostname = true

auto_encrypt = {
  allow_tls = true
}

ca_file    = "/var/vault/config/ca.crt.pem"
cert_file  = "/var/vault/config/server.crt.pem"
key_file   = "/var/vault/config/server.key.pem"

encrypt = "apEfb4TxRk3zGtrxxAjIkwUOgnVkaD88uFyMGHqKjIw="
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

telemetry = {
  dogstatsd_addr   = "10.0.2.15:8125"
  disable_hostname = true
}

My Nomad files are the following. nomad.hcl.tmpl:

bind_addr  = "{{ env "HOST_IP" }}"
datacenter =  "{{ env "DATACENTER" }}-ncv"
region     =  "{{ env "DATACENTER" }}-region"
data_dir   = "/var/nomad/data"
log_level  = "DEBUG"

leave_on_terminate   = true
leave_on_interrupt   = true
disable_update_check = true

client {
    enabled = true
    host_volume "ca-certificates" {
        path      = "/var/vault/config"
        read_only = true
    }
}
addresses {
    rpc  = "{{ env "HOST_IP" }}"
    http = "{{ env "HOST_IP" }}"
    serf = "{{ env "HOST_IP" }}"
}
advertise {
    http = "{{ env "HOST_IP" }}:4646"
    rpc  = "{{ env "HOST_IP" }}:4647"
    serf = "{{ env "HOST_IP" }}:4648"
}
consul {
    address = "{{ env "HOST_IP" }}:8500"

    client_service_name = "nomad-{{ env "DATACENTER" }}-client"
    server_service_name = "nomad-{{ env "DATACENTER" }}-server"

    auto_advertise      = true
    server_auto_join    = true
    client_auto_join    = true

    ca_file    = "{{ env "CONSUL_CACERT" }}"
    cert_file  = "{{ env "CONSUL_CLIENT_CERT" }}"
    key_file   = "{{ env "CONSUL_CLIENT_KEY" }}"
    ssl        = {{ env "CONSUL_SSL" }}
    verify_ssl = {{ env "CONSUL_SSL" }}

    token   = "{{ env "CONSUL_HTTP_TOKEN" }}"
}

server {
    enabled = true
    bootstrap_expect = {{ env "NOMAD_SERVERS" }}
}

tls {
    http = true
    rpc  = true

    ca_file    = "{{ env "NOMAD_CACERT" }}"
    cert_file  = "{{ env "NOMAD_CLIENT_CERT" }}"
    key_file   = "{{ env "NOMAD_CLIENT_KEY" }}"

    verify_https_client    = false
    verify_server_hostname = true
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

The rendered file for datacenter 1 is the following (nomad.hcl):

bind_addr  = "172.20.20.11"
datacenter =  "sfo-ncv"
region     =  "sfo-region"
data_dir   = "/var/nomad/data"
log_level  = "DEBUG"

leave_on_terminate   = true
leave_on_interrupt   = true
disable_update_check = true

client {
    enabled = true
    host_volume "ca-certificates" {
        path      = "/var/vault/config"
        read_only = true
    }
}
addresses {
    rpc  = "172.20.20.11"
    http = "172.20.20.11"
    serf = "172.20.20.11"
}
advertise {
    http = "172.20.20.11:4646"
    rpc  = "172.20.20.11:4647"
    serf = "172.20.20.11:4648"
}
consul {
    address = "172.20.20.11:8500"

    client_service_name = "nomad-sfo-client"
    server_service_name = "nomad-sfo-server"

    auto_advertise      = true
    server_auto_join    = true
    client_auto_join    = true

    ca_file    = "/var/vault/config/ca.crt.pem"
    cert_file  = "/var/vault/config/server.crt.pem"
    key_file   = "/var/vault/config/server.key.pem"
    ssl        = true
    verify_ssl = true

    token   = "45777651-66a1-4042-9479-cbcce7c775ac"
}

server {
    enabled = true
    bootstrap_expect = 1
}

tls {
    http = true
    rpc  = true

    ca_file    = "/var/vault/config/ca.crt.pem"
    cert_file  = "/var/vault/config/server.crt.pem"
    key_file   = "/var/vault/config/server.key.pem"

    verify_https_client    = false
    verify_server_hostname = true
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

This is rendered after the Consul setup is done. The Nomad file for the second datacenter (nomad.hcl) is:

bind_addr  = "172.20.20.21"
datacenter =  "nyc-ncv"
region     =  "nyc-region"
data_dir   = "/var/nomad/data"
log_level  = "DEBUG"

leave_on_terminate   = true
leave_on_interrupt   = true
disable_update_check = true

client {
    enabled = true
    host_volume "ca-certificates" {
        path      = "/var/vault/config"
        read_only = true
    }
}
addresses {
    rpc  = "172.20.20.21"
    http = "172.20.20.21"
    serf = "172.20.20.21"
}
advertise {
    http = "172.20.20.21:4646"
    rpc  = "172.20.20.21:4647"
    serf = "172.20.20.21:4648"
}
consul {
    address = "172.20.20.21:8500"

    client_service_name = "nomad-nyc-client"
    server_service_name = "nomad-nyc-server"

    auto_advertise      = true
    server_auto_join    = true
    client_auto_join    = true

    ca_file    = "/var/vault/config/ca.crt.pem"
    cert_file  = "/var/vault/config/server.crt.pem"
    key_file   = "/var/vault/config/server.key.pem"
    ssl        = true
    verify_ssl = true

    token   = "45777651-66a1-4042-9479-cbcce7c775ac"
}

server {
    enabled = true
    bootstrap_expect = 1
}

tls {
    http = true
    rpc  = true

    ca_file    = "/var/vault/config/ca.crt.pem"
    cert_file  = "/var/vault/config/server.crt.pem"
    key_file   = "/var/vault/config/server.key.pem"

    verify_https_client    = false
    verify_server_hostname = true
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

Consul federation is done with the retry_join_wan attribute in the second datacenter's Consul config file.

What I am testing is the Consul Connect mesh gateway setup, with three services plus the gateways:

- 1 frontend service: the booking service
- 2 backend services: the payment and notification services
- 1 mesh gateway in each datacenter

My Nomad jobs are the following. cinemas.dc1.hcl:

job "cinemas" {

  datacenters = ["sfo-ncv"]
  region      = "sfo-region"
  type        = "service"

  group "booking-api" {
    count = 1

    network {
      mode = "bridge"

      port "http" {
        static = 3002
        to     = 3002
      }

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "booking-api"
      port = "http"
      tags = ["cinemas-project"]

      check {
        name     = "booking-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "10s"
        timeout  = "3s"
        expose   = true
      }

      connect {
        sidecar_service {
          proxy {
            upstreams {
               destination_name = "payment-api"
               local_bind_port = 8080
            }
            upstreams {
               destination_name = "notification-api"
               local_bind_port = 8081
            }
          }
        }
      }
    }

    task "booking-api" {
      driver = "docker"

      config {
        image   = "crizstian/booking-service-go:v0.4"
      }

      env {
        SERVICE_PORT     = "3002"
        DB_SERVERS       = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"

        CONSUL_IP        = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"

        PAYMENT_URL      = "http://${NOMAD_UPSTREAM_ADDR_payment_api}"
        NOTIFICATION_URL = "http://${NOMAD_UPSTREAM_ADDR_notification_api}"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "mesh-gateway" {
    count = 1

    task "mesh-gateway" {
      driver = "raw_exec"

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-service", "gateway-primary",
          "-address", ":${NOMAD_PORT_proxy}",
          "-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
          "-admin-bind", "127.0.0.1:19005",
          "-token", "c6759f14-1005-675c-1db6-18132ada0a39",
          "-deregister-after-critical", "5s",
          "--",
          "-l", "debug"
        ]
      }

      resources {
        cpu    = 100
        memory = 100

        network {
          port "proxy" {
            static = 8433
          }
        }
      }
    }
  }
}

and cinemas.dc2.hcl

job "cinemas" {

  datacenters = ["nyc-ncv"]
  region      = "nyc-region"
  type        = "service"

  group "payment-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "payment-api"
      port = "3000"

      check {
        name     = "payment-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "5s"
        timeout  = "2s"
        expose   = true
      }

      connect {
        sidecar_service {}
      }
    }

    task "payment-api" {
      driver = "docker"

      config {
        image = "crizstian/payment-service-go:v0.4"
      }

      env {
        DB_SERVERS      = "mongodb1.query.consul:27017,mongodb2.query.consul:27018,mongodb3.query.consul:27019"
        SERVICE_PORT    = "3000"
        CONSUL_IP       = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "notification-api" {
    count = 1

    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

      service {
      name = "notification-api"
      port = "3001"

      check {
        name     = "notification-api-health"
        port     = "healthcheck"
        type     = "http"
        protocol = "http"
        path     = "/ping"
        interval = "5s"
        timeout  = "2s"
        expose   = true
      }

      connect {
        sidecar_service {}
      }
    }

    task "notification-api" {
      driver = "docker"

      config {
        image   = "crizstian/notification-service-go:v0.4"
      }

      env {
        SERVICE_PORT    = "3001"
        CONSUL_IP       = "consul.service.consul"
        CONSUL_SCHEME   = "https"
        CONSUL_HTTP_SSL = "true"
      }

      resources {
        cpu    = 50
        memory = 50
      }
    }
  }

  group "mesh-gateway" {
    count = 1

    task "mesh-gateway" {
      driver = "raw_exec"

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-service", "gateway-secondary",
          "-address", ":${NOMAD_PORT_proxy}",
          "-wan-address", "172.20.20.21:${NOMAD_PORT_proxy}",
          "-admin-bind", "127.0.0.1:19005",
          "-token", "c6759f14-1005-675c-1db6-18132ada0a39",
          "-deregister-after-critical", "5s",
        ]
      }

      resources {
        cpu    = 100
        memory = 100

        network {
          port "proxy" {
            static = 8433
          }
        }
      }
    }
  }
}

As you can see, in both datacenters I need to set my Consul token for the mesh gateway task, and I would like not to hardcode it. I would like to read it from my host environment variables, or to understand why Nomad does not use the token set in its own configuration to register this service.
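
An alternative to the template sketch above, assuming the token file already exists on each host (the path below is hypothetical, and I have not verified the flag against this exact Consul version): the consul CLI also accepts a -token-file flag, so the job would only reference a path instead of the secret itself:

      config {
        command = "consul"
        args    = [
          "connect", "envoy",
          "-mesh-gateway",
          "-register",
          "-service", "gateway-primary",
          "-address", ":${NOMAD_PORT_proxy}",
          "-wan-address", "172.20.20.11:${NOMAD_PORT_proxy}",
          "-admin-bind", "127.0.0.1:19005",
          # read the ACL token from a file on the host instead of inlining it
          # (hypothetical path)
          "-token-file", "/etc/consul.d/gateway.token",
          "-deregister-after-critical", "5s",
        ]
      }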

lkysow commented 4 years ago

@Crizstian re

There is also a Consul issue, on the intentions side: Consul is not showing all my services, so I am not able to create intentions properly by specifying which service is allowed to talk to which service.

This is a known issue: https://github.com/hashicorp/consul/issues/7390. However, you can just type the name of the service in the other DC into the dropdown, and the intention will still work.

crizstian commented 4 years ago

@Crizstian re

There is also a Consul issue, on the intentions side: Consul is not showing all my services, so I am not able to create intentions properly by specifying which service is allowed to talk to which service.

This is a known issue: hashicorp/consul#7390. However, you can just type the name of the service in the other DC into the dropdown, and the intention will still work.

@lkysow I tried what you suggested, but I ended up using the Terraform code instead and it worked, so I believe this is not a huge problem if we are using Terraform to create the intentions.

The only thing I see is that if my services are not yet registered in Consul and I create my intentions with Terraform first and then deploy my services, it doesn't work; but if I deploy my services first and then create my intentions, it works. Is this the expected behavior?

lkysow commented 4 years ago

The only thing I see is that if my services are not yet registered in Consul and I create my intentions with Terraform first and then deploy my services, it doesn't work; but if I deploy my services first and then create my intentions, it works. Is this the expected behavior?

You can create intentions without a service of that name existing, so I don't think this is expected behaviour.