FYI, I got a notification that issue label workflow failed on this.
Hi @dschaaff,
Thanks for reaching out! There are a few pieces of additional information that would help us dig into this further:

- Which version of `envoy` is on your path (send the output of `envoy --version`)? We should make sure it's actually 1.16.4 (since `-envoy-version=1.16.4` tells Consul how to interact with the running Envoy proxy, but has no effect on which version of Envoy is running).
- You mentioned this also affects `consul-k8s`. Can you share a scrubbed version of your `values.yaml` file?

```
envoy --version
envoy version: bf5d0eb44b781ac26ff1513700bcb114b7cf4300/1.16.4/Clean/RELEASE/BoringSSL
```
The main thing I notice in the Envoy logs is that it only sets up the prometheus listener and the local admin listener, though I could definitely be missing something.
Hi @dschaaff,
Thank you for providing that information. We don't see any obvious issues in what you've provided so far, so we'd need some additional information to explore further:
While we don't expect the following will change anything for you, we think it might be worth a try: you could disable the streaming backend on agents by setting `use_streaming_backend` to `false` (so that traditional blocking queries are used instead) and see if anything changes. Defaulting to the streaming backend is something that changed between your two versions (1.9.5 and 1.10.1); we don't expect it to be related to what you are seeing, but it could be.
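For reference, a minimal sketch of that agent configuration change, assuming an HCL config file (the file path is illustrative):

```hcl
# /etc/consul.d/streaming.hcl (illustrative path; any agent config file works)
# Fall back to traditional blocking queries instead of the streaming
# backend that became the default in 1.10.
use_streaming_backend = false
```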
@jkirschner-hashicorp
I disabled streaming on both servers and agents and did not see any change in behavior.
I'm attaching several files. Let me know if you need any more information. Appreciate it!

- consul1.10_envoy_config_dump.txt
- consul_1.9_envoy_dump.txt
- consul-agent-1.10-log.txt
- consul-1.10-server.log
I should have done this sooner, but I went ahead and tested the behavior with agent versions 1.9.6, 1.9.7, and 1.9.8. Each of those three versions exhibits the same behavior described in the ticket: no Envoy upstream listeners are configured.
Once I downgrade to 1.9.5, everything starts working as expected. I've reviewed the changelogs and nothing jumps out at me as a major change. I'd appreciate any guidance on next steps to figure this out.
OK, I am making progress: Envoy listeners are configured correctly if my sidecar service definition includes an ACL token.
```json
{
  "service": {
    "name": "ui-sidecar-proxy",
    "id": "ui-sidecar-proxy",
    "port": 20000,
    "address": "10.20.203.196",
    "kind": "connect-proxy",
    "token": "redacted",
    "proxy": {
      "destination_service_id": "ui",
      "destination_service_name": "ui",
      "local_service_address": "127.0.0.1",
      "local_service_port": 80,
      "mode": "direct",
      "upstreams": [
        {
          "destination_name": "upstream-1",
          "destination_type": "service",
          "local_bind_port": 2000
        },
        {
          "destination_name": "upstream-2",
          "destination_type": "service",
          "local_bind_port": 3000
        }
      ]
    },
    "meta": {
      "availability_zone": "us-west-2a"
    },
    "checks": [
      {
        "deregister_critical_service_after": "10m",
        "interval": "10s",
        "name": "Proxy Public Listener",
        "tcp": "10.20.203.196:20000"
      },
      {
        "alias_service": "ui",
        "name": "Destination Alias"
      }
    ]
  }
}
```
The clue was the RPC permission denied errors in the log. We previously did not include a token in the sidecar service definition. The service was registered using the Consul agent's default token, which had `service:write` permission. The Envoy config was then bootstrapped with a dedicated ACL token whose permissions were scoped to the service:
service "admin-ui" {
policy = "write"
}
service "admin-ui-sidecar-proxy" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
node_prefix "" {
policy = "read"
}
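For context, a scoped token like this could be created roughly as follows (the policy name and rules file name are illustrative, not from the original setup):

```shell
# Create the policy from the rules above, saved locally as sidecar-policy.hcl
consul acl policy create -name "admin-ui-sidecar" -rules @sidecar-policy.hcl

# Mint a token carrying that policy; its SecretID is what gets passed to
# `consul connect envoy` (via -token or the CONSUL_HTTP_TOKEN env var)
consul acl token create -description "admin-ui sidecar proxy" -policy-name "admin-ui-sidecar"
```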
See the systemd unit file in my original comment for where that token is passed. This setup worked until agent version 1.9.6; starting with that version, I have to include the token in the sidecar service registration for Envoy to be configured correctly. This is also the reason our k8s-injected pods were not configuring Envoy correctly: the consul-k8s service template does not include the token field (https://github.com/hashicorp/consul-k8s/blob/v0.25.0/connect-inject/container_init.go#L320):
```hcl
services {
  id      = "${SERVICE_ID}"
  name    = "admin-ui"
  address = "${POD_IP}"
  port    = 80
  meta = {
    namespace     = "admin-ui"
    pod-name      = "${POD_NAME}"
    k8s-namespace = "${POD_NAMESPACE}"
  }
}
services {
  id      = "${PROXY_SERVICE_ID}"
  name    = "admin-ui-sidecar-proxy"
  kind    = "connect-proxy"
  address = "${POD_IP}"
  port    = 20000
  meta = {
    availability_zone = "us-west-2a"
    namespace         = "admin-ui"
    pod-name          = "${POD_NAME}"
    k8s-namespace     = "${POD_NAMESPACE}"
  }
}
```
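As a hypothetical workaround, the sidecar registration could carry the token directly via the service definition's `token` field; a minimal sketch (the `CONSUL_ACL_TOKEN` variable name is my own placeholder, not part of the real consul-k8s template):

```hcl
services {
  id      = "${PROXY_SERVICE_ID}"
  name    = "admin-ui-sidecar-proxy"
  kind    = "connect-proxy"
  address = "${POD_IP}"
  port    = 20000
  # Hypothetical fix: register the proxy with the same ACL token that Envoy
  # uses, so the agent watches for this service run with service-scoped
  # permissions rather than the agent's default token.
  token = "${CONSUL_ACL_TOKEN}"
}
```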
Is this expected behavior or a bug?
This behavior possibly changed due to https://github.com/hashicorp/consul/pull/10188.
Thanks for tracking that down! It does seem likely that #10188 is related based on what you found. To help debug the problem, can you tell me more about which of these tokens (https://www.consul.io/docs/security/acl/acl-system#configuring-acls) you have configured?
You mentioned that `acl.tokens.default` is configured with `service:write`. Do you also have an `acl.tokens.agent` configured, and if so, which permissions does it have?
That PR (and #9683 which preceded it) should only have changed the behaviour for de-registering services, but it sounds like that is not the case.
If you have the full error text of the RPC permission denied errors, that would also be helpful to debug.
This is my ACL config on the EC2-based nodes:
"acl": {
"default_policy": "deny",
"down_policy": "extend-cache",
"enable_token_persistence": true,
"enabled": true,
"token_ttl": "30s",
"tokens": {
"agent": "redacted",
"default": "redacted"
}
},
`acl.tokens.default` and `acl.tokens.agent` are both set to the same token. That token has the following ACL policy attached:
node_prefix "" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
service_prefix "" {
policy = "write"
}
service "" {
policy = "write"
}
I can successfully register the service and the proxy sidecar with the local agent without specifying a token in the sidecar service definition.
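For illustration, assuming the service definition shown earlier is saved as a JSON file (the file name is mine), the registration itself succeeds with:

```shell
# Register the service and sidecar with the local agent; the payload carries
# no token, so the agent's acl.tokens.default is used for the registration
consul services register /etc/consul.d/ui-sidecar.json
```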
```shell
curl -s http://localhost:8500/v1/agent/services | python -m json.tool
```

```json
{
    "admin-ui": {
        "Address": "10.20.212.252",
        "Datacenter": "stg-datacenter",
        "EnableTagOverride": false,
        "ID": "admin-ui",
        "Port": 0,
        "Service": "admin-ui",
        "SocketPath": "",
        "TaggedAddresses": {
            "lan_ipv4": {
                "Address": "10.20.212.252",
                "Port": 0
            },
            "wan_ipv4": {
                "Address": "10.20.212.252",
                "Port": 0
            }
        },
        "Weights": {
            "Passing": 1,
            "Warning": 1
        }
    },
    "admin-ui-sidecar-proxy": {
        "Address": "10.20.212.252",
        "Datacenter": "stg-datacenter",
        "EnableTagOverride": false,
        "ID": "admin-ui-sidecar-proxy",
        "Kind": "connect-proxy",
        "Port": 20000,
        "Proxy": {
            "Config": {
                "envoy_prometheus_bind_addr": "0.0.0.0:9102",
                "protocol": "http"
            },
            "DestinationServiceID": "admin-ui",
            "DestinationServiceName": "admin-ui",
            "Expose": {},
            "LocalServiceAddress": "127.0.0.1",
            "LocalServicePort": 80,
            "MeshGateway": {},
            "Upstreams": [
                {
                    "Config": {
                        "protocol": "http"
                    },
                    "DestinationName": "service-a",
                    "DestinationType": "service",
                    "LocalBindPort": 1000,
                    "MeshGateway": {}
                },
                {
                    "Config": {
                        "protocol": "http"
                    },
                    "DestinationName": "service-b",
                    "DestinationType": "service",
                    "LocalBindPort": 2000,
                    "MeshGateway": {}
                },
                {
                    "Config": {
                        "protocol": "http"
                    },
                    "DestinationName": "service-c",
                    "DestinationType": "service",
                    "LocalBindPort": 3000,
                    "MeshGateway": {}
                },
                {
                    "Config": {
                        "protocol": "http"
                    },
                    "DestinationName": "service-d",
                    "DestinationType": "service",
                    "LocalBindPort": 4000,
                    "MeshGateway": {}
                },
                {
                    "Config": {
                        "protocol": "http"
                    },
                    "DestinationName": "service-e",
                    "DestinationType": "service",
                    "LocalBindPort": 5500,
                    "MeshGateway": {}
                }
            ]
        },
        "Service": "admin-ui-sidecar-proxy",
        "SocketPath": "",
        "TaggedAddresses": {
            "lan_ipv4": {
                "Address": "10.20.212.252",
                "Port": 20000
            },
            "wan_ipv4": {
                "Address": "10.20.212.252",
                "Port": 20000
            }
        },
        "Weights": {
            "Passing": 1,
            "Warning": 1
        }
    }
}
```
The problem occurs when I bootstrap the Envoy config and then start Envoy. At that point, the logs are flooded with the following (see https://github.com/hashicorp/consul/issues/10714#issuecomment-895622746 for the full log file):
```
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.569Z [ERROR] agent.proxycfg: failed to handle update from watch: service_id=admin-ui-sidecar-proxy id=discovery-chain:service-a error="error filling agent cache: rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.570Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.209.189:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.571Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.202.159:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.571Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.208.112:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.634Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.209.189:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.719Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.202.159:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.976Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.208.112:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.976Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: Permission denied" index=0
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.976Z [ERROR] agent.proxycfg: failed to handle update from watch: service_id=admin-ui-sidecar-proxy id=discovery-chain:service-b error="error filling agent cache: rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.976Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.209.189:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.977Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.202.159:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:53 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:53.978Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.208.112:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.152Z [ERROR] agent.client: RPC failed to server: method=Intention.Match server=10.20.209.189:8300 error="rpc error making call: rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.158Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.202.159:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.403Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.208.112:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.447Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.209.189:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.484Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.202.159:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.655Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.208.112:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.712Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.209.189:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.792Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.202.159:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.867Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.20.208.112:8300 error="rpc error making call: Permission denied"
Aug 09 22:48:54 ip-10-20-212-252.us-west-2.compute.internal consul[14050]: 2021-08-09T22:48:54.867Z [WARN] agent.cache: handling error in Cache.Notify:
```
Envoy is configured with its own ACL token carrying the following policy. I have confirmed this token is correctly set in the `x-consul-token` field of the Envoy bootstrap config.
service "admin-ui" {
policy = "write"
}
service "admin-ui-sidecar-proxy" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
node_prefix "" {
policy = "read"
}
If the token that Envoy is using is added to the sidecar service definition, the permission errors go away and Envoy is configured as expected.
Overview of the Issue
I am in the process of testing the upgrade of our Consul Connect setup from 1.9.5 to 1.10.1. After upgrading the Consul agent to 1.10.1, none of the expected listeners are added to Envoy, breaking service mesh communication.
Reproduction Steps
Steps to reproduce this issue:

1. Register connect services. Here is my service config. In my case, the upstreams are registered on the nodes containing those services.
2. I then bootstrap the Envoy config with `consul connect envoy -envoy-version=1.16.4 -proxy-id=ui-sidecar-proxy -bootstrap` on the node (see the sketch after this list). Envoy is then started by a systemd service. With the 1.9 Consul agent, Envoy is correctly configured with the expected listeners.
3. Next, upgrade the Consul agent to 1.10. After upgrading the agent, run the `consul connect envoy` bootstrap command again and then restart Envoy with the new bootstrap config. No Envoy listeners are created apart from the local admin and prometheus endpoints.
4. At this point no traffic can be sent over the service mesh.
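For reference, a minimal sketch of steps 2-3 as shell commands (the bootstrap file path and systemd unit name are assumptions, not from the original setup):

```shell
# Generate the Envoy bootstrap config through the local Consul agent
consul connect envoy \
  -envoy-version=1.16.4 \
  -proxy-id=ui-sidecar-proxy \
  -bootstrap > /etc/envoy/bootstrap.json

# Envoy is then (re)started against that config, e.g. via a systemd unit
# whose ExecStart is roughly: /usr/bin/envoy -c /etc/envoy/bootstrap.json
systemctl restart envoy
```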
Consul info for both Client and Server
Client info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease =
	revision = db839f18
	version = 1.10.1
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = arm64
	cpu_count = 2
	goroutines = 195
	max_procs = 2
	os = linux
	version = go1.16.6
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 205547
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 647721
	members = 45
	query_queue = 0
	query_time = 1
```

Server info

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = db839f18
	version = 1.10.1
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.20.208.112:8300
	server = true
raft:
	applied_index = 373096103
	commit_index = 373096103
	fsm_pending = 0
	last_contact = 4.978615ms
	last_log_index = 373096103
	last_log_term = 388662
	last_snapshot_index = 373087177
	last_snapshot_term = 388662
	latest_configuration = [{Suffrage:Voter ID:6cd485d0-1ccc-5049-8105-79abb943105e Address:10.20.209.189:8300} {Suffrage:Voter ID:b254668f-b57e-f763-5f3e-3f5ca38ecbc5 Address:10.20.208.112:8300} {Suffrage:Voter ID:e0e2704e-96c9-589d-c4eb-02f0ecf11bd2 Address:10.20.202.159:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 388662
runtime:
	arch = arm64
	cpu_count = 2
	goroutines = 1478
	max_procs = 2
	os = linux
	version = go1.16.6
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 205547
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 1
	member_time = 647725
	members = 47
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1610
	members = 3
	query_queue = 0
	query_time = 1
```

Operating system and Environment details
This affects both our Amazon Linux 2 VMs and the sidecars injected by consul-k8s v0.25.0 in our Kubernetes clusters.
I'm happy to provide any additional info needed.