The getting started guide has the proxies and the consul agent run within the same network namespace via the docker flag `--network host`.
In your example your sidecar proxies and consul agent do not share a network namespace, so the default gRPC server address will need to be explicitly configured like you did with the HTTP api address. The gRPC server serves the xDS protocol for envoy.
In your example you would need to specify `-grpc-addr consul:8502` in addition to `-http-addr http://consul:8500`, like:

```
-sidecar-for core-demo-api2 -admin-bind 0.0.0.0:19000 -http-addr http://consul:8500 -grpc-addr consul:8502
```
I picked up on the misconfig because I saw that the `.configs[0].bootstrap.static_resources.clusters[0].hosts[0].socket_address` value in your envoy admin config dump was set to `{ "address": "127.0.0.1", "port_value": 8502 }` instead of the ip address of the consul agent.
Hi @rboyer , thanks for the hint, that seems to have helped.
Can you elaborate on that behavior? Why is the gRPC setting required when not specifying a docker network? I seem to have missed that bit in the docs, or it's not well described.
In a non-containerized setup a consul agent (running in lightweight client mode) and applications using that consul agent communicate over a localhost connection using the HTTP api. This avoids the problem of discovering your service discovery system by assuming you can simply communicate over a prearranged port (`:8500`).
As of v1.3, consul also listens on a gRPC port (defaults to `:8502`) when Connect is enabled. This speaks the xDS protocol specifically for use by the envoy instances. Envoy itself knows nothing of consul, so it cannot use the consul HTTP api.
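For reference, a minimal sketch of turning that listener on in an agent config file (in `-dev` mode it's enabled automatically; key names per Consul 1.3):

```json
{
  "connect": {
    "enabled": true
  },
  "ports": {
    "grpc": 8502
  }
}
```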
The `consul connect envoy` subcommand briefly uses the consul HTTP api on startup before exec-ing the envoy binary, which then strictly communicates with the running consul agent using gRPC from there on out, speaking only xDS. So direct access to both ports is necessary for the sidecar to be set up and run.
The getting started guide uses host networking for simplicity so that the consul agent and all of the envoy instances are free to communicate with each other directly over localhost.
If these are not co-located then the defaults cannot be used and will have to be explicitly configured.
Ok, this makes a ton of sense now.
I've obtained another config dump, the proxy for core-demo-api1 seems to be aware of core-demo-api2 now:
I'm still getting 404s when calling core-demo-api2 from core-demo-api1 through the core-demo-api1-sidecar-proxy though.
Should I use:
`http://core-demo-api1-sidecar-proxy:19000/core-demo-api2/api/values`
or `http://core-demo-api1-sidecar-proxy:19000/api/values` while setting the `host` header to `core-demo-api2`?
Or something completely different?

@mmisztal1980 your client apps should talk to whichever local port the upstream is listening on - unless you set up something I missed, hostnames like `core-demo-api1-sidecar-proxy` aren't available.
For example your service definition for `core-demo-api1` has the following upstream definition:
```json
{
  "destination_name": "code-demo-api2",
  "local_bind_port": 80
}
```
This is saying "Please configure the proxy to listen on `localhost:80` and proxy those connections to the `code-demo-api2` service".
So the proxy will try to do just that but will (probably) collide since your actual service is already listening on port 80.
The idea here is that you pick some port for each upstream that you want to expose the service on over loopback. The port number is arbitrary and the only thing that cares about it is your application.
For example if you changed that definition to:

```json
{
  "destination_name": "code-demo-api2",
  "local_bind_port": 8080
}
```

Then the proxy would listen on `localhost:8080` (technically `127.0.0.1:8080`) and your app would be able to connect just using `http://127.0.0.1:8080/api/values`.
Note that this is only layer 4 (TCP/TLS) proxying so there are no HTTP paths or routes in the mix. L7 support will come later.
Hope you get that working!
@banks Thanks for clearing that up, now I understand the concept.
I've reconfigured the core-demo-api1 registration:
I've tried to execute a call against my service's proxy; my client keeps throwing exceptions:
I've signed onto the proxy container to examine whether or not the port is being listened on:
```
# netstat -ano
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       Timer
tcp        0      0 127.0.0.11:32977        0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 0.0.0.0:19000           0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 172.19.0.5:49484        91.189.88.161:80        TIME_WAIT   timewait (0.00/0/0)
tcp        0      0 172.19.0.5:44220        172.19.0.3:8502         ESTABLISHED off (0.00/0/0)
udp        0      0 127.0.0.11:55487        0.0.0.0:*                           off (0.00/0/0)
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
```
I don't see `19100` being listed :/

BTW, you've mentioned that `local_bind_port` will listen on the `localhost` interface. I'd prefer `0.0.0.0` in my scenario (multiple docker containers not running in a k8s pod (yet)) - how can I achieve that?
I've just re-read the paragraph on Sidecar Service Registration. The upstreams use `local_bind_port`, while in the Sidecar Service Defaults there's no mention of that parameter, but there is something else listed: `local_service_port` under `proxy`.
Is one transformed into the other? I think the docs are somewhat inconsistent here and don't really offer guidance on how to solve my above question :/
@mmisztal1980 can you clarify how you are running this again?
Port 19000 happens to be the port we chose for the Envoy admin API (which you can't just disable), so if you are using Envoy here then it's not a great choice for your own stuff, and might explain why you are seeing things listening on 19000 but not getting the response you expect.
So I suspect you are not seeing any proxy running at all based on your netstat output. How are you starting the proxy? Is this in Kube?
Can you include the output of the `consul connect envoy` commands?
> BTW, you've mentioned that local_bind_port will listen on the localhost interface. I'd prefer 0.0.0.0 in my scenario (multiple docker containers not running in a k8s pod (yet)) - how can I achieve that?
There is a `local_bind_address` too, but note that anything non-local is insecure in prod (fine for just testing etc.). Anything that can talk to the "private" listeners of the proxy can assume the identity and all access granted to that service, which is why it's typical to expose this only over loopback.
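A sketch of what that could look like in the upstream definition (the port is illustrative, and again: only do this for testing):

```json
{
  "destination_name": "core-demo-api2",
  "local_bind_address": "0.0.0.0",
  "local_bind_port": 19100
}
```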
If you are trying to get this to work with docker compose I recommend using shared network namespaces at least between the app and sidecar containers rather than just exposing them over the docker bridge. This is easy in docker: start up the app first, then start the proxy container using `--network "container:core-demo-api2-container-name"` or the equivalent in Compose. Then the app and the proxy can talk over localhost just like in a kube pod (that is all Kube is doing under the hood).
If the docker-compose file above is still roughly what you're using, I noticed another small plumbing issue.
Because envoy will be listening on 127.0.0.1 (loopback) for exclusive outbound traffic access, the sidecars need to share the same network namespace with your app so it can connect. The way you have them configured above, each of your 5 containers (consul, app1, sidecar1, app2, sidecar2) gets its own networking stack, complete with a local ip address and a personal isolated 127.0.0.1 address.
For example, instead of:
```yaml
services:
  core-demo-api1:
    image: core-demo-api1
    ...
  core-demo-api1-sidecar-proxy:
    image: consul-envoy
    command: "-sidecar-for core-demo-api1 -admin-bind 0.0.0.0:19000 -http-addr http://consul:8500"
    ...
```
You should have something like:
```yaml
services:
  core-demo-api1:
    image: core-demo-api1
    ...
  core-demo-api1-sidecar-proxy:
    image: consul-envoy
    command: "-sidecar-for core-demo-api1 -admin-bind 0.0.0.0:19000 -http-addr http://consul:8500"
    network_mode: "service:core-demo-api1"
    ...
```
And similar plumbing for `core-demo-api2-sidecar-proxy`.
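Mirroring the api1 pair above, that would be roughly:

```yaml
  core-demo-api2:
    image: core-demo-api2
    ...
  core-demo-api2-sidecar-proxy:
    image: consul-envoy
    command: "-sidecar-for core-demo-api2 -admin-bind 0.0.0.0:19000 -http-addr http://consul:8500"
    network_mode: "service:core-demo-api2"
    ...
```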
I'm also interested in knowing more about the lifespan of this POC. Is it mostly for kicking the tires of Connect in conjunction with your development stack or are you also planning on using something like docker compose to deploy a variation of the POC directly to a docker swarm cluster?
My intention is to do a PoC on k8s, however .NET solutions developed under Visual Studio 2017 have a docker-compose integration, so I wanted to get that up and running first. That way I get to debug/test the solution locally before I deploy it to an actual k8s cluster.
In the long run, at my current company, we are working on decomposing a monolith and migrating it to a new hosting platform while retaining 100% uptime. Consul Connect is v. interesting because it can provide a service mesh, and a consistent experience when services communicate with each other, regardless of the hosting platform. That's pretty much the genesis for the PoC.
@rboyer I hope this answers your question?
On a side note, this is also a brilliant opportunity to learn.
Adding `network_mode` has caused the healthcheck between consul and the proxies to fail :/
@rboyer @banks I've added the `network_mode: service:{service-name}` to both proxies. As far as I understand that ought to make them reuse their counterparts' network stacks - so if I understand correctly, both a service container and its counterpart proxy should be able to communicate both ways via localhost.
I've modified my REST call inside `core-demo-api1` to call `http://127.0.0.1:19100` - the port I've declared above for `core-demo-api1-sidecar-proxy` to route to `core-demo-api2`. I'm getting connection refused, even though, when attaching to the proxy container, I can see that `127.0.0.1:19100` is being listened on.
Hey @mmisztal1980, as far as I understand `network_mode` should behave as you expected, so I can't see an obvious reason here for this not working. My guess would be that Envoy is failing to get configured correctly so isn't setting up the upstream listener.
Can you try adding `-- -l DEBUG` to the end of the proxy command string? This is passed through to envoy and puts it into verbose logging mode. Then check the output of the Envoy process - it's possible it's failing to listen or not getting config delivered for some reason.
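In your compose file that would be roughly (a sketch - keep whichever `-http-addr`/`-grpc-addr` flags you're already passing; envoy's log-level flag itself is `-l`):

```yaml
command: "-sidecar-for core-demo-api1 -admin-bind 0.0.0.0:19000 -http-addr http://consul:8500 -grpc-addr consul:8502 -- -l debug"
```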
Hi @banks ,
I've added the debug option (only on core-demo-api1-sidecar-proxy) as you've suggested, here's the output.
What caught my eye: `core-demo-api1-sidecar-proxy_1 | [2018-11-05 17:17:07.038][8][info][config] source/server/configuration_impl.cc:61] loading 0 listener(s)`. I guess this implies a config issue?
Yep, does look like a config issue. Sorry, I forgot the detail, but can you try one more time with `-l trace` - that will show you the full discovery request/response from Envoy, which will help figure out where the misconfiguration is.
Hi @banks , I've applied the setting you've suggested, here's the output.
https://gist.github.com/mmisztal1980/a0e36c0f1d1e277470cf318ceea64d04
@mmisztal1980 thanks.
I think I see the issue and it's potentially a bug combined with a config issue.
I think from that log what is happening is this:

- `core-demo-api1-sidecar-proxy` (envoy1 from now on) is requesting Clusters and being delivered the upstream and the local_app as expected (so all gRPC and agent config is good - it's talking to the agent fine)
- envoy1 then requests Endpoints for `service:core-demo-api2` and the agent apparently blocks on that request and never returns.

So the question is why Consul is not delivering any results for `service:core-demo-api2`. I suspect it's because that service isn't yet registered and passing healthchecks, but here is the (possible) bug: if it doesn't find healthy instances, it should return an empty list so the rest of the proxy can continue to be set up. That wouldn't make this work for you since the proxy still wouldn't know where to connect, but it would at least prevent one unhealthy upstream from delaying proxy bootstrap where all the other upstreams might be fine.

This should solve itself as soon as your upstream service comes up and is healthy though - can you confirm that it does (i.e. look in the Consul UI or query via DNS/HTTP to see that the secondary instance is available)? Since the service has no health checks it should be healthy immediately, and the proxy for it should be too (via its alias check).
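e.g. something like (the `consul` hostname as in your compose setup):

```
curl consul:8500/v1/health/service/core-demo-api2
dig @consul -p 8600 core-demo-api2.service.consul
```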
So the mystery here is: why is that upstream service discovery not working correctly?
I'll try to reproduce with your docker compose file and config later to debug this more.
Thanks for your help digging on this!
@banks The 2nd core-demo-api2 service is up and running and healthy.
Out of interest, what do you get if you curl consul like (replace with actual host/ip etc.):

```
curl -v consul:8500/v1/health/connect/core-demo-api2
curl -v consul:8500/v1/catalog/connect/core-demo-api2
```
Sure here it is,
Hmm, not sure why it causes the Envoy config failure it does, but one issue I see there is that your agent nodes are registering with `127.0.0.1` as their address.
That means the service discovery results for the service in one container are going to return a local IP, which seems wrong - you need the IP of the other container to connect out to it.
Typically you'd have a local Consul agent on each node that is configured to bind to the public IP of that node and then this would work as expected - services by default are advertised at their agent's bind (or advertise) address.
In this setup where you don't have "local" agents on each "node" (i.e. each api container network namespace) you would need to register the services with an explicit `Address` set to their docker IP that other containers can connect to.
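A sketch of what that might look like in your services.json (the IP is illustrative - use whatever the api2 container actually gets on the bridge):

```json
{
  "service": {
    "name": "core-demo-api2",
    "address": "172.19.0.4",
    "port": 80,
    "connect": { "sidecar_service": {} }
  }
}
```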
Alternatively you can get closer to a production simulation by starting an actual consul agent container for each "node" that also shares the api container's namespace. If you do that, the arguments for http-addr etc. shouldn't be needed, as the proxy can connect to its "local" agent just like a non-container setup on multiple hosts would do.
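An untested sketch of that layout in compose (image tag and agent flags assumed):

```yaml
services:
  core-demo-api1:
    image: core-demo-api1
  core-demo-api1-agent:
    image: consul:1.3.0
    command: "agent -retry-join consul -grpc-port 8502"
    network_mode: "service:core-demo-api1"
  core-demo-api1-sidecar-proxy:
    image: consul-envoy
    command: "-sidecar-for core-demo-api1 -admin-bind 0.0.0.0:19000"
    network_mode: "service:core-demo-api1"
```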
When I get a chance I still want to reproduce locally so I can figure out why Envoy hangs part way through config in this case. But let me know if that IP config helps.
Hi @banks , it's been a while. I was wondering if you folks have made any progress investigating this, as I'll be giving it another spin soon (tm)
I'm pretty sure I know why the hanging bug happens but it may only be part of the story here.
The bug is due to this line: https://github.com/hashicorp/consul/blob/c2a30c5fdd6229ce98d6271b58f12582d7dc9dfe/agent/xds/server.go#L349-L352
Basically it assumes that if we didn't get a response yet (since this is all async internally) and part of the config is empty, then we shouldn't bother sending it to the proxy. The problem is that in a case where there are legitimately no instances available (not registered yet or failing health checks), we end up not sending the endpoints at all, which Envoy hangs waiting for.
I think that's an easy fix, but based on your Curl output above I'm not really sure if it's the only issue going on with your setup.
Hey there, We wanted to check in on this request since it has been inactive for at least 60 days. If you think this is still an important issue in the latest version of Consul or its documentation please reply with a comment here which will cause it to stay open for investigation. If there is still no activity on this issue for 30 more days, we will go ahead and close it.
Feel free to check out the community forum as well! Thank you!
Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one :+1:
Hey there,
This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.
If you are still experiencing problems, or still have questions, feel free to open a new one :+1:.
Overview of the Issue
I've created a PoC environment for 2x .NET Core 2.1 Services communicating via Consul-Connect. The entire setup relies on a consul server instance, which uses a services.json file to perform the registrations in a 'static' way. If I understand the process correctly, the sidecar proxies should retrieve their configuration from the consul-server after starting up.
Once the consul server container is healthy, 2x sidecar-proxies start. At this point the entire setup is healthy:
When attempting to have core-demo-api1 call core-demo-api2, I'm getting a 404 response. I've exposed core-demo-api1-sidecar-proxy's port 19000 and obtained a config dump in which I do not see any routes defined to core-demo-api2, which I believe is the root cause for the communication issue between the 2 services. I believe I've followed the available documentation to the letter, so my situation may be a potential bug.
Reproduction Steps
consul-envoy/Dockerfile
services.json
docker-compose.yml
docker-compose.override.yml
Consul info for both Client and Server
Client info
**(!) Executed inside one of the sidecar-proxy containers (!)**

```
# consul info
Error querying agent: Get http://127.0.0.1:8500/v1/agent/self: dial tcp 127.0.0.1:8500: connect: connection refused
```

envoy.yaml
```yaml
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  host_rewrite: www.google.com
                  cluster: service_google
          http_filters:
          - name: envoy.router
  clusters:
  - name: service_google
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    # Comment out the following line to test on v6 networks
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: google.com
        port_value: 443
    tls_context: { sni: www.google.com }
```

Server info
```
# consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease =
	revision = e8757838
	version = 1.3.0
consul:
	bootstrap = true
	known_datacenters = 1
	leader = true
	leader_addr = 127.0.0.1:8300
	server = true
raft:
	applied_index = 1439
	commit_index = 1439
	fsm_pending = 0
	last_contact = 0
	last_log_index = 1439
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:7df0a6a8-3f7a-10f6-d759-e46d1b024aa2 Address:127.0.0.1:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 103
	max_procs = 2
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
```

Operating system and Environment details
Windows 10 Enterprise x64 running Docker for Windows (MobyLinuxVM)
Log Fragments
TBD