Open valarauca opened 4 years ago
Hi @valarauca,
Would you mind clarifying a few questions about your configuration?
- Configure the cluster to add a service definition for a service running on a host REMOTE to the 3 consul server hosts.
Are you configuring the exact same service definition (i.e., same instance IP) on each of the 3 servers?
Normally service definitions are created on a specific agent (Consul client) where the destination service exists. It sounds like you may be configuring things a bit differently, which is why I'm asking for the clarification.
- On another host REMOTE from the consul server hosts, configure it (as a `consul connect proxy`) as an egress endpoint to direct traffic to the service definition.
The built-in proxy (`consul connect proxy`) is only meant for dev/test workloads. We recommend using Envoy for production deployments.
Would you mind testing your setup using Envoy as the proxy, and seeing if Consul still exhibits the same behavior?
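For reference, this is roughly what a conventional Connect registration looks like when placed on the client agent running alongside the destination service; a minimal sketch, with hypothetical service name, port, and check:

```json
{
  "service": {
    "name": "web",
    "port": 8080,
    "connect": {
      "sidecar_service": {}
    },
    "check": {
      "name": "web-tcp",
      "tcp": "localhost:8080",
      "interval": "10s"
    }
  }
}
```

The sidecar is then started on that same host with something like `consul connect proxy -sidecar-for web`.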
> Are you configuring the exact same service definition (i.e., same instance IP) on each of the 3 servers?
Not exactly the same; there are minor differences in a few fields (`server_name`, `instance_ip`, `cert` paths). But as far as service definitions, check definitions, supported ciphers, cert rotation periods, listening ports, etc., the configs are identical.
I'll post follow-up comments with sample configs; including them all here would violate GitHub's comment size limit.
> Normally service definitions are created on a specific agent (Consul client) where the destination service exists. It sounds like you may be configuring things a bit differently, which is why I'm asking for the clarification.
I guess we are.
In most of our topologies the majority of our services (and proxies) are not co-located on the same (host|server|virtual_machine|node) as any of the `consul server`(s) in question. So it seemed logical that every server should have some knowledge of them.
> Would you mind testing your setup using Envoy as the proxy, and seeing if Consul still exhibits the same behavior?
We actually selected `consul` primarily because it didn't require `envoy`. Internally we support several older versions of RHEL and CentOS whose default glibc (pre-v2.18) doesn't export the correct symbols for `envoy` to compile & run. It lacks full CXX11 compatibility.
I'm honestly surprised to hear this because the existing documentation doesn't state `consul connect proxy` is a second-class citizen, nor does it discourage people from using it. Almost all existing tutorials directed us to use the built-in proxy.
Sample Server Configuration
The 3rd server's configuration is more of the same.
Thank you for sharing these additional details about your setup & Consul configuration.
I didn't see any example service definitions in the configs you provided. Would you mind sharing one of those as well?
> We actually selected `consul` primarily because it didn't require `envoy`. Internally we support several older versions of RHEL and CentOS whose default glibc (pre-v2.18) doesn't export the correct symbols for `envoy` to compile & run. It lacks full CXX11 compatibility.
Which versions of RHEL & CentOS are you using? Have you tried running a precompiled Envoy binary extracted from the Docker container, or from a source like GetEnvoy.io?
Envoy's docs for building a binary state that it supports Ubuntu 16 and newer. However, I don't see any specific mention of compatibility for RHEL-based distros. I'm genuinely interested in knowing whether there's a compatibility issue with that distro family.
> I'm honestly surprised to hear this because the existing documentation doesn't state `consul connect proxy` is a second-class citizen, nor does it discourage people from using it. Almost all existing tutorials directed us to use the built-in proxy.
The first sentence on the built-in proxy page says, "Consul comes with a built-in L4 proxy for testing and development with Consul Connect." This snippet was added about 10 months ago in commit 9915e22. It may not have been present when you first reviewed the docs, or perhaps may not be as visible as it should be. Regardless, my apologies. I'll see about updating this to make it more visible.
Thanks for getting back to me so quickly.
> Which versions of RHEL & CentOS are you using?
Internally we have clusters as old as RHEL6.10, but primarily RHEL7.7 (and older).
Unfortunately the Ubuntu-based extraction doesn't work. I have attempted it, but it makes no difference. The issue is Ubuntu 16.04 ships with glibc v2.23, while `envoy` requires a minimum version of v2.18. This is related to how `envoy`, or more particularly `libstdc++`, handles thread-local storage in C++11. This Google group discusses it. The new thread-local support wasn't added until RHEL7.8.
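In other words, the gate here is just a version comparison; a minimal sketch, taking v2.18 as the assumed minimum from the ABI discussion above:

```python
# Minimal sketch: check an installed glibc version against the assumed
# v2.18 minimum (the CXX11 thread-local ABI boundary discussed above).
def glibc_at_least(installed: str, required: str = "2.18") -> bool:
    def as_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return as_tuple(installed) >= as_tuple(required)

print(glibc_at_least("2.17"))  # RHEL 7.7 -> False
print(glibc_at_least("2.23"))  # Ubuntu 16.04 -> True
```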
> I'm genuinely interested in knowing whether there's a compatibility issue with that distro family.
Kind of.
RHEL7.7 ships glibc v2.17 and lacks the CXX11 ABI compatibility that glibc v2.18 adds. This means unless you statically link glibc into the `envoy` executable (which isn't recommended), or upgrade glibc manually (also not recommended), there aren't many options.
RHEL7.8 was released yesterday (March 31st, 2020), so it isn't that critical. It ships with glibc v2.28, so envoy should work with standard binaries.
But as you may guess, we're not exactly ready to upgrade everything to RHEL7.8.
> It may not have been present when you first reviewed the docs, or perhaps may not be as visible as it should be. Regardless, my apologies. I'll see about updating this to make it more visible.
No worries. Luckily, I attempted to write a fix: https://github.com/hashicorp/consul/pull/7506
> Internally we support several older versions of RHEL and CentOS whose default glibc (pre-v2.18) doesn't export the correct symbols for `envoy` to compile & run. It lacks full CXX11 compatibility.
@valarauca We also had this issue and started a project you might be interested in, as it uses HAProxy instead of Envoy: https://github.com/haproxytech/haproxy-consul-connect
Overview of the Issue
When using consul service mesh, with `consul server` running on 3 hosts, and using `consul connect proxy` to accept incoming (ingress mode) connections on hosts that are NOT `consul server` hosts (e.g.: a remote host), the connections created by `consul connect proxy` (running in egress mode) will fail at an extremely predictable rate by dialing the incorrect host. This adds a staggering number of network errors into the service application.
This issue appears to be caused by:
- `consul connect proxy` (running in egress mode) and its selection of a health record to dial.
- Health records lacking `Service.Address` information (see the `-ttl` check; it contains no `Service.Address` field). This is the information I see returned from `v1/health/connect/${service}`, which I believe is what the `health.Connect` call queries (as `Connect` invokes `service`).
- `consul connect proxy` dialing the endpoint which replied to `v1/health/connect/${service}`. This means the `consul connect proxy` running in egress mode will dial the API endpoint instead of the host service.
The end result being `consul connect proxy` connections will fail at a predictable rate: with 3 `consul server` components I see a failure rate of 25%, for EVERY connection which crosses the consul service mesh (between hosts where the ingress proxy is not colocated with a `consul server` component). This has added an extremely high error rate which has prevented us from running several legacy applications within the consul service mesh.
Reproduction Steps
- Create a cluster with 3 `consul server` hosts.
- Configure the cluster to add a service definition for a service running on a host REMOTE to the 3 `consul server` hosts.
- On another host REMOTE from the `consul server` hosts, configure it (as a `consul connect proxy`) as an egress endpoint to direct traffic to the service definition.
- `service definition` …
- Use `consul operator raft list-peers` to ensure all `consul server`s have joined the quorum.
- Query `v1/health/state/passing` to observe everything is healthy.
- `consul connect proxy` will attempt to dial the `consul server` host, instead of the `consul connect proxy` configured as an ingress host.
NOTE: Setting `"enable_tag_override": true` within the service definition does not affect the problem.
Consul info for both Client and Server
Omitted, as this is not specific to a single cluster or a single client.
I can provide this information if needed, but the cluster is healthy.
Operating system and Environment details
I have replicated this error on:
On the following distros:
On the following topologies:
On consul versions:
Log Fragments
example:
Same error, Different Service:
Same error on a different service.
Question
Is there any way to disable this consul-generated `-ttl` check?
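For clarity, the suspected mechanism above can be sketched in a few lines. This is an assumption about the selection behavior, not Consul's actual source: Consul documents an empty `Service.Address` as "use the node's address", so a health record registered on a `consul server` node with no `Service.Address` would resolve to the server itself; record shapes and addresses below are hypothetical.

```python
# Sketch of the suspected address selection (an assumption, not Consul's
# actual implementation): an empty Service.Address falls back to the
# node's address, so a record registered directly on a consul server
# resolves to the server's own API endpoint.
def dial_target(record: dict) -> str:
    # Consul documents an empty Service.Address as "use the Node address".
    return record["Service"].get("Address") or record["Node"]["Address"]

records = [
    {"Node": {"Address": "10.0.0.9"}, "Service": {"Address": "10.0.0.9"}},  # real instance
    {"Node": {"Address": "10.0.0.1"}, "Service": {"Address": ""}},          # a consul server
]

# If the egress proxy picks uniformly among the returned records, the
# expected failure rate is bad/total -- e.g. one bad record out of four
# would give a 25% error rate.
bad = sum(1 for r in records if not r["Service"].get("Address"))
print(bad / len(records))  # 0.5 for this two-record example
```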