hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

RandomSampling on ingress gateway is suppressing distributed tracing #8519

Open Gufran opened 4 years ago

Gufran commented 4 years ago

The ingress gateway listener requires Envoy's x-client-trace-id header to initiate a trace. Without this header, requests are not traced at all.

Reproduction Steps

Create a new ingress gateway with a tracing configuration. This example uses the Datadog tracer:

Ingress Gateway Service Config

```
service {
  name = "igw"
  port = 9999
  kind = "ingress-gateway"

  proxy {
    config {
      envoy_dogstatsd_url = "udp://127.0.0.1:8125"

      envoy_tracing_json = <<-EOF
        {
          "http": {
            "name": "envoy.tracers.datadog",
            "config": {
              "collector_cluster": "datadog_trace_collector",
              "service_name": "igw"
            }
          }
        }
      EOF

      envoy_extra_static_clusters_json = <<-EOF
        {
          "name": "datadog_trace_collector",
          "type": "STATIC",
          "connect_timeout": "1s",
          "upstream_connection_options": {
            "tcp_keepalive": {}
          },
          "load_assignment": {
            "cluster_name": "datadog_trace_collector",
            "endpoints": [
              {
                "lb_endpoints": [
                  {
                    "endpoint": {
                      "address": {
                        "socket_address": {
                          "address": "127.0.0.1",
                          "port_value": 8126
                        }
                      }
                    }
                  }
                ]
              }
            ]
          }
        }
      EOF
    }
  }
}
```

Deploy a Connect-enabled service with a similar tracing configuration and perform HTTP requests against the gateway to initiate tracing.
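For reference, a request through the gateway might look like the sketch below. The port (9999) comes from the service config above, but the Host value and target service are hypothetical, since the ingress-gateway config entry that defines the listeners is not shown here:

```
# Plain request through the ingress gateway; substitute whatever Host your
# ingress-gateway config entry routes on. As described above, without a
# trace header this request never starts a trace.
curl -sS -H 'Host: web.ingress.consul' http://<gateway-address>:9999/
```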

Consul info for both Client and Server

Client info

```
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 5
        services = 3
build:
        prerelease =
        revision = 3111cb8c
        version = 1.8.0
consul:
        acl = disabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 24350
        max_procs = 2
        os = linux
        version = go1.14.4
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 49
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 3175
        members = 34
        query_queue = 0
        query_time = 1
```
Server info

```
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 3111cb8c
        version = 1.8.0
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.101.3.188:8300
        server = true
raft:
        applied_index = 708336420
        commit_index = 708336420
        fsm_pending = 0
        last_contact = 57.987616ms
        last_log_index = 708336420
        last_log_term = 95
        last_snapshot_index = 708323646
        last_snapshot_term = 95
        latest_configuration = [{Suffrage:Voter ID:04b87f04-ce07-0976-b4de-b29a3613b21a Address:10.101.4.10:8300} {Suffrage:Voter ID:daa06ae3-ea15-6b9e-791c-da38a2b66572 Address:10.101.3.188:8300} {Suffrage:Voter ID:60d8cc9c-b651-5e4f-1425-cfce296456e9 Address:10.101.4.50:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 95
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 6536
        max_procs = 2
        os = linux
        version = go1.14.4
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 49
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 3175
        members = 34
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 211
        members = 3
        query_queue = 0
        query_time = 1
```

Operating system and Environment details

AmazonLinux 2

Log Fragments

None


The problem seems to be at https://github.com/hashicorp/consul/blob/v1.8.0/agent/xds/listeners.go#L933, which suppresses traces at the listener level. I built a binary without this RandomSampling setting and tracing started working again.
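For context, that line sets the sampling policy on the tracing block of the generated HTTP connection manager. The effect is roughly the shape sketched below; this is an illustration of the resulting Envoy listener tracing stanza, not a verbatim copy of Consul's output:

```
"tracing": {
  "random_sampling": { "value": 0 }
}
```

With random_sampling at 0, Envoy only samples requests that already carry x-client-trace-id (governed by client_sampling, which defaults to 100), which matches the behavior described above.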

It would be helpful to control sampling through configuration options. We'd like to initiate tracing for every request that lands on the internet-facing listener and then let the destination service decide whether tracing should continue.

pvyaka01 commented 4 years ago

I'm having a similar problem with tracing and opened an issue a couple of days back. Glad you probably found the cause! I hope this gets fixed, as both tracing and access logs are important for high-volume transactions. I'm not sure why access_log is set to /dev/null and we have to resort to injecting a whole new filter chain using envoy_public_listener_json. I was hoping we could at least get tracing to work, and perhaps what you found can be addressed soon.
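As an aside on the access_log point: when the listener is overridden via envoy_public_listener_json, the connection manager's access log can be pointed at stdout with Envoy's standard file access logger. The fragment below is a hedged sketch of just that piece; the exact typed_config type URL depends on the Envoy API version in use, and the surrounding listener and filter-chain JSON is omitted:

```
"access_log": [
  {
    "name": "envoy.access_loggers.file",
    "typed_config": {
      "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
      "path": "/dev/stdout"
    }
  }
]
```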

Gufran commented 4 years ago

We're running the patched Consul binary in our staging environment and I can confirm that removing the random_sampling directive helped. I don't know its full impact in other areas, otherwise I would have proposed a PR.

pvyaka01 commented 4 years ago

Consul team, any ideas on this one? Thanks!

pvyaka01 commented 4 years ago

I guess not many folks are using tracing at this point. Haven't heard any suggestions yet on this one.

dsouzajude commented 4 years ago

Just curious how you registered the gateway with Consul. Was it through the following command?

```
consul connect envoy -gateway=ingress -register -service ingress-service -address '{{ GetInterfaceIP "eth0" }}:8888'
```

... like it's mentioned in this tutorial? The approach you have looks different, so could you comment on that?

I'm asking because I have also been trying to get tracing to work. I'm adding the tracing config in proxy-defaults, which seems a bit hackish, but at least it works the way I want!

pvyaka01 commented 4 years ago

@dsouzajude - yes, using the command you mentioned. It would be great to find out what you've added in proxy-defaults so we can check. Does Jaeger show tracing data from the proxy? Thanks

blake commented 4 years ago

Hi @pvyaka01, at the moment ingresses will not initiate a trace, but will propagate headers if they are received from a downstream caller. This was an intentional decision, hence the comment in consul/blob/v1.8.0/agent/xds/listeners.go.

Don't trace any requests by default unless the client application explicitly propagates trace headers that indicate this should be sampled.

This would explain the behavior you've described in #8503. I've marked this issue as an enhancement / feature request to make this parameter configurable so that proxies can be configured to initiate a trace.
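In other words, a downstream caller has to opt in explicitly for the ingress to trace a request. A hedged example against the gateway from the reproduction steps above (the Host value and trace ID are arbitrary placeholders):

```
# Sending x-client-trace-id tells Envoy the client wants this request sampled,
# so the gateway records a span and propagates the trace context upstream.
curl -sS \
  -H 'Host: web.ingress.consul' \
  -H 'x-client-trace-id: 9a3d5c1e-2f64-4b7a-8c11-0d2e6f8b4a55' \
  http://<gateway-address>:9999/
```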

pvyaka01 commented 4 years ago

Ok, thank you!

dsouzajude commented 4 years ago

@pvyaka01 Here is a look at my proxy-defaults config. Note that the envoy_tracing_json field enables tracing at the proxy-defaults level; for now this is just a "hack" until it's possible to enable tracing specifically at the ingress gateway level. With this config, tracing somehow got enabled on the ingress gateway and it shows up as an object in my trace.

On the client, I curl the ingress gateway without passing any trace header information. Since I'm using AWS X-Ray for tracing, I think X-Ray adds the trace headers if they're not present, but I'm not 100% sure.

[Screenshot: trace]
proxy-defaults config

```terraform
Kind = "proxy-defaults"
Name = "global"
Config {
  local_connect_timeout_ms = 1000
  handshake_timeout_ms = 10000
  protocol = "http"
  bind_address = "0.0.0.0"
  bind_port = 21000
  envoy_stats_flush_interval = "60s"
  envoy_extra_static_clusters_json = <
```
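The config above is cut off right at the JSON escape hatches, so for readers trying to reproduce the approach, here is a hedged, illustrative sketch of a proxy-defaults entry with tracing enabled. It reuses the Datadog tracer shape from the issue description rather than the X-Ray setup described here, and the cluster name and service name are assumptions:

```terraform
Kind = "proxy-defaults"
Name = "global"
Config {
  protocol = "http"

  # Illustrative only: points Envoy's Datadog tracer at a static cluster named
  # "datadog_trace_collector"; that cluster would be defined separately via
  # envoy_extra_static_clusters_json, as in the issue description.
  envoy_tracing_json = <<-EOF
    {
      "http": {
        "name": "envoy.tracers.datadog",
        "config": {
          "collector_cluster": "datadog_trace_collector",
          "service_name": "proxy"
        }
      }
    }
  EOF
}
```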
Gufran commented 4 years ago

> Hi @pvyaka01, at the moment ingresses will not initiate a trace, but will propagate headers if they are received from a downstream caller. This was an intentional decision, hence the comment in consul/blob/v1.8.0/agent/xds/listeners.go.
>
> Don't trace any requests by default unless the client application explicitly propagates trace headers that indicate this should be sampled.
>
> This would explain the behavior you've described in #8503. I've marked this issue an enhancement / feature request to make this parameter configurable so that proxies can be configured to initiate a trace.

I've managed to put together a draft PR in #8714, and I just want to make sure I'm not stepping on anyone's toes before putting more work into it. @blake is it possible for you to disclose any progress you guys have made on it internally? If nobody else is working on it then I'd be happy to continue my work on #8714 with some initial design review.

blake commented 4 years ago

Hi @Gufran, our team has not yet started working on this so we appreciate you contributing a PR. We will try to have someone review it soon. Thanks again.

pvyaka01 commented 3 years ago

Any updates on this request, please?

Gufran commented 3 years ago

@pvyaka01 I have some changes in #8714 to address this. That PR is waiting on a design review right now.

pvyaka01 commented 3 years ago

@dsouzajude - I'm not sure which version of Consul you're using. I am using "Consul v1.9.0-beta2". For the life of me, I cannot get envoy_public_listener_json to work for the ingress gateway. Perhaps I'm not doing something right. I took your proxy-defaults as-is and started up the ingress, but no luck.

I can see the tracing and extra_static_clusters settings show up in the Envoy config, but the public_listener override does not.
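For anyone checking the same thing, the running Envoy's admin config dump shows what actually got applied. A hedged sketch, assuming Envoy's admin interface is on 127.0.0.1:19000 (the default used by consul connect envoy) and jq is installed:

```
# List the sections of the live Envoy config (bootstrap, clusters, listeners, ...).
curl -s http://127.0.0.1:19000/config_dump | jq '.configs[] | .["@type"]'

# Inspect the listeners section to see whether a custom listener override was applied.
curl -s http://127.0.0.1:19000/config_dump | jq '.configs[] | select(.["@type"] | test("Listeners"))'
```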