jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

How sampler.type=remote works #832

Closed xihw closed 6 years ago

xihw commented 6 years ago

" Remote (sampler.type=remote, which is also the default) sampler consults Jaeger agent for the appropriate sampling strategy to use in the current service. This allows controlling the sampling strategies in the services from a central configuration in Jaeger backend, or even dynamically (see Adaptive Sampling). "

This is excerpted from the Jaeger docs, and I find it confusing. Can you help me understand it by answering the following questions?

  1. "consults Jaeger agent for the appropriate sampling strategy" -- As far as I know, there are two places to configure the sampling rate: jaeger-client and jaeger-collector. What role does the Jaeger agent play here?

  2. "This allows controlling the sampling strategies in the services from a central configuration in Jaeger backend" -- Does "a central configuration in Jaeger backend" mean jaeger-collector?

  3. What if we use zipkin-client + Jaeger backend (jaeger-collector + jaeger-ui + storage)? In this case we don't have jaeger-agent running, so how does the "remote consulting" work?

  4. A follow-up question on 3. Without jaeger-agent, how is batching handled? Based on Zipkin's doc (https://zipkin.io/pages/architecture.html), I interpret "Transport" as the equivalent of jaeger-agent. In the zipkin-client + Jaeger backend scenario, do we discard "Transport"?

yurishkuro commented 6 years ago
  1. agent proxies the requests to the collector, so that the client does not need to know where collectors are located (agent is usually on the localhost)
  2. yes, configuration (and soon adaptive calculations) come from the collectors, but clients receive them via agent
  3. remotely controlled samplers are only supported by Jaeger clients, not Zipkin clients.
  4. Not sure which "batch" you are referring to.
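
The proxying described in answers 1 and 2 can be pictured with a small Python sketch. Every class and method name here is hypothetical and chosen only for illustration; none of it is Jaeger's actual API. The point is only that the client talks to a local agent, and the agent forwards the strategy request to a collector whose address the client never needs to know.

```python
class Collector:
    """Hypothetical stand-in for the backend that owns the sampling config."""

    def __init__(self, strategies: dict, default: dict):
        self._strategies = strategies
        self._default = default

    def get_sampling_strategy(self, service: str) -> dict:
        return self._strategies.get(service, self._default)


class Agent:
    """Hypothetical local proxy: it holds no strategy of its own."""

    def __init__(self, collector: Collector):
        self._collector = collector

    def get_sampling_strategy(self, service: str) -> dict:
        # Forward the request; only the agent knows where the collector is.
        return self._collector.get_sampling_strategy(service)


class Client:
    """Hypothetical tracer-side component that only knows the local agent."""

    def __init__(self, service: str, agent: Agent):
        self._service = service
        self._agent = agent

    def pull_strategy(self) -> dict:
        return self._agent.get_sampling_strategy(self._service)


collector = Collector(
    strategies={"checkout": {"type": "probabilistic", "param": 0.5}},
    default={"type": "probabilistic", "param": 0.2},
)
agent = Agent(collector)
checkout_strategy = Client("checkout", agent).pull_strategy()
other_strategy = Client("unknown-service", agent).pull_strategy()
```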
xihw commented 6 years ago
  2. Okay, based on the two configurations described at https://www.jaegertracing.io/docs/sampling/#ClientSamplingConfiguration and https://www.jaegertracing.io/docs/sampling/#CollectorSamplingConfiguration, and considering the following scenarios, can you tell me what percentage of spans is reported from the service to the agent, what percentage is sent from the agent to the collector, and what percentage is saved to storage by the collector?

a. ClientSamplingConfiguration says probabilistic 0.1, and CollectorSamplingConfiguration says probabilistic 0.2

b. ClientSamplingConfiguration says remote, and CollectorSamplingConfiguration says probabilistic 0.2

  4. By "batch" I mean sending spans to the collector in batches to avoid heavy traffic. That is my understanding of the description here: https://www.jaegertracing.io/docs/architecture/ (search 'batch'). With a Zipkin client + Jaeger backend, I don't think we have a mechanism to do the batching, and that's my concern.
black-adder commented 6 years ago

2) Any span that is reported by the service will be persisted, i.e. the decision is made once. In your example (a), the ClientSamplingConfiguration will be used instead of the CollectorSamplingConfiguration, so the sampling probability will be 0.1. If you instead use sampler.type=remote in the ClientSamplingConfiguration, as in (b), then the client will use the CollectorSamplingConfiguration of 0.2. (The client MUST be configured with sampler.type=remote in order to receive sampling rates from the collector; otherwise it will use the sampling rate provided by the service owner.)
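
The precedence rule in this answer can be written down as a short Python sketch. The `Strategy` type and `effective_strategy` function are hypothetical names introduced for illustration, not part of any Jaeger client; the sketch only encodes the rule stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    type: str     # e.g. "probabilistic", "const", or "remote"
    param: float  # meaning depends on type; a probability for "probabilistic"


def effective_strategy(client: Strategy, collector: Strategy) -> Strategy:
    """Return the strategy the client actually applies.

    Only a client configured with type "remote" takes the collector's
    strategy; otherwise the service owner's local setting wins.
    """
    if client.type == "remote":
        return collector
    return client


# Scenario (a): client says probabilistic 0.1, collector says probabilistic 0.2.
a = effective_strategy(Strategy("probabilistic", 0.1), Strategy("probabilistic", 0.2))
# Scenario (b): client says remote, collector says probabilistic 0.2.
b = effective_strategy(Strategy("remote", 1.0), Strategy("probabilistic", 0.2))
print(a.param, b.param)  # 0.1 0.2
```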

black-adder commented 6 years ago
  4. The Jaeger clients are designed to always batch spans before sending them. If no jaeger-agent is present, the Go and Java Jaeger clients can be configured to send batched spans over HTTP.
xihw commented 6 years ago

"Any span that is reported by the service will be persisted"

You mean persisted into storage?

"In your example, the ClientSamplingConfiguration will be used instead of the CollectorSamplingConfiguration so the sampling probability will be 0.1"

Sorry, I'm still confused about where the 0.1 is applied: service -> agent, agent -> collector, or collector -> storage? Or all of them (in which case 0.1 * 0.1 * 0.1 of the spans would finally be stored in the DB, right)?

@black-adder

black-adder commented 6 years ago

The sampling rate is only used at the service, 0.1 of traces will be stored in the DB.
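
A rough simulation makes the "decided once, at the service" point concrete. This is a simplified sketch, not Jaeger's implementation: the ID range, the boundary comparison, and the function names are assumptions chosen for illustration. The only sampling decision is in the service; everything reported afterwards is assumed to be stored, so the stored fraction is about 0.1, not 0.1 * 0.1 * 0.1.

```python
import random

MAX_ID = (1 << 63) - 1  # assumption: trace IDs are drawn from [0, MAX_ID]


def is_sampled(trace_id: int, rate: float) -> bool:
    """Head-based probabilistic decision derived from the trace ID."""
    boundary = int(rate * MAX_ID)
    return (trace_id & MAX_ID) <= boundary


def simulate(num_traces: int, rate: float, seed: int = 42) -> float:
    """Return the fraction of traces that end up stored."""
    rng = random.Random(seed)
    stored = 0
    for _ in range(num_traces):
        trace_id = rng.getrandbits(63)
        # The only sampling decision in the pipeline happens here, in the service.
        if is_sampled(trace_id, rate):
            # Agent and collector forward everything they receive,
            # so a reported trace is a stored trace.
            stored += 1
    return stored / num_traces


fraction = simulate(100_000, 0.1)
print(f"stored fraction: {fraction:.3f}")  # close to 0.1
```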

xihw commented 6 years ago

OK, so sampling only happens in the service, before spans are sent out. One more question:

"configuration (and soon adaptive calculations) come from the collectors, but clients receive them via agent"

What is the flow? From the service's standpoint, is it pull or push? And when does it happen: once when the service starts, or periodically?

black-adder commented 6 years ago

Services pull from the agent every minute; this is configurable: https://github.com/jaegertracing/jaeger-client-go/blob/master/config/config.go#L86

We haven't done this yet but I've always wanted to do push. It's on my personal road map.
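
As an illustration of the pull model, here is a minimal Python sketch of a periodic poller. `StrategyPoller` and `fake_fetch` are hypothetical names, and the transport is replaced by an injected function so the sketch stays self-contained. Whether a real client also pulls immediately at startup is not stated in this thread; the initial pull below is an assumption of the sketch.

```python
import threading
import time
from typing import Callable, Optional


class StrategyPoller:
    """Hypothetical poller: calls `fetch` every `interval` seconds."""

    def __init__(self, fetch: Callable[[], dict], interval: float = 60.0):
        self._fetch = fetch
        self._interval = interval
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.latest: Optional[dict] = None

    def poll_once(self) -> None:
        self.latest = self._fetch()

    def _run(self) -> None:
        # Event.wait returns False on timeout (poll again) and True once stopped.
        while not self._stop.wait(self._interval):
            self.poll_once()

    def start(self) -> None:
        self.poll_once()  # assumption: one pull right away, then periodic pulls
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


calls = []


def fake_fetch() -> dict:
    """Stand-in for the request a client would send to the agent."""
    calls.append(1)
    return {"type": "probabilistic", "param": 0.2}


# A very short interval so the demo finishes quickly; the thread says one minute.
poller = StrategyPoller(fake_fetch, interval=0.01)
poller.start()
time.sleep(0.2)
poller.stop()
```

A push model, as mentioned above, would invert this: the backend would notify clients on change instead of clients asking on a timer.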

xihw commented 6 years ago

Can you also help me understand another question, about sampling propagation: will a service generate a span for an incoming request before deciding whether or not to sample it? Will an unsampled span be propagated between services?

black-adder commented 6 years ago

Sampling and generation of a span happens roughly at the same time. Context is always propagated between services (even if unsampled).

xihw commented 6 years ago

If service B receives a request with a context saying something like {"span_a", "unsampled"}, B will still create a span as a child of "span_a" and continue propagating the context, but won't report the span. Is that correct?

black-adder commented 6 years ago

yes
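
The behavior confirmed here can be sketched in Python. The header layout follows Jaeger's `uber-trace-id` format (`{trace-id}:{span-id}:{parent-span-id}:{flags}`, hex-encoded, with bit 0x01 of the flags meaning "sampled"); the `SpanContext` class, `child_of`, and `finish` are hypothetical helpers written for this example, not a client API.

```python
import random
from dataclasses import dataclass

SAMPLED_FLAG = 0x01


@dataclass(frozen=True)
class SpanContext:
    trace_id: int
    span_id: int
    parent_id: int
    flags: int

    @property
    def sampled(self) -> bool:
        return bool(self.flags & SAMPLED_FLAG)

    def to_header(self) -> str:
        return f"{self.trace_id:x}:{self.span_id:x}:{self.parent_id:x}:{self.flags:x}"

    @classmethod
    def from_header(cls, value: str) -> "SpanContext":
        trace_id, span_id, parent_id, flags = (int(p, 16) for p in value.split(":"))
        return cls(trace_id, span_id, parent_id, flags)


def child_of(parent: SpanContext) -> SpanContext:
    """Create a child context; the sampling decision is inherited, not remade."""
    new_span_id = random.getrandbits(63) or 1
    return SpanContext(parent.trace_id, new_span_id, parent.span_id, parent.flags)


reported = []


def finish(ctx: SpanContext) -> None:
    """Report the span only if its trace was sampled."""
    if ctx.sampled:
        reported.append(ctx)


# Service B receives an unsampled context from service A (flags = 0).
incoming = SpanContext.from_header("abc123:1:0:0")
child = child_of(incoming)
finish(child)                 # nothing is reported
outgoing = child.to_header()  # but the context still travels downstream
```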

xihw commented 6 years ago

OK, so does that mean it's possible to trace every request by putting its span info into the log, even though we do sampling? If so, do you have any resource showing how to do that?

black-adder commented 6 years ago

I'm not sure I understand the question. Are you asking if span logs are always persisted even if we do sampling?

xihw commented 6 years ago

I'm asking whether it is possible to use a logging mechanism like MDC (http://www.baeldung.com/mdc-in-log4j-2-logback) to log the trace ID for every single request, even if we do sampling.

black-adder commented 6 years ago

Yes, you can log the trace ID for every request, but since you're sampling, some logs will have trace IDs without a persisted trace.

black-adder commented 6 years ago

This is a golang example: https://github.com/jaegertracing/jaeger/blob/master/examples/hotrod/pkg/log/spanlogger.go

However, here we're doing more than just logging the trace ID: we're dual logging to both the log reporter and the span.
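
For the simpler case of only attaching the trace ID to every log line, here is a minimal Python sketch that plays the role MDC plays in Java. It uses only the standard library; `current_trace_id`, `TraceIdFilter`, and `handle_request` are hypothetical names, and in a real service the trace ID would come from the active span context rather than a function argument.

```python
import contextvars
import io
import logging

# Hypothetical holder for the current request's trace ID (the MDC analogue).
current_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # captured here only so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("request-log-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(trace_id: str, sampled: bool) -> None:
    token = current_trace_id.set(trace_id)
    try:
        # Logged for every request, whether or not the trace is sampled.
        logger.info("handling request (sampled=%s)", sampled)
    finally:
        current_trace_id.reset(token)


handle_request("abc123", sampled=True)
handle_request("def456", sampled=False)
print(stream.getvalue(), end="")
```

As noted above, the second log line carries a trace ID for which no trace was persisted.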

black-adder commented 6 years ago

closing issue, feel free to open if you have more questions

ecourreges-orange commented 5 years ago
"1. agent proxies the requests to the collector, so that the client does not need to know where collectors are located (agent is usually on the localhost)"

The agent proxies the config request to the collector through which connection? TChannel or gRPC, whichever one is connected? These docs don't really explain which path the sampling config travels over: https://www.jaegertracing.io/docs/1.14/getting-started/#all-in-one https://www.jaegertracing.io/docs/1.14/deployment/#collectors Also, it would be a nice improvement to document which connections can be encrypted or are encrypted by default. This page does not detail which protocol between which components has encryption support: https://github.com/jaegertracing/jaeger/issues/458

Thank you.

yurishkuro commented 5 years ago

"The agent proxies the config request to the collector through which connection? The TChannel or gRPC, whichever one is connected?"

whichever one you configure on the agent. We recommend gRPC.

"Also it would be a nice improvement to know which are encryptable or encrypted by default,"

See #1718

"These docs don't really explain where the sampling config is sent through:"

Can you elaborate on what can be improved in the docs? If you're using the remote sampler, then the sampling configuration is defined in the collectors and is pulled by the clients periodically: client <- agent <- collector.

ecourreges-orange commented 5 years ago

Thanks, now it's all coming together from the information in the different referenced GitHub issues.

"Can you elaborate what can be improved in the docs? If you're using remote sampler, then the sampling configuration is defined in the collectors, and is pulled by the clients periodically client<-agent<-collector"

For improvements to the doc, here are ideas:

Thanks.

vamshi67 commented 4 years ago

Does 'remote' sampling work with http-sender? In my aks cluster setup, I haven't configured 'jaeger-agent'.

yurishkuro commented 4 years ago

Sampler has nothing to do with Sender, it's an independent component. It can work with both the agent and the collector.

vamshi67 commented 4 years ago

Thanks Yuri for the quick response. I really appreciate your help on this.

I'm using the Jaeger K8s operator and have the following sampling strategy in the ConfigMap:

apiVersion: v1
data:
  sampling: '{"default_strategy":{"operation_strategies":[{"operation":"/health","param":0,"type":"probabilistic"},{"operation":"/metrics","param":0,"type":"probabilistic"}],"param":0.1,"type":"probabilistic"}}'
kind: ConfigMap
metadata:
  creationTimestamp: "2020-06-15T23:42:14Z"
  labels:
    app: jaeger
    app.kubernetes.io/component: sampling-configuration
    app.kubernetes.io/instance: jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: jaeger-sampling-configuration
    app.kubernetes.io/part-of: jaeger
  name: jaeger-sampling-configuration
  namespace: monitoring

Note: we're using the monitoring namespace instead of observability.

Client application has following properties:

sampler.type=const
sampler.sampling-rate=1

Since these properties are defined in the application's properties file, I'm overriding them using k8s environment variables. I have set sampler.type to remote. As I don't know what value sampling-rate should have when sampler.type is set to remote, I set it to 1.

With this, when I created the pod, every trace is being collected. I'm not sure why it is not honoring the remote configuration.

Am I missing anything?
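
For reference, the intent of the strategy JSON in the ConfigMap above can be read off with a short Python sketch: per-operation entries override the default rate for matching operations, and everything else gets the default. The `rate_for` helper and the `/orders` operation are hypothetical, added only to illustrate the lookup; this is not the collector's code.

```python
import json

# The `sampling` value from the ConfigMap above.
SAMPLING_JSON = (
    '{"default_strategy":{"operation_strategies":['
    '{"operation":"/health","param":0,"type":"probabilistic"},'
    '{"operation":"/metrics","param":0,"type":"probabilistic"}],'
    '"param":0.1,"type":"probabilistic"}}'
)


def rate_for(operation: str, config: dict) -> float:
    """Return the probabilistic rate that applies to `operation`."""
    default = config["default_strategy"]
    for op in default.get("operation_strategies", []):
        if op["operation"] == operation:
            return float(op["param"])
    return float(default["param"])


config = json.loads(SAMPLING_JSON)
print(rate_for("/health", config))  # 0.0
print(rate_for("/orders", config))  # 0.1
```

With this configuration, a client that successfully pulls the remote strategy would sample roughly 10% of traces and none for `/health` or `/metrics`.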

yurishkuro commented 4 years ago

The numeric value of 1 is treated as 100% default probability when the sampler cannot contact the backend. It's possible that in your deployment it cannot reach the backend and never gets the 0.1 probability. The sampler emits metrics about unsuccessful configuration pulls.
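
The fallback described here can be sketched as follows. `RemoteSamplerSketch` and its fields are hypothetical names for illustration, and the failure counter merely stands in for the metrics mentioned above. The idea is that the configured numeric value acts as the initial probability and stays in effect until a pull from the backend succeeds.

```python
from typing import Callable


class RemoteSamplerSketch:
    """Hypothetical remote sampler with an initial fallback rate."""

    def __init__(self, fetch_rate: Callable[[], float], initial_rate: float):
        self._fetch_rate = fetch_rate
        self.rate = initial_rate  # used until a pull succeeds
        self.failed_pulls = 0     # stand-in for a "failed pulls" metric

    def update(self) -> None:
        try:
            self.rate = self._fetch_rate()
        except OSError:
            # Backend unreachable: keep the current rate, count the failure.
            self.failed_pulls += 1


def unreachable() -> float:
    raise ConnectionError("backend unreachable")  # ConnectionError is an OSError


# Backend cannot be reached: the configured value 1 stays in effect (100%).
stuck = RemoteSamplerSketch(unreachable, initial_rate=1.0)
stuck.update()
print(stuck.rate, stuck.failed_pulls)  # 1.0 1

# Backend reachable: the remote 0.1 replaces the initial value.
healthy = RemoteSamplerSketch(lambda: 0.1, initial_rate=1.0)
healthy.update()
print(healthy.rate, healthy.failed_pulls)  # 0.1 0
```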

tcoln commented 4 years ago
"1. agent proxies the requests to the collector, so that the client does not need to know where collectors are located (agent is usually on the localhost)

2. yes, configuration (and soon adaptive calculations) come from the collectors, but clients receive them via agent

3. remotely controlled samplers are only supported by Jaeger clients, not Zipkin clients.

4. Not sure which 'batch' you are referring to."

Dear Yuri, I have a question: if remotely controlled samplers are only supported via the agent, and the agent pulls config via gRPC + protobuf, then what is sampling.thrift for?

yurishkuro commented 4 years ago

Previously the agent was using Thrift to retrieve sampling strategies from the collector. Now it uses protobuf, but the clients consume sampling strategies as JSON, and that JSON is still generated from Thrift.

tcoln commented 4 years ago

"Previously agent was using Thrift to retrieve sampling from collector. Now it uses protobuf, but the clients consume sampling as JSON, and that JSON is still generated from Thrift."

You mean the sampling strategies were previously sent from the collector to agents via Thrift, but are now sent via protobuf + gRPC? I know the client gets the sampling JSON over HTTP on port 5778, so what I care about is how the collector sends them to the agent.

yurishkuro commented 4 years ago

Collector to agent is gRPC.

tcoln commented 4 years ago

"collector to agent is grpc"

Thanks, yuri.

sharninder commented 3 years ago

"The numeric value of 1 is treated as 100% default probability when the sampler cannot contact the backend. It's possible that in your deployment it cannot reach the backend and never gets the 0.1 probability. The sampler emits metrics about unsuccessful configuration pulls."

I'm not sure this is completely correct, or else there is a bug in this code path. I'm setting the sampler type to remote and leaving the param unset, yet the param value is being set to 1 by default even when the remote actually has a param of 0.5. Seems like a bug to me.