open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
2.01k stars 834 forks source link

Attributes to identify services that are using remote sampling #4860

Open MaheshGPai opened 2 years ago

MaheshGPai commented 2 years ago

Is your feature request related to a problem? Please describe. Currently there is no clear way for the collector to identify all the services that are using remote sampling policy - based on Jaeger remote sampling. So if we have to implement an adaptive sampling policy generation at the tracing back-end, one has to monitor root span traffic pattern for all services & root operations and generate the sampling policy for all services. When there is huge number of services & root operations this can be a problematic

Describe the solution you'd like Enhance the JaegerRemoteSampler class to add attributes to indicate that the service is leveraging remote sampling and also add additional attributes like the sampler type & param. In Jaeger this was achieved since the Probabilistic sampler used to add sampler.type & sampler.param attribute/tag - which is missing in all the samplers provided in the SDK. Additionally, can we consider adding sampler.type & sampler.param for all the samplers available in the SD?. Currently I believe only RateLimitingSampler (from jaeger-remote-sampler sdk-extenstion) does that.

jack-berg commented 2 years ago

In Jaeger this was achieved since the Probabilistic sampler used to add sampler.type & sampler.param attribute/tag - which is missing in all the samplers provided in the SDK. Additionally, can we consider adding sampler.type & sampler.param for all the samplers available in the SD?.

Adding attributes to the SamplingResult of the built in SDK samplers would require a change to the sampling specification.

I'm not sure why the RateLimiting Sampler currently includes attributes in its sampling decision. I can't find any references to those attributes in the JaegerRemoteSampler specification.

What specific attributes would you need?

MaheshGPai commented 2 years ago

In case of remote sampling the policy will serve either RateLimiting or Probabilistic sampler. Both these samplers add these tags/attributes for sampled span in Jaeger.

https://github.com/jaegertracing/jaeger-client-java/blob/428cc390c9f9cac633da7077a9a74a68212177dc/jaeger-core/src/main/java/io/jaegertracing/internal/samplers/ProbabilisticSampler.java#L49 https://github.com/jaegertracing/jaeger-client-java/blob/428cc390c9f9cac633da7077a9a74a68212177dc/jaeger-core/src/main/java/io/jaegertracing/internal/samplers/RateLimitingSampler.java#L44

Since RateLimiting sampler was ported from Jaeger it has those attributes added. But this is missing in TraceIdRatioBasedSampler. So instead of making changes in the SDK, would restricting the changes to jaeger remotesampler sdk-extension help? I have raised a PR #4870 for this. Please let me know if this approach is acceptable.

Adaptive Sampler - https://medium.com/jaegertracing/adaptive-sampling-in-jaeger-50f336f4334

We start with some default sampling probability p assigned to every endpoint and a target rate R of traces we want to collect, such as 1 trace per second per endpoint. The collectors monitor the spans passing through them, looking for root spans of the traces started with this sampling policy

This identification is done use sampler.type tag I believe

jack-berg commented 2 years ago

So instead of making changes in the SDK, would restricting the changes to jaeger remotesampler sdk-extension help?

It might be. Still trying to learn more about the problem. Specifically, I'm trying to understand if there is a standard around which attributes are expected. What backend are your services connecting to for jaeger remote sampling? Is it open source or something custom built? What's the minimal amount of information a backend would need in order to achieve your goal. For example, would it be sufficient to include a single attribute sampler.type=JaegerRemoteSampler?

MaheshGPai commented 2 years ago

The adaptive sampling part is a custom built solution. I'm not sure if there is any opensource one for opentelemetry. I came across this which is still open https://github.com/open-telemetry/opentelemetry-specification/issues/691 As for the minimal attributes required, the one you have mentioned (sampler.type) is good enough.

jack-berg commented 2 years ago

I think we should narrow add sampler.type attribute to the jaeger remote sampler, and punt on adding the other attributes until a use case comes up where they are needed.

trask commented 2 years ago

The adaptive sampling part is a custom built solution. I'm not sure if there is any opensource one for opentelemetry

@MaheshGPai check out https://github.com/open-telemetry/opentelemetry-java-contrib/blob/31b5b5eb21f1f5c7c3783ae00268e7d91ff903cd/consistent-sampling/src/main/java/io/opentelemetry/contrib/samplers/ConsistentSampler.java#L109-L118, I am using it with good results

MaheshGPai commented 2 years ago

@jack-berg That should be fine. I can update the PR (#4870) with this.

@trask Thanks a lot for pointing this. Looks interesting, will take a look. But one challenge is that we have mix of services on Opentracing & Opentelemetry. I guess till all services move to Opentelemetry we cannot completely rely on this? I'm yet to go through this is detail. Probably I'm missing something.

trask commented 2 years ago

we have mix of services on Opentracing & Opentelemetry

the simplest approach in this case is to do sampling at head node(s) of your distributed traces, and use a parent-based sampler for the downstream nodes which just respect the initial sampling decision.

though you don't get some of the extra benefits of the otel "consistent sampler" linked above, which include supporting span-to-metrics pipelines and ability to have different sampling rates on different nodes, unless you go all in with a "consistent sampler" across all nodes. here's the full docs on the otel "consistent sampler" approach (it's a heavy read 😅): https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md#tracestate-probability-sampling. also note that it's still in experimental status and will likely still undergo changes (so if you have a custom solution that's working for now that could be a good thing 👍)