open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Extra features supported by Jaeger clients #231

Open yurishkuro opened 4 years ago

yurishkuro commented 4 years ago

It would be great if in the future we could decommission the Jaeger client libraries, which take non-trivial effort to support in all languages, and replace them with OpenTelemetry SDKs. @bogdandrutu asked me to enumerate the additional features supported by Jaeger clients that are not currently supported by OpenTelemetry, to inform the roadmap after v1.

Remotely configurable sampling

Jaeger clients usually consult the Jaeger backend for the sampling strategies to use. This is implemented as polling along the path client -> agent -> collector, usually once a minute. The sampling can be statically configured on the backend, or automatically calculated to meet certain throughput goals. It is controlled at the granularity of service + operation (aka span name), so that services (like API gateways) whose endpoints have vastly different QPS can sample each endpoint appropriately.
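To make the service + operation granularity concrete, here is a minimal sketch of a per-operation probabilistic sampler whose rates can be swapped out by a poller. It is not the actual Jaeger client code: `PerOperationSampler`, `poll_once`, and the shape of the strategy dictionary are all hypothetical names chosen for illustration.

```python
import random
import threading
from typing import Callable, Dict, Optional


class PerOperationSampler:
    """Probabilistic sampler keyed by operation (span) name."""

    def __init__(self, default_rate: float,
                 per_operation: Optional[Dict[str, float]] = None) -> None:
        self._lock = threading.Lock()
        self._default_rate = default_rate
        self._per_operation = dict(per_operation or {})

    def update(self, default_rate: float, per_operation: Dict[str, float]) -> None:
        """Replace all rates at once, e.g. after a successful poll."""
        with self._lock:
            self._default_rate = default_rate
            self._per_operation = dict(per_operation)

    def should_sample(self, operation: str,
                      rng: Callable[[], float] = random.random) -> bool:
        """Sample with the operation's own rate, else the service default."""
        with self._lock:
            rate = self._per_operation.get(operation, self._default_rate)
        return rng() < rate


def poll_once(sampler: PerOperationSampler,
              fetch_strategy: Callable[[], dict]) -> None:
    """Fetch the current strategy and apply it.

    `fetch_strategy` is a hypothetical callable standing in for the
    client -> agent -> collector request; it is assumed to return
    {"default_rate": float, "per_operation": {name: rate}}.
    """
    strategy = fetch_strategy()
    sampler.update(strategy["default_rate"], strategy.get("per_operation", {}))
```

A real client would call something like `poll_once` from a background timer roughly once a minute and keep the previous rates if the request fails.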

Firehose mode

Jaeger trace state contains a flag that indicates a firehose mode, in which traces are written to cheap storage and only accessible by trace ID, without indexing. This is useful when there are other upstream means of locating traces (e.g. trace ID is logged as part of an integration test), and allows higher throughput in the storage layer compared to fully indexed traces.

Setting debug flag

Jaeger trace state has a debug flag that tells the backend to try its best to sample the trace. For example, if the backend implements additional consistent downsampling (for capacity control), traces with the debug flag bypass this downsampling.

At the API level, this is done by setting the sampling.priority=1 tag on the root span.

In addition, the debug flag can be set by the user even before the trace is created, by including a special header jaeger-debug-id: anything. When Jaeger sees this header in the incoming request, it's equivalent to setting sampling.priority=1 and jaeger-debug-id=$value tags on the span. Storing the header value as a correlation ID allows finding the trace later. E.g. I can send a curl request with jaeger-debug-id: yuri-test-1.
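The header-to-tags equivalence described above can be sketched as follows. This is only an illustration of the mapping, not the Jaeger client implementation; the function name `debug_tags_from_headers` is hypothetical.

```python
from typing import Dict, Mapping

DEBUG_HEADER = "jaeger-debug-id"


def debug_tags_from_headers(headers: Mapping[str, str]) -> Dict[str, object]:
    """Return the root-span tags implied by a jaeger-debug-id header.

    HTTP header names are case-insensitive, so the lookup ignores case.
    An absent or empty header yields no tags.
    """
    for name, value in headers.items():
        if name.lower() == DEBUG_HEADER and value:
            # Force sampling and keep the header value as a correlation ID.
            return {"sampling.priority": 1, DEBUG_HEADER: value}
    return {}
```

For a request sent with `curl -H 'jaeger-debug-id: yuri-test-1' ...`, the resulting root span would carry both tags, so the trace can later be found by searching for `jaeger-debug-id=yuri-test-1`.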

Setting baggage

Similar to the debug flag, there is a header jaeger-baggage: k=v,k=v that can be set by the user before the trace even exists.
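A minimal sketch of parsing the k=v,k=v format, assuming plain comma-separated pairs with no escaping or URL-encoding (the thread does not specify either). The helper name `parse_baggage_header` is hypothetical.

```python
from typing import Dict


def parse_baggage_header(value: str) -> Dict[str, str]:
    """Parse a 'k=v,k=v' header value into a baggage dictionary.

    Items without '=' or with an empty key are skipped. Escaping and
    URL-decoding are deliberately not handled in this sketch.
    """
    baggage: Dict[str, str] = {}
    for item in value.split(","):
        key, sep, val = item.partition("=")
        key = key.strip()
        if sep and key:
            baggage[key] = val.strip()
    return baggage
```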

Baggage restrictions

This one is a bit iffy in terms of usefulness, but Jaeger clients also support a remotely configurable way to restrict which services can set which baggage keys, as well as key/value lengths, etc.

Ad-hoc sampling policies

This is currently a work in progress that I mentioned on the Sampling RFC. It's similar to Facebook's feature where users can centrally configure ad-hoc sampling policies to collect data exhibiting certain patterns, e.g. a specific tag or a header or a combination. Note that this is not after-the-fact sampling like "sample if there is an error or unusual latency"; our ad-hoc sampling is still mostly upfront. The main reason I mention it, even though it doesn't exist yet in Jaeger, is that it requires certain changes to the Sampler API in the SDK so that it can take into account various pieces of the span data like tags, etc.
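To illustrate why the Sampler API would need access to span data, here is a rough sketch of an upfront sampler that checks tags against centrally configured policies before falling back to the regular decision. The feature does not exist yet, so everything here is an assumption: the policy format (exact tag matches) and the names `matches_policy` and `should_sample` are hypothetical.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical policy format: tag key -> required value (exact match).
Policy = Mapping[str, str]


def matches_policy(tags: Mapping[str, str], policy: Policy) -> bool:
    """True if every key/value in a non-empty policy is present in the tags."""
    return bool(policy) and all(tags.get(k) == v for k, v in policy.items())


def should_sample(tags: Mapping[str, str],
                  policies: Sequence[Policy],
                  fallback: Callable[[], bool]) -> bool:
    """Sample if any ad-hoc policy matches; otherwise defer to the fallback.

    The decision is made upfront, at span start, which is why the sampler
    must be handed the tags known at that point.
    """
    if any(matches_policy(tags, p) for p in policies):
        return True
    return fallback()
```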

bogdandrutu commented 4 years ago

Thanks, I will split this into individual issues to be addressed by the OpenTelemetry implementations.

One quick question from what I read: what is the interaction between baggages and traces?

yurishkuro commented 4 years ago

There isn't much interaction. We usually log the baggage to the span when it is set, but aside from that baggage is a runtime thing.

pavolloffay commented 4 years ago

One missing piece of functionality is the ability to configure the client via environment variables (see https://www.jaegertracing.io/docs/1.13/client-features/). This has proven useful for cloud/containerized deployments.
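As a rough sketch of what environment-based configuration looks like, the snippet below reads a few of the `JAEGER_*` variables documented for Jaeger clients into a config object. The `ClientConfig` type, the `config_from_env` helper, and the default values used here are illustrative assumptions, not the behavior of any particular client.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ClientConfig:
    service_name: str
    agent_host: str
    agent_port: int
    sampler_type: str
    sampler_param: float


def config_from_env(env: Mapping[str, str] = os.environ) -> ClientConfig:
    """Build a config from JAEGER_* variables.

    The fallback values are assumptions made for this sketch.
    """
    return ClientConfig(
        service_name=env.get("JAEGER_SERVICE_NAME", "unknown_service"),
        agent_host=env.get("JAEGER_AGENT_HOST", "localhost"),
        agent_port=int(env.get("JAEGER_AGENT_PORT", "6831")),
        sampler_type=env.get("JAEGER_SAMPLER_TYPE", "remote"),
        sampler_param=float(env.get("JAEGER_SAMPLER_PARAM", "0.001")),
    )
```

In a containerized deployment the same image can then be reconfigured per environment, for example by setting `JAEGER_SERVICE_NAME` and `JAEGER_SAMPLER_PARAM` in the pod spec, without code changes.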

bogdandrutu commented 4 years ago

@yurishkuro for the:

Baggage restrictions

This one is a bit iffy in terms of usefulness, but Jaeger clients also support a remotely configurable way to restrict which services can set which baggage keys, as well as key/value lengths, etc.

Does the length apply to incoming baggage key/value or only the ones set by the process? For the baggage keys configuration, what is the behavior if the code tries to add a key that is not allowed by the config?

bogdandrutu commented 4 years ago

@pavolloffay the configuration can live in the Jaeger "exporter". I think "exporter" is the wrong name; maybe we should rename it to Jaeger "client".

The idea is that we have the SDK that allows:

Then the Jaeger "client" will depend on the SDK and:

pavolloffay commented 4 years ago

@bogdandrutu I have also created a generic issue for SDK configuration https://github.com/open-telemetry/opentelemetry-specification/issues/232. Some configuration is indeed Jaeger-specific; however, some properties apply to the whole SDK: specifying the resource (service name...), the reporter to use, propagation...

Starts a new timer that every 60 seconds reads the sampling config from the Jaeger backend, and if anything changes, updates the SDK trace config;

That would be nice; we didn't have config watchers in Jaeger.

yurishkuro commented 4 years ago

@bogdandrutu

Does the length apply to incoming baggage key/value or only the ones set by the process?

We've only implemented restrictions of baggage items set by the process, at the time they are set. No restrictions on propagated baggage.

For the baggage keys configuration what is the behavior if the code tries to add a key that is not allowed by the config?

It is not set, and a log entry is added to the span (log entry is added in all cases, btw).
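The behavior described in these answers can be sketched roughly as below: restrictions are checked only when the process sets an item, a disallowed key is dropped, and a log entry is written either way. The `Span` type and `set_baggage` helper are hypothetical, and truncating over-long values is an assumption, since the thread does not say what happens when a value exceeds the configured length.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass
class Span:
    """Minimal stand-in for a span: just baggage and log entries."""
    baggage: Dict[str, str] = field(default_factory=dict)
    logs: List[dict] = field(default_factory=list)


def set_baggage(span: Span, key: str, value: str,
                allowed_max_len: Mapping[str, int]) -> bool:
    """Set a baggage item subject to restrictions; return True if it was set.

    `allowed_max_len` maps each permitted key to its maximum value length.
    Propagated (incoming) baggage never passes through this function, so it
    is not restricted.
    """
    allowed = key in allowed_max_len
    if allowed:
        # Assumption: over-long values are truncated rather than rejected.
        value = value[: allowed_max_len[key]]
        span.baggage[key] = value
    # A log entry is added in all cases, whether or not the key was allowed.
    span.logs.append({"event": "baggage", "key": key,
                      "value": value, "invalid": not allowed})
    return allowed
```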

austinlparker commented 2 months ago

Due to the age of this issue, the GC is interested in how many of these topics are still relevant, what your current requirements are (if they have changed), and whether this issue could be split into smaller issues.

yurishkuro commented 2 months ago

Remote sampling configuration is still important, others are nice to have but I did not hear much demand for them, including from Uber folks who migrated to OTEL SDKs (cc @vprithvi).

austinlparker commented 2 months ago

Remote sampling configuration is still important, others are nice to have but I did not hear much demand for them, including from Uber folks who migrated to OTEL SDKs (cc @vprithvi).

Would the remote sampling ask be handled by OpAMP support in the SDKs/collector?