w3c / trace-context

Trace Context
https://w3c.github.io/trace-context/
Other
470 stars 76 forks source link

Do we need sampling to be more than one bit in trace-parent to define priority #8

Closed codefromthecrypt closed 6 years ago

codefromthecrypt commented 7 years ago

The first draft of the http header spec had 140 comments in various areas. Not all came to closure. One that remained was if we do anything about vendor-specific sampling priority data, or relegate that as a stakeholder in the key/value aka baggage functionality.

Here's background from the original pull request (#1)

One line of comment started from @bhs This discussion was about flags and why we had initially 4bytes allocated, more room implied problems with interpretation such as endianness. Bogdan had reserved extra space as some proposed they needed room for sampling. This was reduced to a single byte and a question raised about if we should extend if for sampling priority. This particular line of comment went quiet.

Another started afterwards from @SergeyKanzhelev Sergey suggested that since sampling implementation would be vendor-specific, why not use a separate header? For example, not all vendors do consistent sampling per trace (ex some do sampling again at span level). Yuri mentioned benefits of a consistent trace-level algo. Sergey agreed with this and also that a sample bit as a force-trace operation is useful. However, he believed this should be a "debug" flag, not a sampling one. He suggests the sampling info should be vendor-specific.

After some chatter from Adrian, @SergeyKanzhelev maintained "I still think it would make sense to detach sampling/debug flag from the identity information."

The last comment on this sampling was from @bhs, when mentioned that one can stash the sampling rate (or reciprocal sampling rate) if arbitrary key/value propagated tags were supported somehow, and that presence of a sampled bit could be ignored/overlooked when they don't match a model.

tylerbenson commented 7 years ago

Overall I like the direction of this project. I think there would be significant value in keeping the core context propagation simple, but allow it to be extended for additional details that aren't central to maintaining a consistent trace flow. Examples that have been brought up are:

I think being able to handle these kind of things should be in the scope of this spec, but I am open to discussion on whether that is handled in one header or a separate standardized header (or set of headers).

bogdandrutu commented 7 years ago

I would like to keep this header as minimal as possible, with common things that are shared between multiple companies/implementations. If we achieve consensus to send trace_id/span_id/options in a common header each vendor can use a different vendor for their specific things and use this for these values. This will guarantee them that if on the other end they are talking for example with a cloud provider like Google or Azure tracing can be used to tie client requests with internal cloud requests for the moment.

tylerbenson commented 7 years ago

The problem I have with saying the spec will only help maintain context of the common shared info is it creates a huge barrier for vendors who need to pass along something not in the common area. They basically have to recreate all the functionality for maintaining the context in each transport type themselves. This is why I proposed including a header for both the efficient standardized common info, and the flexible vendor specific context data.

yurishkuro commented 7 years ago

@tylerbenson could you please elaborate? I am not sure I follow why it creates a "huge barrier" for vendors. The way I see it, a trace context is a combination of some metadata that can be standardized on, like the trace/span IDs in this spec, some baggage (which I think the spec also needs to standardize on), and vendor-specific metadata. Nothing prevents vendors from sending additional headers with their custom metadata, but if the request crosses the boundary between services instrumented with different tracing vendors, the common standard metadata can still be understood due to this spec, and vendor-specific metadata will be lost, which is unavoidable since you're entering a different vendor space anyway.

If your concern is with a case of A->B->C service calls where A and C are using tracing X and B uses tracing Y, and you want vendor-specific data to reach C, then I think this can be done by placing that metadata into baggage (which is why it's important to standardize on).

bogdandrutu commented 7 years ago

@yurishkuro +1

I started a new issue with open-questions about baggages/tags https://github.com/TraceContext/tracecontext-spec/issues/13

tylerbenson commented 7 years ago

@yurishkuro my question... will vendors have to come up with their own serialization and propagation for vendor specific metadata (do I need to get the request object, pull my info from the header, etc), or will the spec define how and where vendors should store data and provide ways for interaction with that standardized format via an API? Perhaps vendors could just use the existing "baggage" for this purpose?

yurishkuro commented 7 years ago

@tylerbenson this spec is only concerned with describing how the context looks on the wire, it has no opinion on how implementations get it into that format. A specific example of the format is here: https://github.com/TraceContext/tracecontext-spec/issues/13#issuecomment-329980665. The point of baggage is that implementations propagate it "no questions asked", so if say your implementation needs to propagate some additional (stable per trace) metadata like sampling rate, you can add it as one of the baggage items and be reasonably sure it will come back to you. That's assuming you have a case where services A1..An and C1..Cn are instrumented with tracing system X and service B1...Bn are instrumented with tracing system Y and the flow is Ai->Bj->Ck. For interactions Ai->Aj you don't even need this Spec as you can pass your proprietary metadata via any custom header, but it will get lost when you go Ai->Bj.

One danger with baggage is key clashes, we should also discuss some name spacing technique (I'll comment on #13).

codefromthecrypt commented 7 years ago

During the tracing workshop in Sydney, we discussed making sampling status a mandatory part of the trace options. Sampling recommendation as a tri-state value of deferred, true or false seems reasonable to all parties (allocated of 2 bits in the bitset). This included everyone present at the workshop including Google, Amazon, New Relic, DataDog, Dynatrace, Bas, Mick, Adrian and folks at Tyro.

AloisReitbauer commented 6 years ago

Does this one still matter after our latest decisions on trace-parent and trace-state? This should be a vendor-specific problem now.

SergeyKanzhelev commented 6 years ago

@AloisReitbauer this question was more like "is a single sampling flag in traceparent is too little or too much"? On one hand we do not need this flag as every vendor free to decide by herself. On other hand - if we putting it there should be put more generic number that allows more values?

I think we can keep it as it for now. I'm ok to close this issue so it can be re-opened later if needed

yurishkuro commented 6 years ago

The way I read this issue is it's primarily about extending the sampled bit with extra data like sampling priority (not the sampling probability!), e.g. indicating not just the upstream's desire to sample but the degree of sampling such as light-weight vs. heavy profiling. I'd be interested to hear if any existing tracers actually do this, as quite a few tracers I'm familiar with don't support this.

codefromthecrypt commented 6 years ago

Here with @tylerbenson and we need to clarify at least if the options field should be treated as a bitset or not. For example, if so, we can reserve the least significant bit for the purpose it has now, but also allow other options to combine with it. For example '3' would also be sampled.

SergeyKanzhelev commented 6 years ago

@adriancole trace options is already described as bitset. Does your question read as "can tracing system decide to use one of the flags for itself"? I.e. make it free for grabs and interpretation? Or you want to specify another bit purpose in specification for everybody to follow?

codefromthecrypt commented 6 years ago

I added this to clarify that trace-options is a bit field. It is mostly correct just doesn't introduce it as a bit field. Experience in X-B3-Flags suggest programmers routinely get bit fields wrong so added some advice as well https://github.com/w3c/distributed-tracing/pull/116

codefromthecrypt commented 6 years ago

There are two parts:

If the latter is true, then folks like datadog need to encode their extra bit into a custom format. Otherwise they could fit into the generic one.

yurishkuro commented 6 years ago

folks like datadog need to encode their extra bit

which extra bit do they have?

codefromthecrypt commented 6 years ago

one bit extra is very similar to zipkin debug flag. basically it emphasizes not only to sample, but especially try to keep it when dropping others

On Sat, 21 Apr 2018, 11:10 Yuri Shkuro, notifications@github.com wrote:

folks like datadog need to encode their extra bit

which extra bit do they have?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/w3c/distributed-tracing/issues/8#issuecomment-383263335, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD614oBd2a8nB9y0h3ybpUh--kAuK5uks5tqqMKgaJpZM4PRiFF .

SergeyKanzhelev commented 6 years ago

The difference I understand between sampling and debug flags is the need to gain up. When sampling is set and you collected the span you need to gain up values if you calculate any aggregation function based on this span. Like average duration. If you collected because of a debug flag - you count it as a single request. Is it the same for datadog? Or it's simply force instead of recommendation?

SergeyKanzhelev commented 6 years ago

It is very beneficial to share the sampling approach among components. Even with different tracers. So respecting bit flag and having it in the main spec is a goodness.

For systems where you are paying for telemetry per component a single flag approach will not work unless those components will share extra hint on sampling priority. A good solution is to propagate "sampling score" with the trace (more than a bit).

It will be useful to standardize on sampling score format so it can be shared between different vendors. So it calls for tracestate entry that is a property of trace, not a vendor-specific position in a graph.

Do you know of anybody propagating sampling configuration alongside the trace. So gain ups may be calculated in the downstream components correctly?

codefromthecrypt commented 6 years ago

Tyler could speak more on DD but yesterday he was quite clear on all of this being recommendation because the agent or otherwise may still choose to override the "priority" decision and drop data.

CodingFabian commented 6 years ago

From Instana point of view, we support non-binary sampling / trace details internally, but we never really needed that. if this would be part of traceparent, the semantics would need to be "clear" across vendors, which seems ambitious. So our stance right now is not to have a non-binary sampling/priority flag.

beberlei commented 6 years ago

is this interesting? A binary flag could say “we want this more than other traces” like in Zipkin mapping to the debug flag. Question: Do we need more bits to encode more information? From API side in OT: Often questions about how to interpret this information by users. Adding this would require all implementers to at least pass it on. Vendors would probably just pass on and ignore and transport this information in their TraceState. Security reasons were added, because it could be used for DoS. Two understanding, not resolved for now.

tylerbenson commented 6 years ago

Sorry for the delay, but to be clear, I wasn't suggesting we add the additional (vendor specific) flag into the bitfield. We should be able to include that added information in tracestate.

SergeyKanzhelev commented 6 years ago

Two flags are defined now that may control sampling. First indicate whether recording was requested and another indicates whether parent recorded this span. Closing this issue. Let's re-open issues for specific concerns. I believe original concerns were addressed.