Define what behavior is expected when X-B3-TraceId is present, but not X-B3-SpanId

codefromthecrypt commented 8 years ago

question here https://github.com/spring-cloud/spring-cloud-sleuth/issues/400#issuecomment-246866057 from @shalako

codefromthecrypt commented 8 years ago

X-B3-TraceId must be present with X-B3-SpanId to affect the trace identifiers. If they aren't specified together, whichever is present will be ignored and a new trace and span id will be provisioned.

https://github.com/twitter/finagle/blob/989edc58084fa01b6e6557ff5b6adda594ba469f/finagle-http/src/main/scala/com/twitter/finagle/http/Codec.scala#L371

codefromthecrypt commented 8 years ago

PS I know many folks aren't watching this repository, so I'll do a one-time spam for B3-compliant tracer authors. The aim is to capture B3 here and also ideally an integration test for it (maybe crossdock). I hope some of you will choose to watch this repository (so as to make sure things make sense)

@nicmunroe @felixbarny @yurishkuro @jcarres-mdsol @abesto @eirslett @kristofa @michaelsembwever @kevinoliver @mosesn @marcingrzejszczak @prat0318 @devinsba @basvanbeek @schlosna @ewhauser @klingerf

codefromthecrypt commented 8 years ago

nicmunroe commented 8 years ago

Apologies in advance if I'm missing something and the following is dumb.

In the case of receiving a span ID but no trace ID I can understand throwing stuff away and creating new IDs as I don't see any benefit to honoring a span ID without knowing what trace it's attached to. But in the case of receiving a trace ID without a span ID that's still potentially useful info; it's true that you wouldn't know which span ID to use as the parent, but just knowing that a span is part of a given request even if it's "detached" is potentially useful information in my mind.

This would only really come up in two cases:

Broken caller impl. In this case my concern above would apply. Yes, they are broken and should not send bad requests, but since we have a graceful fallback (honor trace ID and create new span ID) that provides useful info in the end (knowing that a span was part of a request even if you can't attach it to a parent) shouldn't we do the graceful thing? Not to mention that if you drop the trace ID you're losing out on log tagging using MDC from that point forward.
The caller wants you to use a specific trace ID for the request for debugging / call association purposes. We've run into cases where it's been incredibly useful for some clients (e.g. restrictive mobile apps) to be able to specify the trace ID when they make the initial request in order to force that trace ID for that request.

I'm probably missing some important points about "why". Let me know what you think.

shalako commented 8 years ago

Honoring the trace when present makes sense to me, too, whether or not a span is present.

codefromthecrypt commented 8 years ago

Prologue:

A reverse-documentation effort smokes out concerns in the original implementation. So, let me first mention that by documenting what's been the case for years doesn't mean I agree with it, or decided it. In things like this in Zipkin, I usually take the stance of first document what exists. Then, if important enough help push the many parties to change. The latter is very expensive even when people agree.

On requiring both, and not accepting degraded propagation:

The coupling was required since its original commit in 2011 https://github.com/twitter/finagle/commit/8939f31862af0abf5d88c79f5a8011ede6513bbe Perhaps @johanoskarsson remembers why, I wouldn't. My 2p it that it reasonable to not accept broken instrumentation, especially if you have no defined way to log "broken". We do have a way now, via the "error" annotation.

The way I see it is that people can choose to start a new span given broken propagation, but if we want to make this helpful, we should probably report that it is broken. One thing that has come up (for example reported by @kristofa) is that misleading traces are worse than no trace. One way we could do this is add either the existing "error" annotation or define a broken instrumentation annotation and use that.

@nicmunroe on part 2: If the user can send trace id, they can also send a span id, right? Let me know if https://github.com/openzipkin/b3-propagation/issues/4#issuecomment-247493176 doesn't cover that.

codefromthecrypt commented 8 years ago

@yurishkuro I think you have done some work in analyzing tracers.. do you have means to tag bad or out-of-date instrumentation? ex in finagle, the version is logged, so if there's an issue with one version, you could analyze on that.

yurishkuro commented 8 years ago

we do log the version of the tracing client library, but we do not bother processing malformed traces from the wire, we just start a new one. It's a bit different in our case because aside from baggage all of the span context is encoded as a single string in a single header, so the use case of "have trace id, don't have span id" is very unlikely to happen (unless the header itself got partially chopped off by some http proxy, but then we just won't parse it at all).

nicmunroe commented 8 years ago

@nicmunroe on part 2: If the user can send trace id, they can also send a span id, right? Let me know if #4 (comment) doesn't cover that.

@adriancole it does cover responsible callers who do their research before integrating, but it's happened more than once where a time-crunched team assumes they know what's required and only end up sending trace ID. The problem with simply dropping that on the floor is that the problem is usually only discovered during an important error debugging session, at which point it's too late.

Since we use MDC tagged logging to put trace IDs into every log message and use a log aggregator to be able to search for all log messages across all microservices related to a given trace ID when investigating that request, switching trace IDs halfway through the request when we technically don't have to would be a big problem. I know that's a secondary use case for tracing and not supposed to be the primary consideration, but it's proven itself to be so incredibly useful that I don't think we can throw broken implementors under the bus. After all, the broken implementors are the ones most likely to need the extra debugging capabilities, yes?

I don't think our use case should trump everything else and cause the spec to change - just trying to provide our perspective.

codefromthecrypt commented 8 years ago

@nicmunroe @jcarres-mdsol @shalako

Ok, well we can't rewrite history, but we can affect this moving forward. How about this?

X-B3-TraceId should be present with X-B3-SpanId to ensure that a span is placed at the correct place in the tree. Historical instrumentation start new traces when X-B3-SpanId is absent.

Instrumentation who tolerate absent X-B3-SpanId have the following behavior:

X-B3-SpanId is derived as the right-most 16 characters (64bits) of the X-B3-TraceId
The implicit span is assumed to be the root of the trace, so this works fine once per trace.

When multiple calls propagate the same X-B3-TraceId, but not X-B3-SpanId, multiple "sr" annotations will end up in the root span, corrupting it. If this happens, correct instrumentation to pass a unique X-B3-SpanId.

nicmunroe commented 8 years ago

@adriancole ok, a few things are clicking for me now and I think I see where some of the disconnect on my part was:

Wingtips generates a new random span ID for everything, including root span (it doesn't share trace ID for the root). So when a call comes in with trace ID specified but no span ID we end up creating a new unique span ID. There's never multiple root spans in our case - we'd just end up with a "disconnected" span that has no parent but is otherwise unique (which is unfortunate but not as bad as two spans sharing the same span ID).
The wingtips idiom is to have all spans be started and completed in the same process - we don't have cases where a span is started in one process and finished in another. So I'm guessing we suffer fewer negative consequences from continuing a broken trace (?).

Again, this is not to say that what wingtips is doing is better; I mainly want to make sure what we're doing isn't wrong or going to cause other nasty problems. Assuming what we're doing is ok then I'm happy with the documentation stating what happens historically along with what other options are acceptable.

codefromthecrypt commented 8 years ago

This sounds similar to grpc. In grpc, they propagate the parent (client) id only. Then, when it is received, the server uses that parent, generates a new span id for the server span, and moves on.

In B3 (today anyway), the caller propagates the span id for the RPC.

For single-host spans, the client span id is propagated, but used differently.

Let's say the client has parentId 1 and spanId 2. The client propagates its trace parent and span id. Doesn't matter if this is wingtips or not.

If the server wants to use shared spans, it uses exactly the same ids for the "sr" event.

If the server wants to do separate spans for client and server, it uses the propagated span id as its parent and generates as new id.

If there is no span id, with the change we mentioned, it just provisions a new span id (it doesn't have to use lower 64 bits of the trace even though that's conventional).

The catch is when a caller thinks the server is going to use the span id it propagated (because it assumes shared spans but the server doesn't do that). That id still exists (as it is recorded as the parent id of the new server span). However, it may not have any annotations in it at all (if the caller didn't report that id). What's the impact?

Well, there's no impact in zipkin. Zipkin has no search by span id anyways. We also have tests to cover headless trace (when the root span id is never reported, we treat the root-most as root)

There could be impact to log tools that assume all span ids end up in logs. This isn't something we have scope to impact at the moment, and in some cases it will still work. Ex in finagle the string logged includes all trace identifiers (trace parent and span id)

So... does this help clarify things about impact to wingtips (which is same as grpc and maybe jaeger I think)

On 19 Sep 2016 01:29, "Nic Munroe" notifications@github.com wrote:

@adriancole ok, a few things are clicking for me now and I think I see where some of the disconnect on my part was:

Wingtips generates a new random span ID for everything, including root span (it doesn't share trace ID for the root). So when a call comes in with trace ID specified but no span ID we end up creating a new unique span ID. There's never multiple root spans in our case - we'd just end up with a "disconnected" span that has no parent but is otherwise unique (which is unfortunate but not as bad as two spans sharing the same span ID). The wingtips idiom is to have all spans be started and completed in the same process - we don't have cases where a span is started in one process and finished in another. So I'm guessing we suffer fewer negative consequences from continuing a broken trace (?).

Again, this is not to say that what wingtips is doing is better; I mainly want to make sure what we're doing isn't wrong or going to cause other nasty problems. Assuming what we're doing is ok then I'm happy with the documentation stating what happens historically along with what other options are acceptable.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

nicmunroe commented 8 years ago

Yeah in wingtips the server always separates the spans by taking the client's span ID as the server's parent span and creating a new span ID for the server span. So it sounds like that's a reasonable thing to do and not unheard of. Thanks for taking the time for all these explanations!

yurishkuro commented 8 years ago

X-B3-SpanId is derived as the right-most 16 characters (64bits) of the X-B3-TraceId

that doesn't sound like a good practice. If the client app that only sends trace id makes a bunch of calls, each call will register as the same span id.

codefromthecrypt commented 8 years ago

X-B3-SpanId is derived as the right-most 16 characters (64bits) of the X-B3-TraceId

that doesn't sound like a good practice. If the client app that only sends trace id makes a bunch of calls, each call will register as the same span id.

Well, we were defining what bad practice would look like, as B3 expects a span id to be sent! :) I did document that this would register as the same span id, as if we think this through, what's the better option? When there's no parent, we have a choice of

make a mostly valid root span (by using a consistent span id function)

If they reuse that multiple times, they corrupt that span, but it will still be renderable.

make a mostly valid root span (by using an inconsistent span id function)

If they reuse it multiple times, it will appear as though there are multiple root spans (multiple spans with no parent). I'm not even sure if zipkin will return that.

I lean towards asking people to actually implement the specification, and if they don't, make it possible to see the error, rather than define an undefined shape (a trace with multiple roots). Am I missing something?

yurishkuro commented 8 years ago

@adriancole While I am intimately familiar with the pain of bad instrumentation, I agree that it's better to fix that instrumentation than to document weird scenarios of what should happen if half the info is missing. @nicmunroe 's argument that "you get burnt by bad instrumentation when you need it the most" is somewhat flawed, because if bad instrumentation is already in production, then the tracing system is already processing the bad data and can actually identify bad players proactively, without waiting for an outage to happen.

jcarres-mdsol commented 8 years ago

I'd say just error. Both headers are necessary.

nicmunroe commented 8 years ago

@yurishkuro I agree with you in principle that bad instrumentation is possible to detect proactively, but not every organization is mature enough or has built the proper tools and alerting to stay on top of that. In the wild many teams don't have a solid grasp of distributed tracing and will do the wrong thing due to not reading the spec, or a bug, or what have you (and it will always be worse for orgs as they get started).

I think it's fine for the spec to be strict and simply state it's an error to send trace ID without span ID, and not document weird scenarios. But in the spirit of "be conservative in what you send, be liberal in what you accept" robustness, wingtips will likely continue to handle that case by continuing the trace with a new random span ID as the disconnected span still allows for bad-instrumentation-detection, and having all parts of the request tagged with the same trace ID still provides tangible benefit for callers as they work through their instrumentation issues.

codefromthecrypt commented 7 years ago

ps I thought about this recently, and one thing about this thread is that it seems there's multiple valid policies, of which context decides what's best.

One way out is to document a few policies in practice, what's more used (if there's a more used) .. in something like an appendix.

Ex.

ExtractFailurePolicy.RESTART
- default and what exists on most tracers
ExtractFailurePolicy.ERROR
- throw error knowing it might break apps, like having assertions or linting on
ExtractFailurePolicy.TAG_AND_RESTART
- add a tag to the span including the cause of the extraction failure, but restart normally

I'd concede that most of these choices are library and possibly span format specific, but anyway food for thought.

this recently broke out from separate discussions with @basvanbeek and @pavolloffay on how to handle propagation nuance, as these sorts of things can be insidious in practice

openzipkin / b3-propagation

Define what behavior is expected when X-B3-TraceId is present, but not X-B3-SpanId #3