opentracing / opentracing.io

OpenTracing website
https://opentracing.io
Apache License 2.0

non-RPC spans and mapping to multiple parents #28

Closed codefromthecrypt closed 7 years ago

codefromthecrypt commented 8 years ago

One of my goals of working in OpenTracing is to do more with the same amount of work. For example, when issues are solved in OpenTracing, and adopted by existing tracers, there's a chance for less Zipkin interop work, integrations and maintenance. Zipkin's had a persistent interoperability issue around non-RPC spans. This usually expresses itself as multiple parents, though often also as "don't assume RPC".

In concrete terms, Zipkin V2 has a goal to support multiple parents. This would stop the rather severe signal loss from HTrace and Sleuth to Zipkin, and of course address a more fundamental concern: the inability to express joins and flushes.

In OpenTracing, we call certain things out explicitly and leave other things implicit. For example, the existence of a span id at all is implicit, except for the side-effect that we split the encoded form of context into two parts. We certainly call out features explicitly, like "finish", and of course these depend on implicit functionality, such as harvesting duration from a timer.

Even if we decide to relegate this to an FAQ, I think we should discuss multiple parents, and api impact. For example, are multiple parents tags.. or attributes? Does adding parents impact attributes or identity? Can an HTrace tracer be built from an OpenTracing one without signal loss? Are there any understood "hacks" which allow one to encode a multi-parent span effectively into a single-parent one? Even if we say "it should work", I'd like to get some sort of nod from a widely-used tracer who supports multiple parents.

The practical impact of this is that we can better understand in Zipkin whether this feature remains a zipkin-specific interop story with, for example HTrace, or something we leverage from OpenTracing.

For example, it appears that in AppNeta, adding a parent (or edge) is a user-level task, and doesn't seem to be tied to in-band, i.e. propagated, fields. @dankosaur is that right?

In HTrace, you add events and tags via TraceScope, which manages a single span, which encodes into its id a single primary parent. You can access the "raw" span, and assign multiple parents, but this doesn't change the identity of the span, and so I assume doesn't impact propagation. @cmccabe is that right?

I'm sure there are other multiple-parent tracers out there.. I'd love to hear who's planning to support OpenTracing and how that fits in with multiple parents.

bhs commented 8 years ago

@adriancole I'm glad you brought this up – an important topic. I am going to dump out some ideas I have about this – no concrete proposals below, just food for thought.

<ramble>

I was careful to make sure that a dapper- or zipkin-like parent_id is not reified at the OpenTracing level... that is a Tracer implementation concern. That said, the Span and TraceContext APIs show a bias for single-parentage traces (if this isn't obvious I can elaborate). OpenTracing docs even describe traces as "trees" rather than "DAGs".

In the current model, multiple parents could be represented as Span tags or – I suppose – as log records, though that latter idea smells wrong. Trace Attributes do not seem like the right fit since parentage relationships are a per-Span rather than per-Trace concern. (On that note: IMO the parent_id should never be a part of the TraceContext as there's no need to send it in-band over the wire... it can just be a Span tag.)

Let me also throw out this other related use-case that I think about often: delays in "big" executor queues, e.g. the main Node.js event loop. If each such executor has a globally unique ID and spans make note of those unique IDs as they pass through the respective queue, a sufficiently dynamic tracing system can explain the root cause of queuing delays (which is an important problem that is usually inscrutable). To be more concrete, suppose the following diagram illustrates the contents of a FIFO executor queue:

    [  C  D  E  F  G  H  I  J  K  L  ]
                                  ^-- next to dequeue and execute

Let's say that the Span that enqueued C ends up being slow because the items ahead of it in this queue were too expensive. In order to truly debug the root cause of that slowness (for C), a tracing system should be talking about items D-L... at least one of them took so long that the wait to get to the front of the executor queue was too long.

So, the big question: is C a parent for D-L? After all, it is blocked on them, right? And if C is a parent, what do we say about the more direct/obvious parents of D-L, whatever they are?

Anyway, this example is just meant to provide a practical / common example of tricky causality and data modeling. There are analogous examples for coalesced writes in storage systems, or any time batching happens, really.

</ramble>

yurishkuro commented 8 years ago

I think this should be another page on the main website - recipes for handling the scenarios mentioned above, and others we discussed in various issues, like marking a trace as "debug". The goal of OpenTracing is to give instrumenters a standard language to describe the computation graph shape, regardless of the underlying tracing implementation, so we cannot give answers like "this is implementation specific" or "this could be done like this" - the answer needs to be "this is done this way", otherwise instrumenters can walk away none the wiser.

Of course, it is also helpful to know the exact use case the user is asking about. For example, it's not clear to me that the queueing/batching @bensigelman describes is a use case for multiple parents. The main answer users want is why their span took so long to finish. So the investigation could be done in two steps: first, the span logs the time when it was enqueued and when it was dequeued and executed. If the gap is large, that already indicates a delay on the event loop. To investigate the delay, the user can run another query asking for spans that took too long to actually execute once dequeued, and thus delayed everybody else. A very intelligent tracing system may be clever enough to auto-capture the items ahead in the queue based on the global ID of the executor, but we still need a very precise recipe telling instrumenters exactly what they need to capture, regardless of the underlying tracing implementation.
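
As a rough sketch of that first step (hedged: this assumes an OpenTracing-style span with a log_event(event, payload) method, as used elsewhere in this thread; the event names are purely illustrative):

import time

# Illustrative only: the span notes when work was enqueued and when it was
# dequeued/executed; a large gap between the two events points at event-loop delay.
def submit(queue, span, task):
    span.log_event("enqueued")
    queue.append((span, task))

def run_next(queue):
    span, task = queue.pop(0)
    span.log_event("dequeued")
    start = time.time()
    task()
    span.log_event("executed", payload={"duration": time.time() - start})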

Going back to the multi-parent question, do we understand which scenarios actually require it?

As for capturing multiple parents, I would suggest using span tags for that, i.e. we declare a special ext tag and do

for parent in parent_spans:
    span.set_tag(ext.tags.PARENT, parent.trace_context)

(which is another reason I was lobbying for string->[]any, and it's still possible to do the above based on the current API and semantics).

bhs commented 8 years ago

@yurishkuro yes, my point was not that the queue is a good fit for multiple-parentage (it's not) but more to redirect the conversation around motivating scenarios rather than a particular DAG structure. This is very much in line with your suggestion (which I heartily endorse) that we provide sensible, opinionated guidance for best-practice instrumentation of certain archetypical scenarios: RPC boundaries, high-throughput executor queues, coalescing write buffers, whatever.

As for relying on the "intentionally undefined" semantics of multiple calls to set_tag, I would perhaps prefer forcing the multiple parents to be set all at once and just string-joining them into a single tag value. This would be the "making the hard things possible" half of the old "make the easy things easy and the hard things possible" adage (i.e., it's admittedly clumsy):

parent_span_ids = map(lambda s: str(s.trace_context.span_id), parent_spans)
span.set_tag(ext.tags.MULTIPLE_PARENTS, strings.join(parent_span_ids, ","))
yurishkuro commented 8 years ago

Why not just pass the array of parent trace contexts?

span.set_tag(ext.tags.PARENTS, [parent.trace_context for parent in parent_spans])

Passing string IDs is not portable, we don't even have trace_context.span_id as a formal requirement in the API.

bhs commented 8 years ago

I was respecting the BasicType-ness of set_tag's second parameter... I was mainly just illustrating the concatenation/joining. (Coercing a TraceContext into a BasicType or string is a problem regardless of set_tag/add_tag)

dkuebric commented 8 years ago

Not to beat a dead horse, but I agree that queue depth is not a good use-case for multiple parents. (It's tempting, for instance, to take that on a slippery slope all the way up to OS-level scheduling!) IMO distributed tracing is about understanding a single flow of control across a distributed system--concurrent work may factor into how that request was handled, but tracing should keep the unit of work being traced as the center of reported data, with info in a trace being "relative to" that trace.

The use-cases I see for multiple parents are around asynchronous work done in a blocking context--in the service of handling a single request or unit of work (to distinguish from the event loop case above). It is tempting to say that the join is optional, because someone reading the trace can probably infer a blocking/nonblocking relationship from the trace structure. However, the join is valuable information for those unfamiliar with the system who are reading individual traces, or for a tracer which wants to do more sophisticated analysis on the corpus of traces, because it signals where asynchronous work is actually blocking critical path in a manner which is less open to interpretation.

Some examples we see commonly in web apps are libcurl's curl_multi_exec (and the many libraries that wrap it), or libraries which are async at the underlying implementation level but actually end up being used synchronously a lot of the time (spymemcached). Instrumenting these to capture both use-cases benefits from being able to distinguish between the two execution patterns.
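
To make that join concrete, here is a hedged sketch (it assumes a tracer with a start_span method and the ext.tags.PARENT tag proposed earlier in this thread; none of this is settled API):

# Hypothetical fan-out/join instrumentation, in the spirit of curl_multi_exec.
def traced_multi_fetch(tracer, urls):
    fetch_spans = [tracer.start_span(operation_name="fetch") for _ in urls]
    # ... issue all fetches concurrently, then block until every one completes ...
    for span in fetch_spans:
        span.finish()

    # The join span records each parallel fetch as a causal parent, signalling
    # that this point blocked on all of them (the multi-parent case).
    join_span = tracer.start_span(operation_name="join")
    for span in fetch_spans:
        join_span.set_tag(ext.tags.PARENT, span.trace_context)
    join_span.finish()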

In AppNeta's X-Trace implementation, multi-parent is also used to record the ID of a remote (server-side) span event when it replies to the client agent. This is largely because the methodology is based on events instead of spans. For instance, a remote call made with httplib in python would involve 2 events if the remote side is not instrumented (httplib entry, httplib exit), or 4+ if the remote side is instrumented. The httplib exit event would have edges to both the httplib entry and remoteserver exit in that case.

I like the idea of supporting this type of behavior, but it seems less pressing in a span-oriented world. The main argument I can see is an understanding of blocking critical path vs not in analysis of traces. I'm curious: are there other arguments for multi-parent out there? What is this used for in HTrace world?

(Also @bensigelman can you clarify your comment about not serializing parent_id? If the TraceContext is what is serialized going across the wire, shouldn't it hold a previous ID? I am probably missing something obvious here..)

bhs commented 8 years ago

@dankosaur per your question about parent_id: If we're using a span-based model, IMO an RPC is two spans, one on the client and one on the server. The client span's TraceContext is sent over the wire as a trace_id and span_id, and that client span_id becomes the parent_id of the server span. Even if a single span is used to model the RPC, as long as the client logs the span's parent_id there should be no need for the server to log it as well (so, again, no need to include it in-band with the RPC payload). Hope that makes sense... if not I can make a diagram or something.

dkuebric commented 8 years ago

Thanks, that makes sense--the span_id becomes the parent id of the receiving span. It's the same way in X-Trace.

cmccabe commented 8 years ago

As Adrian mentioned, in HTrace, we allow trace spans to have multiple parents. They form a directed acyclic graph, not necessarily a tree.

One example of where this was important is the case of writing data to an HDFS DFSOutputStream. The Java stream object contains a buffer. This buffer will be flushed periodically when it gets too big, or when one of the flush calls is made. The call to write() will return quickly if it is just storing something to the buffer, so a single flush ends up doing work on behalf of every write that contributed to the buffer.

Another example is in HBase. HBase has a write-ahead log, where it does "group commit." In other words, if HBase gets requests A, B, and C, it does a single write-ahead log write for all of them. The WAL writes can be time-consuming since they involve writing to an HDFS stream, which could be slow for any number of reasons (network, error handling, GC, etc. etc.).

What both of these examples have in common is that they involve two or more requests "feeding into" a single time-consuming operation. I think some people in this thread are referring to this as a "join" since it is an operation that joins several streams of execution (sometimes quite literally, by using Executors or a fork/join threading model).

We had a few different choices here:

1. Arbitrarily assign the "blame" for the flush to a single HTrace request. In the DFSOutputStream, this would mean that we would ignore DFSOutputStream buffer flushes unless the HTrace request had to wait for them. In HBase, what we would do is rather less clear-- the requests that are being coalesced into a "group WAL commit" don't necessarily have any user-visible ordering, so the choice of which one to "blame" for the group commit would be completely arbitrary from the user's point of view.

   In a world where we're using less than 1% sampling, solution #1 would mean that relatively few HDFS flushes would ever be traced. It also means that if two traced writes both contributed to a flush, only one would take the "blame." For HBase, solution #1 would mean that there would be a fair number of requests that would be waiting for the group commit, but have no trace spans to reflect that fact.

   Solution #1 is simple to implement. As far as I can tell, most distributed tracing systems took this solution. You can build a reasonable latency-outlier analysis system this way, but you lose a lot of information about what actually happened in the system.

2. Denormalize. If two traced writes came in, we could create "separate trees" for the same flush. This solution is superficially attractive, but there are a lot of practical difficulties. Clearly, it increases the number of spans exponentially with each branching point. Since we had this problem at multiple layers of the system, this was not an attractive solution.

3. A more complex data model that had "extra edges" beyond the parent/child relationships we traditionally used. For example, HDFS flushes could become top-level HTrace requests that were somehow associated with other requests (perhaps by some kind of "extra ID"). The problem with this is that your tooling becomes much more complex and project-specific. It's already hard enough to explain the current simple data model to people without making it even more complex and domain-specific. We also have multiple layers at which this problem happens, so it would become harder for even experts to follow a single request all the way through the system.

4. Support multiple parents. This wasn't difficult at the model layer. It made some things more challenging at the GUI layer, but not by much. Our programmatic interface for adding multiple parents is still a bit awkward-- this is something we might want to work on in the future.
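
For concreteness, a rough sketch of what option 4 looks like from the instrumentation side (illustrative only; the add_parents call mirrors what HTrace's raw span allows, not a settled API):

# Three traced writes feed one flush, so the flush span ends up with three
# parents and the trace becomes a DAG rather than a tree.
write_a = tracer.start_span("DFSOutputStream.write")   # returns quickly, data is buffered
write_b = tracer.start_span("DFSOutputStream.write")
write_c = tracer.start_span("DFSOutputStream.write")

flush = tracer.start_span("DFSOutputStream.flush")     # does the actual time-consuming work
flush.add_parents(write_a, write_b, write_c)           # hypothetical multi-parent call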

I'm curious what you guys would suggest for solving cases like this one. We have tried to come up with something that was useful for Hadoop and HBase, and hopefully the wider ecosystem as well. I didn't see a lot of discussion about this in any of the tracing publications and discussions I read-- perhaps I missed it.

best, Colin

dkuebric commented 8 years ago

@cmccabe thanks for the write-up! The group commit is a really interesting use-case, and because I also have not seen much discussion around this, I'd love to hear broader thoughts - particularly about solution 3 above, because that's the approach AppNeta takes for such work.

The reasoning behind picking option 3, which results in what we call "meta-traces" with parent and child trace relationships, is rooted in the desire to do smart aggregate analysis on traces. If a request trace is always a blocking unit of work at the top level, then you can start to mine it for critical path, with the goal of optimizing end-user performance (whether the end-user is a human or a machine doesn't matter). So we wanted a definition of a trace that has a blocking top-level span.

However, there's plenty of workloads that exhibit chained work patterns like queue insertion with a quick ack followed by downstream processing. These are also very important to trace, but can't be modeled using the above definition of a trace. (This type of behavior sounds parallel to the group commit case: something is written to a log, then later processed.)

For that reason, we decided a "meta-trace" is the path which holds the most semantic value: each "stage" of the pipeline/processing-graph can be analyzed as a separate application based on its traces, with its own dependencies, hot spots, etc. But also the entire meta-trace can be reconstructed for end-to-end tracing. This might include a many-to-one join in the case of things that do batch processing (e.g. writes), or a simpler waterfall-and-branching pattern for a lot of data pipelines.
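
A minimal sketch of that linkage (hedged: it assumes nothing beyond set_tag, and the tag name is made up for illustration):

PARENT_TRACE = "parent.trace_context"   # illustrative tag name, not a spec

# A downstream pipeline stage starts its own trace, but records the trace that
# caused it, so the whole "meta-trace" can be reassembled afterwards.
def start_pipeline_stage(tracer, upstream_trace_context):
    stage_span = tracer.start_span("process-batch")    # root of a new trace
    stage_span.set_tag(PARENT_TRACE, upstream_trace_context)
    return stage_span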

yurishkuro commented 8 years ago

@dankosaur we are also considering using a model that sounds very much like your meta-trace, for capturing the relationship between a real-time trace and work it enqueues for later execution. At minimum it requires a small extension to capture a "parent trace ID". Does AppNeta expose a higher-level API for users to instrument their apps to capture these relationships?

cmccabe commented 8 years ago

Thanks for the insight, Dan.

I agree that for the "async work queue" case, you probably want to create multiple HTrace requests which you can then associate back together later. However, this case seems a little different from the "synchronous join" case that motivated us to use multiple parents. After all, in the async case, you are probably going to be focused more on things like queue processing throughput. In the "synchronous join" case, you need to focus on the latency of the work done in the joined part. In the specific example of HBase, if group commit has high latency, all the HBase requests that depend on that particular group commit will also have high latency.

However, it would certainly be possible to model the HBase group commit as a separate top-level request, and associate it back with whatever PUT or etc. HBase request triggered it. I guess we have to think about the advantages and disadvantages of that more, compared to using multiple parents.

We've been trying to figure out the right model to represent things like Hive jobs, where a SQL query is broken down into MapReduce or Spark jobs, which then break down further into executors, and so forth. It does seem like we will end up splitting spans quite a lot, and potentially using foreign keys to knit them back together. In that case, it definitely makes sense. The most basic level of support would be tagging HDFS / HBase spans with the ID of the current MapReduce or Spark job.

best, Colin

dkuebric commented 8 years ago

@yurishkuro yes, though this is something we're actively working on and it's only being used internally so far, so it's not documented externally. Our API is very "flat" and based almost entirely on semantics such that each event (~span) is a bag of key/value pairs. So the way to note one or more parents is simply to add one or more ParentID values to the root of a new trace.

@cmccabe yeah, at risk of complicating this notion, but actually hoping to clarify it, I think there are two classes of use-case for multiple-parent we've seen in discussion so far:

  1. Tracking join of parallel work in a blocking top-level request, which I argue above is a single-trace use-case vs
  2. Tracking join of multiple work-streams which may not be blocking top-level requests, which I argue is a meta-trace use-case.

In 1 we'd be looking at spans with multiple parents; in 2 we'd be looking at traces with multiple parent traces.

I do think 1 becomes quite esoteric for span-based architectures, but is worth capturing if it's not too onerous to support API-wise (don't have a strong feeling on this--it is more important for event-based architectures than span-based ones). 2 is potentially dependent on a discussion about the scope of work to be included in a single trace, which I'm not sure has been discussed yet.

cmccabe commented 8 years ago

Sorry if these are dumb questions, but there are still a lot of things about the "meta-trace" or "meta-span" concept I don't understand. Spans have a natural and elegant nesting property; do meta-spans nest, or do I need a meta-meta-span? Also, if meta-spans are forking and joining, then it seems like we have the same multiple-parent discussion we had with spans all over again, with the same set of possible solutions.

The best argument I have heard for meta-spans is that regular spans don't get sent to the server until the span ends (at least in HTrace), which is impractical if the duration of the span is minutes or hours.

Does it make sense to use terminology like "phase" or "job" rather than "meta-span"? "meta-span" or "meta-trace" seems to define it in terms of what it is not (it's not a span) rather than what it is.

Rather than adding meta-spans, we could also add point events, and have the kicking off of some big job or phase generate one of these point events. And similarly, the end of a big job or phase could be another point event. At least for systems like MapReduce, Spark, etc. we can use job ID to relate spans with system phases.
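
A tiny sketch of the job-ID variant (hedged: set_tag is the only API assumed, and the tag name is illustrative):

JOB_ID_TAG = "job.id"   # hypothetical tag name

# Low-level storage spans carry the enclosing job's ID, so they can be related
# back to the MapReduce/Spark job without having to share a trace with it.
def traced_hdfs_write(tracer, current_job_id, do_write):
    span = tracer.start_span("hdfs.write")
    span.set_tag(JOB_ID_TAG, current_job_id)
    try:
        do_write()
    finally:
        span.finish()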

On the other hand, if we had something like meta-spans, perhaps we could draw high-level diagrams of the system's execution plan. These would look a lot like the execution plan diagrams generated by something like Apache Drill or Apache Spark. It would be informative to put these on the same graph as some spans (although the GUI challenges are formidable.)

Colin

yurishkuro commented 8 years ago

@dankosaur Indeed, a higher level API may not be necessary if spans from another trace can be added as parents. I know people have concerns with String->Any tags, but I would be ok with relaxing String->BasicType restriction (since it won't be enforced at method signature level anyway) for tags in the ext.tags namespace (in lieu of special-purpose API as in #18), so that we could register multiple parents with:

parent_contexts = [span.trace_context for span in parent_spans]
span.set_tag(ext.tags.MULTIPLE_PARENTS, parent_contexts)

@bensigelman ^^^ ???

codefromthecrypt commented 8 years ago

There's a running assumption that each contribution to an RPC is a different span. While popular, this isn't the case in Zipkin. Zipkin puts all sides in the same span, similar to how in HTTP/2 there's a stream identifier used for all request and response frames in that activity.

[ operation A ]  <-- all contributions share a span ID
[ [cs] [sr] [ss] [cr] ]

If zipkin split these into separate spans, it would look like...

[ operation A.client ], [ operation A.server ]  <-- each contribution has a different span ID
[ [cs] [cr] ], [ [sr] [ss] ]

Visually, someone could probably just intuitively see they are related. With a "kind" field (like kind.server, kind.client), you could probably guess with more accuracy that they are indeed the same op.

Am I understanding the "meta-trace" aspect as a resolution to the problem where contributors to the same operation do not share an id (and could, if there was a distinct parent)?

ex.

[ operation A ]
[ operation A.client ], [ operation A.server ]  <-- both add a parent ID of the above operation?
[ [cs] [cr] ], [ [sr] [ss] ]

codefromthecrypt commented 8 years ago

I don't think we need to conflate support of multiple parents with widening the data type of the tag api, particularly this early in the game. For example, what if no api that supports multiple parents actually implements OT? We're stuck with the wide interface. I'd suggest folks encode into a single tag and leave complaints around that as an issue to work on later.

yurishkuro commented 8 years ago

My understanding of the two-spans-per-RPC approach is that the server-side span is a child of the client-side span. The main difference is in the implementation of the join_trace function - a Zipkin v.1 implementation would implement join_trace by creating a span with the same trace_context it reads off the wire, while a "two-spans" tracer would implement join_trace by creating a child trace_context.

That is somewhat orthogonal to the multi-parents issue. Any span can declare an additional parent span to indicate its causal dependency (a "join"). However, in the case of two spans per RPC it would be unexpected for a server-side span to declare more than one parent.

yurishkuro commented 8 years ago

I don't think we need to conflate support of multiple parents with widening the data type of the tag api, particularly this early in the game.

Isn't that what this issue is about - how to record multiple parents? I don't mind if it's done via set_tag or with set_parents(trace_contexts_list), but if we don't offer an API to do it, those existing systems with multi-parent support will have nothing to implement. FWIW, at Uber we're starting work right now to trace relationships from real-time requests to enqueued jobs, which is a multi-parent (meta-trace) use case, and it can be done with Zipkin v.1, mostly with some UI enhancements.

codefromthecrypt commented 8 years ago

I'm more comfortable with set_parents or the like than changing the tag api directly.

codefromthecrypt commented 8 years ago

and to be clear, my original intent was to determine if and how this impacts trace attributes (propagated tags) vs tags (ones sent out of band).

E.g. in both Zipkin and HTrace, the parent is a propagated field: in Zipkin it's X-B3-ParentSpanId, and in HTrace, half of the span id's bytes.

One binding concern is whether "adding a parent" is a user-level function. E.g. in HTrace the first parent is always set. Since parents are complicated, how they are used in practice affects the API.

yurishkuro commented 8 years ago

Arguably, in-band propagated parent-span-id in Zipkin is not necessary, it could've been sent out of band. It sounds like in AppNeta the multiple parent IDs are "tags", not propagated attributes. Does anyone know why exactly Zipkin decided to propagate parent ID?

codefromthecrypt commented 8 years ago

Does anyone know why exactly Zipkin decided to propagate parent ID?

Only a guess, but perhaps it is to ensure out-of-band spans don't need to read-back to figure out their parent id. I'm sure the answer can be discovered.

bhs commented 8 years ago

Getting back to the original subject (which is something I've been interested in since forever ago):

I'm personally most excited about use cases that – at some level – boil down to a shared queue. That certainly encompasses the buffered/flushed writes case as well as the event loop pushback I mentioned further up in the thread. In those cases, how much mileage can we get by giving the queue (or "queue": it may be a mysql server or anything that can go into pushback) a GUID and attaching those guids to spans that interact with them? It's different than marking a parent_id but seems (to me) to make the instrumentation easier to write and the tooling easier to build.
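
For illustration only, the GUID idea could look something like this (assuming just set_tag; the tag name is made up):

import uuid

QUEUE_GUID_TAG = "queue.guid"           # hypothetical tag name
write_buffer_guid = str(uuid.uuid4())   # one GUID per shared queue/buffer/intermediary

# Every span that passes through the queue records the same GUID, so a tracing
# backend can later work out which spans were waiting behind which others,
# without adding any extra parent edges.
def on_enqueue(span):
    span.set_tag(QUEUE_GUID_TAG, write_buffer_guid)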

Thoughts?

(As for MapReduces, etc: I have always had a hard time getting monitoring systems that are built for online, interactive-latency applications to actually work well for offline, non-interactive-latency applications (like MR). The data models can seem so similar, yet the tuning parameters are often totally divergent. Maybe I just didn't try hard enough (or wasn't smart enough, etc, etc)! I do think it's academically interesting and am happy to keep hearing ideas.)

cmccabe commented 8 years ago

I don't think the buffered writes case in HDFS is similar to a queue. A queue typically has events going in and events coming out. The buffered writes case just has a buffer which fills and then gets emptied all at once, which is not the way a queue typically works. The HBase case doesn't even necessarily have ordering between the elements that are being processed in the join, which makes it even less like a queue.

Here are examples of things in HDFS that actually are queues:

We haven't seen a reason to trace these things yet (of course we might in the future). It is fair to say that so far, queues have not been that interesting to us.

Consider the case of an HBase PUT. This is clearly going to require a group commit, and that group commit is going to require an HDFS write and flush. If you create a new request on every "join," you would have to look at three different "HTrace requests" to see why this had high latency.

These things are all logically part of the same PUT request, so why would we split them? And if we did, how would the users get from one request to the next? The GUI tooling understands how to follow parents to children, but not how to look up arbitrary foreign keys. The DAG model of execution is closer to reality than the tree model, so why should we force a tree on things that aren't tree-like?

best, Colin

bhs commented 8 years ago

@cmccabe buffered writes can have, well, queuing problems... the buffer is an "intermediary" between the operations trying to write and the final resting place of the data. I agree that it's not a simple push/pop sort of producer-consumer queue, and I think that's what you're saying.

I'm interested by your comment that "queues have not been that interesting to us." Do you mean that HBase doesn't have queuing problems? And/or that users don't want to understand what's in the queue/intermediary when HBase is in pushback? Bigtable is admittedly a different system than HBase, but that was of great interest to me as a Bigtable user when the tabletserver my process was talking to became unresponsive. Were there tools that reliably helped in such scenarios? Not really. Would I have liked to use one? Absolutely.

Back to your question of why we would "split" the PUT, group commit, and stream flush: logically, I would prefer not to split them... that's what this thread is about, of course.

The DAG model in the abstract is sound. It is less clear in the presence of sampling, though... For instance, if sampling decisions are made at the root of a trace (i.e., when there's no inbound edge, regardless of whether it's a DAG or a tree), how do we expect to understand the history of the other PUTs/etc in our HBase group commit request if they weren't sampled?

So, the other spans involved in the group commit are either all sampled or not-all-sampled. If they're all sampled, the tracing system needs to be able to handle high throughput. If they're not all sampled, the tracing system will not be able to tell a complete story about queuing problems or other slowness involving the group commit.

For a tracing system that can afford to sample all requests, in my mind the presence of unique ids for specific queues opens the door to various useful UI features. If it would be helpful, I could try to describe such features... but IMO just assembling one gigantic DAG trace that includes everything in a batch as well as all of its downstream and upstream (transitive) edges is problematic from both a systems standpoint and a visualization standpoint without additional meta-information about the structure of the system and the various queues/intermediaries.

cmccabe commented 8 years ago

@cmccabe https://github.com/cmccabe buffered writes can have, well, queuing problems... the buffer is an "intermediary" between the operations trying to write and the final resting place of the data. I agree that it's not a simple push/pop sort of producer-consumer queue, and I think that's what you're saying.

Maybe my view of queues is too narrow. But when I think of a queue, I think of a data structure with a well-defined ordering, where I take out exactly the same elements that I put in, not some combination. Queuing also has a strong suggestion that something is going to be processed in an asynchronous fashion (although strictly speaking that isn't always true). None of those things always hold true for the examples we've been discussing, which makes me a little reluctant to use this nomenclature. Do you think "shared work" is a better term than "queuing"?

In particular, I think your solution of foreign keys is the right thing to do for asynchronous deferred work (which is the first thing that pops into my mind when I think of a queue) but I'm not so sure about shared work that is done synchronously.

I'm interested by your comment that "queues have not been that interesting to us." Do you mean that HBase doesn't have queuing problems? And/or that users don't want to understand what's in the queue/intermediary when HBase is in pushback? Bigtable is admittedly a different system than HBase, but that was of great interest to me as a Bigtable user when the tabletserver my process was talking to became unresponsive. Were there tools that reliably helped in such scenarios? Not really. Would I have liked to use one? Absolutely.

I agree that when things get busy, it is interesting to know what else is going on in the system. I (maybe naively?) assumed that we'd do that by looking at the HTrace spans that were going on in the same region or tablet server around the time the "busy-ness" set in. I suppose we could attempt to establish a this-is-blocked-by-that relationship between various requests... perhaps someone could think of cases where this would be useful for HBase? I wonder what advantages this would have over a time-based search?

Back to your question of why we would "split" the PUT, group commit, and stream flush: logically, I would prefer not to split them... that's what this thread is about, of course.

The DAG model in the abstract is sound. It is less clear in the presence of sampling, though... For instance, if sampling decisions are made at the root of a trace (i.e., when there's no inbound edge, regardless of whether it's a DAG or a tree), how do we expect to understand the history of the other PUTs/etc in our HBase group commit request if they weren't sampled?

So, the other spans involved in the group commit are either all sampled or not-all-sampled. If they're all sampled, the tracing system needs to be able to handle high throughput. If they're not all sampled, the tracing system will not be able to tell a complete story about queuing problems or other slowness involving the group commit.

Certainly the group commit, by its very nature, combines together work done by multiple top-level requests. You can make the argument that it is misleading to attach that work to anything less than the full set of requests. But I think in practice, we can agree that it is much more useful to be able to associate the group commit with what triggered it, than to skip that ability. Also, this criticism applies equally to foreign key systems-- if the user can somehow click through from the PUT to the hdfs flush, doesn't that suggest a 1:1 relationship to the user even if one doesn't exist?

For a tracing system that can afford to sample all requests, in my mind the presence of unique ids for specific queues opens the door to various useful UI features. If it would be helpful, I could try to describe such features... but IMO just assembling one gigantic DAG trace that includes everything in a batch as well as all of its downstream and upstream (transitive) edges is problematic from both a systems standpoint and a visualization standpoint without additional meta-information about the structure of the system and the various queues/intermediaries.

If the shared work is "gigantic" that will be a problem in both the multi-parent and foreign key scenarios. Because I assume that you want the shared work to be traced either way (I assume you are not proposing just leaving it out). In that case we need to explore other approaches such as intra-trace sampling or somehow minimizing the number of spans used to describe what's going on.

Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

Colin

bhs commented 8 years ago

Hey Colin,

One final thing about "queue", the word: I don't much care what we call it, I'm just trying to find a word we can use to describe the concept. I guess I've often heard people talk about "queueing problems" in datastore workloads, but whatever term you want to use is fine by me.

Anyway, re your last paragraph:

Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

Yeah, so, I don't really have strong opinions about the "DAG of trees of spans" vs "DAG of spans" question per se. Both could work... I was more interested in avoiding what otherwise seems (?) like an O(N^2) edge proliferation... Again, looking at a fictitious queue:

tail--> [C D E F G H I J K] <--head

If we say that C depends on D, E, ..., K, doesn't D depend on E, F, ..., K? I liked the idea of creating a guid for the flush buffer / queue / whatever-we-want-to-call it because each span would have the single reference to that guid and a tracing system could infer the dependency relationships between the various buffered items.

The unfortunate thing about what I'm proposing is that tracing systems need to be aware of a new sort of construct. But I was hoping it would offer a more "declarative" (for lack of a better word) way to describe what's going on.

cmccabe commented 8 years ago

Hey Colin,

One final thing about "queue", the word: I don't much care what we call it, I'm just trying to find a word we can use to describe the concept. I guess I've often heard people talk about "queueing problems" in datastore workloads, but whatever term you want to use is fine by me.

I'm still unsure whether "a queue" is the right term for the generic concept of shared work we are talking about here. Wikipedia defines a queue as "a particular kind of abstract data type or collection in which the entities in the collection are kept in order and the principal (or only) operations on the collection are the addition of entities to the rear terminal position, known as enqueue, and removal of entities from the front terminal position, known as dequeue." This doesn't seem like a very good description of something like a group commit, where you add a bunch of elements in no particular order and flush them all at once. It's not really a good description of something like an HDFS flush either, where you accumulate N bytes in a buffer and then do a write of all N. It's not like there are processes on both ends pulling individual items from a queue. It's just a buffer that fills, and then the HDFS client empties it all at once, synchronously.

Anyway, re your last paragraph:

Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

Yeah, so, I don't really have strong opinions about the "DAG of trees of spans" vs "DAG of spans" question per se. Both could work... I was more interested in avoiding what otherwise seems (?) like an O(N^2) edge proliferation... Again, looking at a fictitious queue:

tail--> [C D E F G H I J K] <--head

If we say that C depends on D, E, ..., K, doesn't D depend on E, F, ..., K ?

We use multiple parents in HTrace today, in the version that we ship in CDH5.5. It does not cause an O(N^2) edge proliferation. Flush spans have a set of parents which includes every write which had a hand in triggering the flush. I don't see any conceptual reason why the individual writes should depend on one another. One write is clearly not the parent of any other write, since the one didn't initiate the other.

I liked the idea of creating a guid for the flush buffer / queue / whatever-we-want-to-call it because each span would have the single reference to that guid and a tracing system could infer the dependency relationships between the various buffered items.

The unfortunate thing about what I'm proposing is that tracing systems need to be aware of a new sort of construct. But I was hoping it would offer a more "declarative" (for lack of a better word) way to describe what's going on.

Hmm. Maybe we need to get more concrete about the advantages and disadvantages of multiple parents vs. foreign keys.

With multiple parents, a parent span can end at a point in time before a child span. For example, in the case of doing a write which later triggers a flush, the write might finish long before the flush even starts. This makes it impossible to treat spans as a flame graph or traditional stack trace, like you can in a single-parent world. This may make writing a GUI harder since you can't do certain flame-graph-like visualizations.

With foreign keys, we can draw "a dotted line" of some sort between requests. For example, if the write is one request and the flush is another, there might be some sort of dotted line between them in GUI terms. It's a bit unclear how to make this connection, though.

The other question is what the "foreign key" field should actually be. If it is a span ID, then it is easy for a GUI to follow it to the relevant "related request." It also makes more sense to use span IDs for things like HDFS flushes, which have no actual system-level ID. To keep things concrete, let's consider the HDFS flush case. In the "foreign key as span ID" scheme, the flush would switch from having write A, write B, and write C as parents to having all those spans as "foreign keys" (or maybe "related request spans"?). Aside from that, nothing would change.

Whether you choose multiple parents or foreign keys, you still have to somehow deal with the "sampling amplification" issue. That is, if 20 writes on average go into each flush, each flush will be 20x as likely to be traced as any individual write operation. That is, assuming that you really make a strong commitment to ensuring that writes can be traced all the way through the system, which we want to do in HTrace.
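
A quick back-of-the-envelope illustration of that amplification (assuming a flush is traced whenever at least one contributing write is traced):

p = 0.01                     # per-write sampling probability (1%)
n = 20                       # writes feeding each flush
p_flush = 1 - (1 - p) ** n   # ~0.18, close to n*p for small p -- hence the ~20x
print(p_flush)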

Colin

yurishkuro commented 8 years ago

To draw a conclusion on the impact on the API, can we agree on the following API?

span.add_parents(span1,  ...)

The method takes parent spans via varargs, and the spans do not have to belong to the same trace (solving meta-trace issue).

bhs commented 8 years ago

I am sorry to be a stick in the mud, but this still seems suspect to me... For one thing, we probably shouldn't assume we have actual span instances to add as parents. Also, the model described at http://opentracing.io/spec/ describes traces as trees of spans: we can talk about making traces into DAGs of spans, but I would rather we bite off something smaller for now.

One idea would be to aim for something like

span.log_event(ext.CAUSED_BY, payload=span_instance_or_span_id)

... or we could do something similar with set_tag and avert our eyes about the multimap issue (Yuri, I know that you lobbied for the "undefined" semantics so that multiple calls to set_tag may yield a multimap in some impls).

Also happy to schedule a VC about this topic since it's so complex in terms of implications. Or wait for Wednesday, whatever.

yurishkuro commented 8 years ago

Fair point, I am not married to the word "parent"; we can pick a more abstract causality reference, like "starts_after".

I do prefer to provide "causal ancestors" as a list of Spans. This has to be tracer-agnostic syntax: the end user's code doesn't know what a "span id" is, it can only know some serialized format of the span. And if the multi-parent span creation happens in a different process (like a background job caused by an earlier HTTP request), then presumably the parent trace managed to serialize its trace context before scheduling the job, so that the job may de-serialize it into a Span.

Finally, for the method signature, I prefer a dedicated method, since this functionality is actually in the public APIs of some existing tracing systems (HTrace, and possibly TraceView). Delegating this particular feature to a simple key/value of log/setTag methods doesn't feel right, especially due to lack of type enforcement.

btw, set_tag is not really an option since people were adamantly opposed to non-primitive tag values, as a result in Java we can't even pass an Object to setTag.

codefromthecrypt commented 8 years ago

Bogdan mentioned soft-links at the workshop, which would be a way to establish causality between traces, and calculate the time that a message was in-flight.

The tie-in to multiple parents is two-fold.

Firstly, some are raising the concern that we are linking trees, not spans in the same trace. Perhaps that's relevant here.

Secondly, there's concern about encoding. For example, in certain systems, propagated tags have constraints, like being a basic type (sampled flag) or a binary (trace id struct). Also, there are folks who have clearly wanted to keep "tags" contained.. for example, quite a few tracing systems reference these as simple string->string dicts.

I'm not sure this is a blocking concern. For example, in HTrace, the multiple parents are actually a separate field in the span... i.e. they aren't stored in the tags dict. In other words, tracing systems are not required to store everything as tags, and today they certainly don't (e.g. annotations/logs are not tags).

For those who aren't following OpenTracing, it may be more important there. There's currently no intent to extend the data structure to support fields besides tags or logs (annotations). In this case, if someone was using OpenTracing only, they'd need to stuff multiple parents (or soft-links) into something until such a feature was formally supported. This would lead to comma or otherwise joining (if a tag), or stuffing them into logs (which can repeat).

Long story short.. I think the OpenTracing encoding question is relevant to OT as of February 20, but not a blocking concern for tracing systems who have extended their model, are ok to extend their model, or do not see encoding with commas or otherwise an immediate concern.

yurishkuro commented 8 years ago

I think the OpenTracing encoding question is relevant to OT as of February 20, but not a blocking concern for tracing systems who have extended their model, are ok to extend their model, or do not see encoding with commas or otherwise an immediate concern.

@adriancole I don't follow this point. An end user does not have access to a string representation of a span; the best they can do is use the injector to convert the span to a map[string, string] and then do some concatenation of the result. If we propose this as the official way of recording causal ancestors, we're locking every implementation into that clunky format (which may not even be reversible without additional escaping, depending on the encoding). If we propose no official way, people would have to resort to vendor-specific APIs.

Ben's suggestion of using log with a standard msg string at least resolves the encoding problem, since log() accepts any payload, including the Span. But causal ancestors aren't logs, they have no time dimension. Plus, for tracers that do not capture multi-parents properly, span.log('caused_by', other_span) might lead to peculiar side effects.

codefromthecrypt commented 8 years ago

@yurishkuro whoops.. sorry.. I was catching up on email and I mistook this thread for one in the distributed-tracing google group (which is why I said "For those who aren't following OpenTracing..").

codefromthecrypt commented 8 years ago

In other words, my last paragraph wasn't targeted towards the OT stewards, rather towards those who author tracers in general. E.g. they may simply support this feature or not (HTrace does already), as they control their api, model, and everything else. That paragraph isn't relevant to the OT discussion... which makes me think maybe I should just delete the comments as they weren't written for OT debates.

lookfwd commented 8 years ago
span.add_parents(span1,  ...)

Like it

we can talk about making traces into DAGs of spans

The sooner this gets done, the less likely there will be a need for OpenTracing 2.0.

Tracing is a DAG problem. The algorithms on top of trees and DAGs should be of similar complexity. The visualizations would have to change significantly, but what is out there right now is suited mostly to web flows where there's a request/response - which is a minor part of what people need to monitor.

[image: events]

Here's a little example of an engine that matches streams of tweets. A tweet arrived, triggered 1000 subscriptions with their own spans that trigger events on several servers (sharded by subscription id).

Notice that what you see above is just a single slice. We have 4 such slices (sets of servers) for resilience, in an active/passive setup.

What is worthwhile under those conditions is to take the DAG for each tweet from each of the slices and compare it with all the others in terms of latency and correctness.

[Disclaimer: The example is artificial but similar to the one I'm working on]

dkuebric commented 8 years ago

There may be a case for DAGs in some fork/join concurrency models: it gives the tracer more information about blocking events by clarifying joins. If we can live without that, or find a way to add it in later (some sort of "barrier event"), then we don't need multi-parent/multi-precedes IMO.

@lookfwd I think your example is very valid for tracing, but I wouldn't model it in the way you propose. A tweet is a triggering event which kicks off some processing. I'm not sure the same should be said of the subscription at that point, however. I'd argue the subscription is state: it's a user-defined configuration that exists in the system before the tweet event.

If you consider the tweet as the triggering event and the subscriptions as values read and acted on at tweet time, you're back to a tree: the number of subscriptions becomes a high fan-out of the trace, but it's still a tree.
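
A sketch of that tree-shaped modelling, assuming the standard opentracing-python API; the operation names and the match_subscription helper are illustrative only. The tweet span is the root and each subscription evaluation is a child span: high fan-out, but every span has exactly one parent.

    import opentracing

    def handle_tweet(tracer, tweet, subscriptions):
        root = tracer.start_span('process-tweet')
        try:
            for sub in subscriptions:
                # One child span per subscription evaluated against the tweet.
                child = tracer.start_span('evaluate-subscription', child_of=root)
                child.set_tag('subscription.id', sub.id)
                try:
                    match_subscription(tweet, sub)  # illustrative helper
                finally:
                    child.finish()
        finally:
            root.finish()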

(The subscription create/modify may be its own trace-triggering event when it is first created and populated to whatever subsystems store the state.)

It's tempting to want to associate the influence of a user's subscription with the later processing of tweets; however, I don't think a single trace is a good way to do so. If you want to model the subscription as an ongoing event, your "traces" will never end.

bhs commented 8 years ago

@dkuebric said:

If you want to model the subscription as an ongoing event, your "traces" will never end.

Very well said. DAGs are great and everything, but if we truly want to consider the general causal-dependency DAG as "a single trace", that single trace quickly becomes absolutely enormous and consequently intractable from a modeling standpoint.

Another very important consideration is sampling: if the sampling coin-flip (in Dapper-like systems) happens at the "DAG roots" (i.e., spans that have no incoming edge), then the add_parent call is likely ("almost always" with Dapper-style 1/1024 sampling, in fact) to point to a sub-DAG that's not even being recorded up the stack.
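
A toy illustration of that sampling point (the numbers and helper names here are invented): with a head-based coin flip at each root, an add_parent edge between two independently sampled traces is only useful when both flips came up heads.

    import random

    SAMPLE_RATE = 1024  # roughly Dapper-style 1/1024 head-based sampling

    def start_root_trace():
        # The sampling decision is made once, at the root, and inherited by
        # every span underneath it.
        return {'trace_id': random.getrandbits(64),
                'sampled': random.randrange(SAMPLE_RATE) == 0}

    trace_a = start_root_trace()   # recorded with probability ~0.1%
    trace_b = start_root_trace()   # likewise
    # A multi-parent edge from a span in trace_b to one in trace_a is only
    # observable when both traces were sampled: about one time in a million.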

@lookfwd I don't want to sound unsympathetic, and perhaps we will even proceed with this... but what you're advocating has huge implications for sampling (either that or it will be semantically broken in most production tracing systems I'm aware of), and I worry about adding something to OT 1.0 that contemporary tracing systems can only support partially, and for fundamental reasons. Does that make sense?

Or would we consider this an "aspirational feature" that's a bit caveat emptor?

lookfwd commented 8 years ago

@dkuebric There are two types of subscriptions. Some are long-lived, where indeed what you say applies. The second kind is "live" subscriptions that last for however long the user is connected (from seconds to minutes). We have tens of thousands of those per day.

I don't see a problem with never-ending traces, by the way. I think it's valid to query "give me anything you know that happened between @tstart and @tend for that event". I think this is what always happens anyway, with @tend=now.

Frankly, for me there are two separate things:

I think no one would argue with the fact that reality is a DAG and that trees cover only a subset of use cases. These are the facts. This is what I want to have in my distributed trace logs, even if I don't have the tools to visualize those facts or create alarms on them right now.

Now on the visualization part... I think that the common view is this:

[screenshot: typical Zipkin trace timeline view]

which is very valid and is indeed the present state of things. But I believe someone could equally well extract timed sequence diagrams from traces:

[image: timed sequence diagram extracted from a trace]

and in those, DAGs and rich causal relationships can be expressed very well. State, on the contrary, is somewhat complex to express.

No matter what the tools look like right now... no matter what we do with the data... we should collect the facts with a mindset of what really happens.

@bensigelman I think I get what you mean about sampling. Yes, you need to trace everything :) I will take a much closer look at the spec. The truth is that I've spent just a few hours trying to understand whether it is suitable for our use case, so I might well be missing many aspects/constraints!

A little update: if you want to sample per shard, e.g. anything where *ID % 1000 == 0, there shouldn't be a problem with DAGs and sampling.
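
A sketch of that per-shard idea, assuming the shared key is something like a subscription id: because the decision is a deterministic function of the id rather than a per-process coin flip, every slice records the same subset and the resulting DAGs stay comparable.

    def should_sample(subscription_id, rate=1000):
        # Deterministic: the same id yields the same decision on every slice,
        # so cross-slice DAG comparisons see complete data for sampled ids.
        return subscription_id % rate == 0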

bhs commented 8 years ago

@lookfwd this is interesting stuff and I'm happy to hop on a call or VC to discuss in detail: probably more useful than just reading the spec, but up to you.

As for the update at the end of your last message: there are plenty of traces that span logical shards, so I'm not sure what "sampling by shard" means in that context. Also, sampling is done both to reduce the load on the tracing system and on the host process: if the host process is in shard_id % 1000 == 0, then everything therein will be sampled and there can be an observer effect / perf degradation.

bhs commented 8 years ago

I believe that the SpanContext and Reference concepts deal with this issue cleanly. I'll close in a few days if there are no objections.

yurishkuro commented 8 years ago

@bensigelman I prefer to keep this open until we define more exotic SpanReferenceTypes that can actually represent the scenarios mentioned here.

bhs commented 8 years ago

Thinking more about this, the other thing we're missing (in addition to more exotic reference types) is the capacity to add References during the lifetime of a Span, and not just at start time.

Since both of these are backwards-compatible API additions, I'm going to remove the "1.0 Spec" milestone for this issue.

yurishkuro commented 8 years ago

a) So we're going with 1.1, not 2.0? b) Can you give an example where it's necessary to attach a span reference after the span has started?

tinkerware commented 8 years ago

@yurishkuro For me, it would be a group commit operation that does not know the ancestor commits before creating the span. It would need to capture the span contexts from the individual commit operations, put them in a set, then attach each one to the group commit span with a Groups reference type. Without the ability to attach span references later on, the instrumentation gets unwieldy; you have to capture the starting timestamp of the group commit, then start the span with the explicit timestamp. This breaks the typical code flow where you can use a try-with-resources clause or equivalent to surround the instrumented block, more so if you are also trying to capture faults/exceptions.

Another example (one that I'm currently working on) is integrating with an in-process instrumentation library. It's a lot more convenient to be able to attach references after creating a span; I don't have to worry about having all the references from ancestor spans lined up in the instrumentation library at the point where I need to create the descendant span.
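
A sketch of the workaround described above, assuming the Python API and invented helper names (pending_commits, flush): because references can only be supplied at start time, the group-commit code has to capture its own start timestamp, gather the ancestor contexts, and only then start the span; a hypothetical add_reference() on a live span would avoid that dance. follows_from stands in here for the Groups reference type mentioned above.

    import time
    import opentracing

    def group_commit(tracer, pending_commits):
        start = time.time()  # capture the real start time up front
        ancestor_contexts = [commit.span.context for commit in pending_commits]
        span = tracer.start_span(
            'group-commit',
            references=[opentracing.follows_from(ctx) for ctx in ancestor_contexts],
            start_time=start)  # back-date the span to when work actually began
        try:
            flush(pending_commits)  # illustrative helper
        finally:
            span.finish()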

bhs commented 7 years ago

See https://github.com/opentracing/specification/issues/5 to continue this discussion.