opentracing / specification

A place to document (and discuss) the OpenTracing specification. 🛑 This project is DEPRECATED! https://github.com/opentracing/specification/issues/163
http://opentracing.io/spec
Apache License 2.0

non-RPC spans and mapping to multiple parents #5

Open opentracing-importer opened 7 years ago

opentracing-importer commented 7 years ago

Issue by adriancole Sunday Jan 17, 2016 at 01:59 GMT Originally opened as https://github.com/opentracing/opentracing.io/issues/28


One of my goals in working on OpenTracing is to do more with the same amount of work. For example, when issues are solved in OpenTracing and adopted by existing tracers, there's a chance for less Zipkin interop work, fewer integrations, and less maintenance. Zipkin has had a persistent interoperability issue around non-RPC spans. It usually expresses itself as multiple parents, though often also as "don't assume RPC".

In concrete terms, Zipkin V2 has a goal to support multiple parents. This would stop the rather severe signal loss from HTrace and Sleuth to Zipkin, and of course address a more fundamental concern: the inability to express joins and flushes.

In OpenTracing, we call certain things out explicitly, and leave other things implicit. For example, the existence of a span id at all is implicit, except for the side effect that we split the encoded form of context into two parts. We certainly call out features explicitly, like "finish", and of course these depend on implicit functionality, such as harvesting duration from a timer.

Even if we decide to relegate this to an FAQ, I think we should discuss multiple parents, and api impact. For example, are multiple parents tags.. or attributes? Does adding parents impact attributes or identity? Can an HTrace tracer be built from an OpenTracing one without signal loss? Are there any understood "hacks" which allow one to encode a multi-parent span effectively into a single-parent one? Even if we say "it should work", I'd like to get some sort of nod from a widely-used tracer who supports multiple parents.

The practical impact of this is that we can better understand in Zipkin whether this feature remains a zipkin-specific interop story with, for example HTrace, or something we leverage from OpenTracing.

For example, it appears that in AppNeta, adding a parent (or edge) is a user-level task, and doesn't seem to be tied to in-band (i.e., propagated) fields? @dankosaur is that right?

In HTrace, you add events and tags via TraceScope, which manages a single span, which encodes into its id a single primary parent. You can access the "raw" span, and assign multiple parents, but this doesn't change the identity of the span, and so I assume doesn't impact propagation. @cmccabe is that right?

I'm sure there are other multiple-parent tracers out there.. I'd love to hear who's planning to support OpenTracing and how that fits in with multiple parents.

opentracing-importer commented 7 years ago

Comment by bensigelman Sunday Jan 17, 2016 at 05:24 GMT


@adriancole I'm glad you brought this up – an important topic. I am going to dump out some ideas I have about this – no concrete proposals below, just food for thought.

<ramble>

I was careful to make sure that a dapper- or zipkin-like parent_id is not reified at the OpenTracing level... that is a Tracer implementation concern. That said, the Span and TraceContext APIs show a bias for single-parentage traces (if this isn't obvious I can elaborate). OpenTracing docs even describe traces as "trees" rather than "DAGs".

In the current model, multiple parents could be represented as Span tags or – I suppose – as log records, though that latter idea smells wrong. Trace Attributes do not seem like the right fit since parentage relationships are a per-Span rather than per-Trace concern. (On that note: IMO the parent_id should never be a part of the TraceContext as there's no need to send it in-band over the wire... it can just be a Span tag.)

Let me also throw out this other related use-case that I think about often: delays in "big" executor queues, e.g. the main Node.js event loop. If each such executor has a globally unique ID and spans make note of those unique IDs as they pass through the respective queue, a sufficiently dynamic tracing system can explain the root cause of queuing delays (which is an important problem that is usually inscrutable). To be more concrete, suppose the following diagram illustrates the contents of a FIFO executor queue:

    [  C  D  E  F  G  H  I  J  K  L  ]
                                  ^-- next to dequeue and execute

Let's say that the Span that enqueued C ends up being slow because the items ahead of it in this queue were too expensive. In order to truly debug the root cause of that slowness (for C), a tracing system should be talking about items D-L... at least one of them took so long that the wait to get to the front of the executor queue was too long.

So, the big question: is C a parent for D-L? After all, it is blocked on them, right? And if C is a parent, what do we say about the more direct/obvious parents of D-L, whatever they are?

Anyway, this example is just meant to provide a practical / common example of tricky causality and data modeling. There are analogous examples for coalesced writes in storage systems, or any time batching happens, really.
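
If it helps make the ramble concrete, here is a minimal sketch (hypothetical tag name and span API, not a proposal) of how a span might note the shared executor it passed through:

    # Sketch only: every span whose work passes through the shared executor
    # records that executor's GUID, so a tracer that samples everything can
    # later pull up C..L together and explain the queuing delay.
    import uuid

    EVENT_LOOP_GUID = str(uuid.uuid4())   # one stable GUID per executor instance

    def on_enqueue(span):
        span.set_tag("executor.guid", EVENT_LOOP_GUID)   # hypothetical tag name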

</ramble>

opentracing-importer commented 7 years ago

Comment by yurishkuro Sunday Jan 17, 2016 at 16:35 GMT


I think this should be another page on the main website - recipes for handling the scenarios mentioned above, and others we discussed on various issues, like marking a trace as "debug". The goal of OpenTracing is to give instrumenters a standard language to describe the computation graph shape, regardless of the underlying tracing implementation, so we cannot give answers like "this is implementation specific" or "this could be done like this" - the answer needs to be "this is done this way", otherwise instrumenters walk away none the wiser.

Of course, it is also helpful to know the exact use case the user is asking about. For example, it's not clear to me that the queueing/batching @bensigelman describes is a use case for multiple parents. The main answer users want is why it took my span so long to finish. So the investigation could be done in two steps: first, the span logs the time when it was enqueued and when it was dequeued and executed. If the gap is large, it already indicates a delay on the event loop. To investigate the delay, the user can run another query in the system asking for spans that took too long to actually execute once dequeued, and thus delayed everybody else. A very intelligent tracing system may be clever enough to auto-capture the items ahead in the queue based on the global ID of the executor, but we still need a very precise recipe for instrumenters about what exactly they need to capture, regardless of the underlying tracing implementation.
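
A rough sketch of the first step of such a recipe (illustrative helper names only; the precise logging API is exactly what the spec would need to pin down):

    import time

    def enqueue(queue, task, span):
        queue.append((task, span, time.time()))          # remember when we were enqueued

    def dequeue_and_run(queue):
        task, span, enqueued_at = queue.pop(0)
        span.set_tag("queue.wait_seconds", time.time() - enqueued_at)  # large gap => event-loop delay
        task()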

Going back to the multi-parent question, do we understand which scenarios actually require it?

As for capturing multiple parents, I would suggest using span tags for that, i.e. we declare a special ext tag and do

for parent in parent_spans:
    span.set_tag(ext.tags.PARENT, parent.trace_context)

(which is another reason I was lobbying for string->[]any, and it's still possible to do the above based on the current API and semantics).

opentracing-importer commented 7 years ago

Comment by bensigelman Sunday Jan 17, 2016 at 20:59 GMT


@yurishkuro yes, my point was not that the queue is a good fit for multiple-parentage (it's not) but more to redirect the conversation toward motivating scenarios rather than a particular DAG structure. This is very much in line with your suggestion (which I heartily endorse) that we provide sensible, opinionated guidance for best-practice instrumentation of certain archetypical scenarios: RPC boundaries, high-throughput executor queues, coalescing write buffers, whatever.

As for relying on the "intentionally undefined" semantics of multiple calls to set_tag, I would perhaps prefer forcing the multiple parents to be set all at once and just string-joining them into a single tag value. This would be the "making the hard things possible" part of the old "make the easy things easy and the hard things possible" adage (i.e., it's admittedly clumsy):

parent_span_ids = map(lambda s: str(s.trace_context.span_id), parent_spans)
span.set_tag(ext.tags.MULTIPLE_PARENTS, ",".join(parent_span_ids))
opentracing-importer commented 7 years ago

Comment by yurishkuro Sunday Jan 17, 2016 at 21:08 GMT


Why not just pass the array of parent trace contexts?

span.set_tag(ext.tags.PARENTS, [parent.trace_context for parent in parent_spans])

Passing string IDs is not portable, we don't even have trace_context.span_id as a formal requirement in the API.

opentracing-importer commented 7 years ago

Comment by bensigelman Sunday Jan 17, 2016 at 21:14 GMT


I was respecting the BasicType-ness of set_tag's second parameter... I was mainly just illustrating the concatenation/joining. (Coercing a TraceContext into a BasicType or string is a problem regardless of set_tag/add_tag)

opentracing-importer commented 7 years ago

Comment by dkuebric Monday Jan 18, 2016 at 20:26 GMT


Not to beat a dead horse, but I agree that queue depth is not a good use-case for multiple parents. (It's tempting, for instance, to take that on a slippery slope all the way up to OS-level scheduling!) IMO distributed tracing is about understanding a single flow of control across a distributed system--concurrent work may factor into how that request was handled, but tracing should keep the unit of work being traced as the center of reported data, with info in a trace being "relative to" that trace.

The use-cases I see for multiple parents are around asynchronous work done in a blocking context--in the service of handling a single request or unit of work (to distinguish from the event loop case above). It is tempting to say that the join is optional, because someone reading the trace can probably infer a blocking/nonblocking relationship from the trace structure. However, the join is valuable information for those unfamiliar with the system who are reading individual traces, or for a tracer which wants to do more sophisticated analysis on the corpus of traces, because it signals where asynchronous work is actually blocking the critical path in a manner which is less open to interpretation.

Some examples we see commonly in web apps are libcurl's curl_multi_exec (and the many libraries that wrap it), or libraries which are async at the underlying implementation level but actually end up being used synchronously a lot of the time (spymemcached). Instrumenting these to capture both use-cases benefits from being able to distinguish between the two execution patterns.
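
As an illustration of that fan-out/fan-in shape, a hedged sketch with a hypothetical multi-parent API (the parents= argument is not part of OpenTracing; tracer and request_span are assumed to exist):

    # Two HTTP calls issued concurrently inside one blocking request, then a
    # "join" span that names both as parents to mark where the request
    # actually blocked on the asynchronous work.
    fetch_a = tracer.start_span("GET /users", parent=request_span)
    fetch_b = tracer.start_span("GET /orders", parent=request_span)
    # ... both run concurrently; the handler blocks here until both finish ...
    join = tracer.start_span("join", parents=[fetch_a, fetch_b])  # hypothetical API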

In AppNeta's X-Trace implementation, multi-parent is also used to record the ID of a remote (server-side) span event when it replies to the client agent. This is largely because the methodology is based on events instead of spans. For instance, a remote call made with httplib in python would involve 2 events if the remote side is not instrumented (httplib entry, httplib exit), or 4+ if the remote side is instrumented. The httplib exit event would have edges to both the httplib entry and remoteserver exit in that case.

[figure: X-Trace event graph for an instrumented httplib call -- the httplib exit event has edges to both the httplib entry event and the remote server's exit event]

I like the idea of supporting this type of behavior, but it seems less pressing in a span-oriented world. The main argument I can see is an understanding of blocking critical path vs not in analysis of traces. I'm curious: are there other arguments for multi-parent out there? What is this used for in HTrace world?

(Also @bensigelman can you clarify your comment about not serializing parent_id? If the TraceContext is what is serialized going across the wire, shouldn't it hold a previous ID? I am probably missing something obvious here..)

opentracing-importer commented 7 years ago

Comment by bensigelman Monday Jan 18, 2016 at 21:26 GMT


@dankosaur per your question about parent_id: If we're using a span-based model, IMO an RPC is two spans, one on the client and one on the server. The client span's TraceContext is sent over the wire as a trace_id and span_id, and that client span_id becomes the parent_id of the server span. Even if a single span is used to model the RPC, as long as the client logs the span's parent_id there should be no need for the server to log it as well (so, again, no need to include it in-band with the RPC payload). Hope that makes sense... if not I can make a diagram or something.
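
A sketch of that flow (illustrative header names and a hypothetical start_span signature; only the trace_id and span_id travel in-band):

    # Client side: put only trace_id and span_id on the wire.
    headers = {
        "Trace-Id": client_span.trace_context.trace_id,
        "Span-Id":  client_span.trace_context.span_id,
    }
    # Server side: the inbound span_id becomes the server span's parent_id,
    # which is reported out-of-band with the server span rather than re-propagated.
    server_span = tracer.start_span("handle_rpc",
                                    trace_id=headers["Trace-Id"],
                                    parent_id=headers["Span-Id"])  # hypothetical signature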

opentracing-importer commented 7 years ago

Comment by dkuebric Monday Jan 18, 2016 at 21:38 GMT


Thanks, that makes sense--the span_id becomes the parent id of the receiving span. It's the same way in X-Trace.

opentracing-importer commented 7 years ago

Comment by cmccabe Monday Jan 18, 2016 at 22:43 GMT


As Adrian mentioned, in HTrace, we allow trace spans to have multiple parents. They form a directed acyclic graph, not necessarily a tree.

One example of where this was important is the case of writing data to an HDFS DFSOutputStream. The Java stream object contains a buffer. This buffer will be flushed periodically when it gets too big, or when one of the flush calls is made. The call to write() will return quickly if it is just storing something to the buffer.

Another example is in HBase. HBase has a write-ahead log, where it does "group commit." In other words, if HBase gets requests A, B, and C, it does a single write-ahead log write for all of them. The WAL writes can be time-consuming since they involve writing to an HDFS stream, which could be slow for any number of reasons (network, error handling, GC, etc. etc.).

What both of these examples have in common is that they involve two or more requests "feeding into" a single time-consuming operation. I think some people in this thread are referring to this as a "join" since it is an operation that joins several streams of execution (sometimes quite literally, by using Executors or a fork/join threading model).

We had a few different choices here:

1. Arbitrarily assign the "blame" for the flush to a single HTrace request. In the DFSOutputStream, this would mean that we would ignore DFSOutputStream buffer flushes unless the HTrace request had to wait for them. In HBase, what we would do is rather less clear-- the requests being coalesced into a "group WAL commit" don't necessarily have any user-visible ordering, so the choice of which one to "blame" for the group commit would be completely arbitrary from the user's point of view.

In a world where we're using less than 1% sampling, solution #1 would mean that relatively few HDFS flushes would ever be traced. It also means that if two traced writes both contributed to a flush, only one would take the "blame." For HBase, solution #1 would mean that there would be a fair number of requests that would be waiting for the group commit, but have no trace spans to reflect that fact.

Solution #1 is simple to implement. As far as I can tell, most distributed tracing systems took this solution. You can build a reasonable latency outlier analysis system this way, but you lose a lot of information about what actually happened in the system.

2. Denormalize. If two traced writes came in, we could create "separate trees" for the same flush. This solution is superficially attractive, but there are a lot of practical difficulties. Clearly, it increases the number of spans exponentially for each branching point. Since we had this problem at multiple layers of the system, this was not an attractive solution.

3. A more complex data model that had "extra edges" beyond the parent/child relationships we traditionally used. For example, HDFS flushes could become top-level HTrace requests that were somehow associated with other requests (perhaps by some kind of "extra ID"). The problem with this is that your tooling becomes much more complex and project-specific. It's already hard enough to explain the current simple data model to people without making it even more complex and domain-specific. We also have multiple layers at which this problem happens, so it would become harder for even experts to follow a single request all the way through the system.

4. Support multiple parents. This wasn't difficult at the model layer. It made some things more challenging at the GUI layer, but not by much. Our programmatic interface for adding multiple parents is still a bit awkward-- this is something we might want to work on in the future. (A sketch of this option follows below.)
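
For reference, a sketch of what option 4 looks like from instrumentation code (hypothetical API shape and helper names; HTrace's actual interface differs):

    # Each traced write contributes its span as a parent of the single flush
    # span, so the flush is recorded once but linked to every write it served.
    flush_span = tracer.start_span("DFSOutputStream.flush",
                                   parents=[write_a, write_b, write_c])  # hypothetical
    flush_buffer_to_datanodes(buf)    # placeholder for the real flush call
    flush_span.finish()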

I'm curious what you guys would suggest for solving cases like this one. We have tried to come up with something that was useful for Hadoop and HBase, and hopefully the wider ecosystem as well. I didn't see a lot of discussion about this in any of the tracing publications and discussions I read-- perhaps I missed it.

best, Colin


opentracing-importer commented 7 years ago

Comment by dkuebric Monday Jan 18, 2016 at 23:42 GMT


@cmccabe thanks for the write-up! The group commit is a really interesting use-case, and because I also have not seen much discussion around this, I'd love to hear broader thoughts. Particularly about solution 3 you present above, because that's the one AppNeta takes with regard to such work.

The reasoning behind picking option 3, which results in what we call "meta-traces" with parent and child trace relationships, is rooted in the desire to do smart aggregate analysis on traces. If a request trace is always a blocking unit of work at the top level, then you can start to mine it for critical path, the goal being to optimize end-user performance (whether the end-user is a human or a machine doesn't matter). So we wanted a definition of a trace which had a blocking top-level span.

However, there are plenty of workloads that exhibit chained work patterns like queue insertion with a quick ack followed by downstream processing. These are also very important to trace, but can't be modeled using the above definition of a trace. (This type of behavior sounds parallel to the group commit case: something is written to a log, then later processed.)

For that reason, we decided a "meta-trace" is the path which holds the most semantic value: each "stage" of the pipeline/processing-graph can be analyzed as a separate application based on its traces, with its own dependencies, hot spots, etc. But the entire meta-trace can also be reconstructed for end-to-end tracing. This might include a many-to-one join in the case of things that do batch processing (e.g., writes), or a simpler waterfall-and-branching pattern for a lot of data pipelines.

opentracing-importer commented 7 years ago

Comment by yurishkuro Tuesday Jan 19, 2016 at 00:22 GMT


@dankosaur we are also considering using a model that sounds very much like your meta-trace, for capturing the relationship between a real-time trace and the work it enqueues for later execution. At a minimum it requires a small extension: capturing a "parent trace ID". Does AppNeta expose a higher-level API for users to instrument their apps to capture these relationships?

opentracing-importer commented 7 years ago

Comment by cmccabe Tuesday Jan 19, 2016 at 00:30 GMT


Thanks for the insight, Dan.

I agree that for the "async work queue" case, you probably want to create multiple HTrace requests which you can then associate back together later. However, this case seems a little different from the "synchronous join" case that motivated us to use multiple parents. After all, in the async case, you are probably going to be focused more on things like queue processing throughput. In the "synchronous join" case, you need to focus on the latency of the work done in the joined part. In the specific example of HBase, if a group commit has high latency, all the HBase requests that depend on that particular group commit will also have high latency.

However, it would certainly be possible to model the HBase group commit as a separate top-level request, and associate it back with whatever PUT or etc. HBase request triggered it. I guess we have to think about the advantages and disadvantages of that more, compared to using multiple parents.

We've been trying to figure out the right model to represent things like Hive jobs, where a SQL query is broken down into MapReduce or Spark jobs, which then break down further into executors, and so forth. It does seem like we will end up splitting spans quite a lot, and potentially using foreign keys to knit them back together. In that case, it definitely makes sense. The most basic level of support would be tagging HDFS / HBase spans with the ID of the current MapReduce or Spark job.

best, Colin


opentracing-importer commented 7 years ago

Comment by dkuebric Tuesday Jan 19, 2016 at 00:38 GMT


@yurishkuro yes, though this is something we're actively working on and it's only being used internally so far, so it's not documented externally. Our API is very "flat" and based almost entirely on semantics such that each event (~span) is a bag of key/value pairs. So the way to note one or more parents is simply to add one or more ParentID values to the root of a new trace.
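
In that flat model, the root event of a downstream trace might look roughly like this (an illustrative shape only, not AppNeta's actual wire format; the key names and IDs are placeholders):

    root_event = {
        "Label":    "entry",                                 # illustrative keys
        "Layer":    "downstream_worker",
        "ParentID": [parent_a_exit_id, parent_b_exit_id],    # one or more parent event IDs
        # ... any other key/value pairs describing the event ...
    }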

@cmccabe yeah, at the risk of complicating this notion, but actually hoping to clarify it, I think there are two classes of use-case for multiple parents we've seen in discussion so far:

  1. Tracking join of parallel work in a blocking top-level request, which I argue above is a single-trace use-case vs
  2. Tracking join of multiple work-streams which may not be blocking top-level requests, which I argue is a meta-trace use-case.

In 1 we'd be looking at spans with multiple parents; in 2 we'd be looking at traces with multiple parent traces.

I do think 1 becomes quite esoteric for span-based architectures, but is worth capturing if it's not too onerous to support API-wise (don't have a strong feeling on this--it is more important for event-based architectures than span-based ones). 2 is potentially dependent on a discussion about the scope of work to be included in a single trace, which I'm not sure has been discussed yet.

opentracing-importer commented 7 years ago

Comment by cmccabe Tuesday Jan 19, 2016 at 00:58 GMT


Sorry if these are dumb questions, but there are still a lot of things about the "meta-trace" or "meta-span" concept I don't understand. Spans have a natural and elegant nesting property; do meta-spans nest, or do I need a meta-meta-span? Also, if meta-spans are forking and joining, then it seems like we're having the multiple-parent discussion we had with spans all over again, with the same set of possible solutions.

The best argument I have heard for meta-spans is that regular spans don't get sent to the server until the span ends (at least in HTrace), which is impractical if the duration of the span is minutes or hours.

Does it make sense to use terminology like "phase" or "job" rather than "meta-span"? "Meta-span" or "meta-trace" seems to define it in terms of what it is not (it's not a span) rather than what it is.

Rather than adding meta-spans, we could also add point events, and have the kicking off of some big job or phase generate one of these point events. And similarly, the end of a big job or phase could be another point event. At least for systems like MapReduce, Spark, etc. we can use job ID to relate spans with system phases.

On the other hand, if we had something like meta-spans, perhaps we could draw high-level diagrams of the system's execution plan. These would look a lot like the execution plan diagrams generated by something like Apache Drill or Apache Spark. It would be informative to put these on the same graph as some spans (although the GUI challenges are formidable.)

Colin


opentracing-importer commented 7 years ago

Comment by yurishkuro Tuesday Jan 19, 2016 at 03:10 GMT


@dankosaur Indeed, a higher-level API may not be necessary if spans from another trace can be added as parents. I know people have concerns with String->Any tags, but I would be ok with relaxing the String->BasicType restriction (since it won't be enforced at the method-signature level anyway) for tags in the ext.tags namespace (in lieu of a special-purpose API as in #18), so that we could register multiple parents with:

parent_contexts = [span.trace_context for span in parent_spans]
span.set_tag(ext.tags.MULTIPLE_PARENTS, parent_contexts)

@bensigelman ^^^ ???

opentracing-importer commented 7 years ago

Comment by adriancole Tuesday Jan 19, 2016 at 03:23 GMT


There's a running assumption that each contribution to an RPC is a different span. While popular, this isn't the case in Zipkin. Zipkin puts all sides in the same span, similar to how in HTTP/2 there's a stream identifier used for all request and response frames in that activity.

    [        operation A         ]   <-- all contributions share a span ID
    [ [cs]   [sr]   [ss]   [cr]  ]

If zipkin split these into separate spans, it would look like...

    [ operation A.client ],   [ operation A.server ]   <-- each contribution has a different span ID
    [ [cs]   [cr] ],          [ [sr]   [ss] ]

Visually, someone could probably just intuitively see they are related. With a "kind" field (like kind.server, kind.client), you could probably guess with more accuracy that they are indeed the same op.

Am I understanding the "meta-trace" aspect as a resolution to the problem where contributors to the same operation do not share an id (and could, if there was a distinct parent)?

ex.

    [              operation A               ]
    [ operation A.client ],   [ operation A.server ]   <-- both add a parent ID of the above operation?
    [ [cs]   [cr] ],          [ [sr]   [ss] ]

opentracing-importer commented 7 years ago

Comment by adriancole Tuesday Jan 19, 2016 at 03:33 GMT


I don't think we need to conflate support for multiple parents with widening the data type of the tag API, particularly this early in the game. For example, what if no API that supports multiple parents actually implements OT? We'd be stuck with the wide interface. I'd suggest folks encode into a single tag and leave complaints around that as an issue to work on later.

opentracing-importer commented 7 years ago

Comment by yurishkuro Tuesday Jan 19, 2016 at 03:33 GMT


My understanding of the two-spans-per-RPC approach is that the server-side span is a child of the client-side span. The main difference is in the implementation of the join_trace function - a Zipkin v1 implementation would implement join_trace by creating a span with the same trace_context it reads off the wire, while a "two-spans" tracer would implement it by creating a child trace_context.

That is somewhat orthogonal to the multi-parents issue. Any span can declare an additional parent span to indicate a causal dependency (a "join"). However, in the case of two spans per RPC it would be unexpected for a server-side span to declare more than one parent.

opentracing-importer commented 7 years ago

Comment by yurishkuro Tuesday Jan 19, 2016 at 03:54 GMT


> I don't think we need to conflate support of multiple parents with widening the data type of the tag api, particularly this early in the game.

Isn't that what this issue is about: how to record multiple parents? I don't mind if it's done via set_tag or with set_parents(trace_contexts_list), but if we don't offer an API to do it, those existing systems with multi-parent support will have nothing to implement. FWIW, at Uber we're starting work right now to trace relationships from realtime requests to enqueued jobs, which is a multi-parent (meta-trace) use case, and it can be done with Zipkin v1, mostly with some UI enhancements.

opentracing-importer commented 7 years ago

Comment by adriancole Tuesday Jan 19, 2016 at 04:18 GMT


I'm more comfortable with set_parents or the like than changing the tag api directly.

opentracing-importer commented 7 years ago

Comment by adriancole Tuesday Jan 19, 2016 at 04:24 GMT


and to be clear, my original intent was to determine if and how this impacts trace attributes (propagated tags) vs tags (ones sent out of band).

E.g., in both Zipkin and HTrace, the parent is a propagated field: in Zipkin it's X-B3-ParentSpanId, and in HTrace it's half of the span id's bytes.
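
For reference, a Zipkin/B3-style carrier looks roughly like this (header values illustrative); the parent span id travels in-band here even though it could, in principle, be reported out of band instead:

    b3_headers = {
        "X-B3-TraceId":      "463ac35c9f6413ad48485a3953bb6124",
        "X-B3-SpanId":       "a2fb4a1d1a96d312",
        "X-B3-ParentSpanId": "0020000000000001",
        "X-B3-Sampled":      "1",
    }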

One binding concern is whether "adding a parent" is a user-level function. E.g., in HTrace the first parent is always set. Since parents are complicated, how they get used in practice affects the API.

opentracing-importer commented 7 years ago

Comment by yurishkuro Tuesday Jan 19, 2016 at 04:30 GMT


Arguably, the in-band propagated parent span id in Zipkin is not necessary; it could have been sent out of band. It sounds like in AppNeta the multiple parent IDs are "tags", not propagated attributes. Does anyone know why exactly Zipkin decided to propagate the parent ID?

opentracing-importer commented 7 years ago

Comment by adriancole Tuesday Jan 19, 2016 at 04:43 GMT


Does anyone know why exactly Zipkin decided to propagate parent ID?

Only a guess, but perhaps it is to ensure out-of-band spans don't need to read back to figure out their parent id. I'm sure the answer can be discovered.

opentracing-importer commented 7 years ago

Comment by bensigelman Wednesday Jan 20, 2016 at 05:46 GMT


Getting back to the original subject (which is something I've been interested in since forever ago):

I'm personally most excited about use cases that – at some level – boil down to a shared queue. That certainly encompasses the buffered/flushed writes case as well as the event loop pushback I mentioned further up in the thread. In those cases, how much mileage can we get by giving the queue (or "queue": it may be a mysql server or anything that can go into pushback) a GUID and attaching those guids to spans that interact with them? It's different than marking a parent_id but seems (to me) to make the instrumentation easier to write and the tooling easier to build.

Thoughts?

(As for MapReduces, etc: I have always had a hard time getting monitoring systems that are built for online, interactive-latency applications to actually work well for offline, non-interactive-latency applications (like MR). The data models can seem so similar, yet the tuning parameters are often totally divergent. Maybe I just didn't try hard enough (or wasn't smart enough, etc, etc)! I do think it's academically interesting and am happy to keep hearing ideas.)

opentracing-importer commented 7 years ago

Comment by cmccabe Wednesday Jan 20, 2016 at 18:20 GMT


I don't think the buffered writes case in HDFS is similar to a queue. A queue typically has events going in and events coming out. The buffered writes case just has a buffer which fills and then gets emptied all at once, which is not the way a queue typically works. The HBase case doesn't even necessarily have ordering between the elements that are being processed in the join, which makes it even less like a queue.

Here are examples of things in HDFS that actually are queues:

We haven't seen a reason to trace these things yet (of course we might in the future). It is fair to say that so far, queues have not been that interesting to us.

Consider the case of an HBase PUT. This is clearly going to require a group commit, and that group commit is going to require an HDFS write and flush. If you create a new request on every "join," you would have to look at three different "HTrace requests" to see why this had high latency.

These things are all logically part of the same PUT request, so why would we split them? And if we did, how would the users get from one request to the next? The GUI tooling understands how to follow parents to children, but not how to look up arbitrary foreign keys. The DAG model of execution is closer to reality than the tree model, so why should we force a tree on things that aren't tree-like?

best, Colin


opentracing-importer commented 7 years ago

Comment by bensigelman Thursday Jan 21, 2016 at 05:58 GMT


@cmccabe buffered writes can have, well, queuing problems... the buffer is an "intermediary" between the operations trying to write and the final resting place of the data. I agree that it's not a simple push/pop sort of producer-consumer queue, and I think that's what you're saying.

I'm interested by your comment that "queues have not been that interesting to us." Do you mean that HBase doesn't have queuing problems? And/or that users don't want to understand what's in the queue/intermediary when HBase is in pushback? Bigtable is admittedly a different system than HBase, but that was of great interest to me as a Bigtable user when the tabletserver my process was talking to became unresponsive. Were there tools that reliably helped in such scenarios? Not really. Would I have liked to use one? Absolutely.

Back to your question of why we would "split" the PUT, group commit, and stream flush: logically, I would prefer not to split them... that's what this thread is about, of course.

The DAG model in the abstract is sound. It is less clear in the presence of sampling, though... For instance, if sampling decisions are made at the root of a trace (i.e., when there's no inbound edge, regardless of whether it's a DAG or a tree), how do we expect to understand the history of the other PUTs/etc in our HBase group commit request if they weren't sampled?

So, the other spans involved in the group commit are either all sampled or not-all-sampled. If they're all sampled, the tracing system needs to be able to handle high throughput. If they're not all sampled, the tracing system will not be able to tell a complete story about queuing problems or other slowness involving the group commit.

For a tracing system that can afford to sample all requests, in my mind the presence of unique ids for specific queues opens the door to various useful UI features. If it would be helpful, I could try to describe such features... but IMO just assembling one gigantic DAG trace that includes everything in a batch as well as all of its downstream and upstream (transitive) edges is problematic from both a systems standpoint and a visualization standpoint without additional meta-information about the structure of the system and the various queues/intermediaries.

opentracing-importer commented 7 years ago

Comment by cmccabe Thursday Jan 21, 2016 at 07:46 GMT


On Wed, Jan 20, 2016 at 9:58 PM, bhs notifications@github.com wrote:

> @cmccabe https://github.com/cmccabe buffered writes can have, well, queuing problems... the buffer is an "intermediary" between the operations trying to write and the final resting place of the data. I agree that it's not a simple push/pop sort of producer-consumer queue, and I think that's what you're saying.

Maybe my view of queues is too narrow. But when I think of a queue, I think of a data structure with a well-defined ordering, where I take out exactly the same elements that I put in, not some combination. Queuing also has a strong suggestion that something is going to be processed in an asynchronous fashion (although strictly speaking that isn't always true). None of those things always hold true for the examples we've been discussing, which makes me a little reluctant to use this nomenclature. Do you think "shared work" is a better term than "queuing"?

In particular, I think your solution of foreign keys is the right thing to do for asynchronous deferred work (which is the first thing that pops into my mind when I think of a queue) but I'm not so sure about shared work that is done synchronously.

> I'm interested by your comment that "queues have not been that interesting to us." Do you mean that HBase doesn't have queuing problems? And/or that users don't want to understand what's in the queue/intermediary when HBase is in pushback? Bigtable is admittedly a different system than HBase, but that was of great interest to me as a Bigtable user when the tabletserver my process was talking to became unresponsive. Were there tools that reliably helped in such scenarios? Not really. Would I have liked to use one? Absolutely.

I agree that when things get busy, it is interesting to know what else is going on in the system. I (maybe naively?) assumed that we'd do that by looking at the HTrace spans that were going on in the same region or tablet server around the time the "busy-ness" set in. I suppose we could attempt to establish a this-is-blocked-by-that relationship between various requests... perhaps someone could think of cases where this would be useful for HBase? I wonder what advantages this would have over a time-based search?

> Back to your question of why we would "split" the PUT, group commit, and stream flush: logically, I would prefer not to split them... that's what this thread is about, of course.

> The DAG model in the abstract is sound. It is less clear in the presence of sampling, though... For instance, if sampling decisions are made at the root of a trace (i.e., when there's no inbound edge, regardless of whether it's a DAG or a tree), how do we expect to understand the history of the other PUTs/etc in our HBase group commit request if they weren't sampled?

> So, the other spans involved in the group commit are either all sampled or not-all-sampled. If they're all sampled, the tracing system needs to be able to handle high throughput. If they're not all sampled, the tracing system will not be able to tell a complete story about queuing problems or other slowness involving the group commit.

Certainly the group commit, by its very nature, combines together work done by multiple top-level requests. You can make the argument that it is misleading to attach that work to anything less than the full set of requests. But I think in practice, we can agree that it is much more useful to be able to associate the group commit with what triggered it, than to skip that ability. Also, this criticism applies equally to foreign key systems-- if the user can somehow click through from the PUT to the hdfs flush, doesn't that suggest a 1:1 relationship to the user even if one doesn't exist?

> For a tracing system that can afford to sample all requests, in my mind the presence of unique ids for specific queues opens the door to various useful UI features. If it would be helpful, I could try to describe such features... but IMO just assembling one gigantic DAG trace that includes everything in a batch as well as all of its downstream and upstream (transitive) edges is problematic from both a systems standpoint and a visualization standpoint without additional meta-information about the structure of the system and the various queues/intermediaries.

If the shared work is "gigantic" that will be a problem in both the multi-parent and foreign key scenarios. Because I assume that you want the shared work to be traced either way (I assume you are not proposing just leaving it out). In that case we need to explore other approaches such as intra-trace sampling or somehow minimizing the number of spans used to describe what's going on.

Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

Colin


opentracing-importer commented 7 years ago

Comment by bensigelman Friday Jan 22, 2016 at 00:04 GMT


Hey Colin,

One final thing about "queue", the word: I don't much care what we call it, I'm just trying to find a word we can use to describe the concept. I guess I've often heard people talk about "queueing problems" in datastore workloads, but whatever term you want to use is fine by me.

Anyway, re your last paragraph:

> Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

Yeah, so, I don't really have strong opinions about the "DAG of trees of spans" vs "DAG of spans" question per se. Both could work... I was more interested in avoiding what otherwise seems (?) like an O(N^2) edge proliferation... Again, looking at a fictitious queue:

tail--> [C D E F G H I J K] <--head

If we say that C depends on D, E, ..., K, doesn't D depend on E, F, ..., K? I liked the idea of creating a guid for the flush buffer / queue / whatever-we-want-to-call it because each span would have the single reference to that guid and a tracing system could infer the dependency relationships between the various buffered items.

The unfortunate thing about what I'm proposing is that tracing systems need to be aware of a new sort of construct. But I was hoping it would offer a more "declarative" (for lack of a better word) way to describe what's going on.
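
Sketching the "declarative" version with hypothetical tag names (the span variables, BUFFER_GUID, and item_count are assumed): the writers and the flush all reference the same buffer GUID, and no span has to enumerate the others:

    # Each write span notes which buffer/queue it landed in ...
    write_span.set_tag("buffer.guid", BUFFER_GUID)
    # ... and the flush span notes the same GUID (plus, say, how many buffered
    # items it carried); the tracing system joins on the GUID after the fact.
    flush_span.set_tag("buffer.guid", BUFFER_GUID)
    flush_span.set_tag("buffer.items", item_count)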

opentracing-importer commented 7 years ago

Comment by cmccabe Tuesday Jan 26, 2016 at 03:45 GMT


On Thu, Jan 21, 2016 at 4:04 PM, bhs notifications@github.com wrote:

> Hey Colin,

> One final thing about "queue", the word: I don't much care what we call it, I'm just trying to find a word we can use to describe the concept. I guess I've often heard people talk about "queueing problems" in datastore workloads, but whatever term you want to use is fine by me.

I'm still unsure whether "a queue" is the right term for the generic concept of shared work we are talking about here. Wikipedia defines a queue as "a particular kind of abstract data type or collection in which the entities in the collection are kept in order and the principal (or only) operations on the collection are the addition of entities to the rear terminal position, known as enqueue, and removal of entities from the front terminal position, known as dequeue." This doesn't seem like a very good description of something like a group commit, where you add a bunch of elements in no particular order and flush them all at once. It's not really a good description of something like an HDFS flush either, where you accumulate N bytes in a buffer and then do a write of all N. It's not like there are processes on both ends pulling individual items from a queue. It's just a buffer that fills, and then the HDFS client empties it all at once, synchronously.

> Anyway, re your last paragraph:

> Conceptually, having an hdfs flush span that has "foreign keys" to write requests A, B, and C seems very similar to having an hdfs flush span that has parents of A, B, and C. I don't understand why having a DAG of trees of spans is acceptable but just having a DAG of spans is not. Does this simplify the GUI?

> Yeah, so, I don't really have strong opinions about the "DAG of trees of spans" vs "DAG of spans" question per se. Both could work... I was more interested in avoiding what otherwise seems (?) like an O(N^2) edge proliferation... Again, looking at a fictitious queue:

> tail--> [C D E F G H I J K] <--head

> If we say that C depends on D, E, ..., K, doesn't D depend on E, F, ..., K?

We use multiple parents in HTrace today, in the version that we ship in CDH5.5. It does not cause an O(N^2) edge proliferation. Flush spans have a set of parents which includes every write which had a hand in triggering the flush. I don't see any conceptual reason why the individual writes should depend on one another. One write is clearly not the parent of any other write, since the one didn't initiate the other.

> I liked the idea of creating a guid for the flush buffer / queue / whatever-we-want-to-call it because each span would have the single reference to that guid and a tracing system could infer the dependency relationships between the various buffered items.

> The unfortunate thing about what I'm proposing is that tracing systems need to be aware of a new sort of construct. But I was hoping it would offer a more "declarative" (for lack of a better word) way to describe what's going on.

Hmm. Maybe we need to get more concrete about the advantages and disadvantages of multiple parents vs. foreign keys.

With multiple parents, a parent span can end at a point in time before a child span. For example, in the case of doing a write which later triggers a flush, the write might finish long before the flush even starts. This makes it impossible to treat spans as a flame graph or traditional stack trace, like you can in a single-parent world. This may make writing a GUI harder since you can't do certain flame-graph-like visualizations.

With foreign keys, we can draw "a dotted line" of some sort between requests. For example, if the write is one request and the flush is another, there might be some sort of dotted line between them in GUI terms. It's a bit unclear how to make this connection, though.

The other question is what the "foreign key" field should actually be. If it is a span ID, then it is easy for a GUI to follow it to the relevant "related request." It also makes more sense to use span IDs for things like HDFS flushes, which have no actual system-level ID. To keep things concrete, let's consider the HDFS flush case. In the "foreign key as span ID" scheme, the flush would switch from having write A, write B, and write C as parents to having all those spans as "foreign keys" (or maybe "related request spans"?). Aside from that, nothing would change.

Whether you choose multiple parents or foreign keys, you still have to somehow deal with the "sampling amplification" issue. That is, if 20 writes on average go into each flush, each flush will be 20x as likely to be traced as any individual write operation. That is, assuming that you really make a strong commitment to ensuring that writes can be traced all the way through the system, which we want to do in HTrace.
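
To make the amplification concrete (simple arithmetic, not a statement about HTrace's sampler):

    # If each write is independently sampled with probability p and ~20 writes
    # feed each flush, the flush is touched by at least one sampled write with
    # probability 1 - (1 - p)**20, i.e. roughly 20*p for small p.
    p = 0.01
    print(1 - (1 - p) ** 20)   # ~0.18, about 18x the per-write sampling rate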

Colin


opentracing-importer commented 7 years ago

Comment by yurishkuro Friday Feb 19, 2016 at 15:23 GMT


To draw a conclusion on the impact on the API, can we agree on the following API?

span.add_parents(span1,  ...)

The method takes parent spans via varargs, and the spans do not have to belong to the same trace (solving meta-trace issue).
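
Usage would then look something like the following (a sketch of the proposed call, not an existing API; the parent spans may come from different traces, covering the meta-trace case):

    job_span = tracer.start_span("process_enqueued_job")
    job_span.add_parents(request_span_1, request_span_2)   # proposed varargs API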

bhs commented 7 years ago

Sadly the importer script did not get all of the original thread. Oh well. Here's a link:

https://github.com/opentracing/opentracing.io/issues/28

wu-sheng commented 7 years ago

@bhs This issue belongs to the ot-spec 1.1 milestone, which we have already released. I think we should change the milestone or something, and then close milestone 1.1.

bhs commented 7 years ago

@wu-sheng this isn't fully resolved. We support multiple parents at start()-time, but it is not possible to add references to other Spans/SpanContexts midstream in the current API.

wu-sheng commented 7 years ago

@bhs I understand, but this issue page says it belongs to the OpenTracing 1.1 Spec milestone. I think this is not right.

bhs commented 7 years ago

@wu-sheng good point – changed. Sorry for delay.