Persisting incomplete spans

phal0r commented 7 years ago

I have a question/ feature request:

We want to utilize Jaeger as Tracing/Logging framework for processes and web services. It is important for us that a span gets persisted once it's created and not only when it is finished.

Let's assume the following scenario: A process which has some substeps gets executed and runs like 5-10 minutes. The process creates a span with its name and some tags. Also, every substep creates a span. Once finished the span is being persisted with the next client flush. This means for the first span, that it's persisted after 10 minutes. To monitor the execution, it is important for us that the first span is being persisted after creation and updated when finished.

(just to clarify: by process I mean hierarchical steps of work running in the JVM not a os process)

I think this would be a nice feature in general as it would be possible to see long running spans and spans that never finish, cause the applications shuts down, has connectivity problems or other issues during processing that prevent it to send the span to the collector at the end.

What do you think about this?

yurishkuro commented 7 years ago

I think it's doable. The backends (C and ES) can store spans with the same span ID, which comes in handy when accepting Zipkin spans. We'd just need to have a special marker that would tell the backend it's ok to merge start/finish segments of the same span.

The more difficult work would be on the side of instrumentation libraries. To avoid degrading the performance for applications that do close spans quickly (majority of them) we wouldn't want to always flush the start-segment. It could be a timeout parameter in the client libs that says "if my span hasn't finished after the timeout, flush it as a segment".

mabn commented 7 years ago

One good use case for this is a process which processes requests or messages and dies - e.g. gets killed by kernel after exceeding memory limit. In this case it would be very desirable to know which request started but hasn't finished processing as it's probably the culprit, especially if the kill happened several time for the same or similar requests.

But information would need to be pushed to the agent after receiving the message and before dying.

phal0r commented 7 years ago

Thanks for the feedback.

@mabn 100% agreed :)

@yurishkuro I agree that it should not degrade performance. What about the flushInterval of the remote reporters? They already provide a debouncing mechanism and short spans will be merged/overridden in the client before they are flushed to the agent or do you think this will have a performance penalty in the client itself?

Can you make any assumptions on how long the implementation might take or where I could be of any help?

yurishkuro commented 7 years ago

deduping the spans in the reporter buffer is going to be more complicated, I think. The timeout approach I described seems to solve the perf issue.

To implement this in one language client and the backend shouldn't take more a few days. If you want to take a stab at it it would be great, as I don't think we'll have cycles to get to this any time soon.

vprithvi commented 7 years ago

Another concern to keep in mind is the Jaeger datapipeline. Internally, Jaeger collectors emit spans to Kafka, which are consumed by Flink jobs. This change would break assumptions that some of these jobs make, I suggest adding a span tag to mark the span as partial, so that jobs can choose to only work on finished spans.

black-adder commented 7 years ago

@brianwolfe ticket about saving multiple events for a single span ^

phal0r commented 7 years ago

Hey guys,

I was quite busy the last couple of weeks, but tried to wrap my head around this issue. Here are some thoughts on this based on your feedback:

Tracing Clients would get a configuration property reportIncompleteThreshold: int milliseconds
Tracing Clients need a running Task checking if currentTime - spanStartTime > reportIncompleteThreshold periodically
If this is the case, the span needs to be send to the processing queue of the reporter by calling reporter.report(span)
If this feature is not configured, the Tracing Client would not start the Task and just function as it now does
The Thrift Model needs to be enhanced by a new attribute: optional span attribute "completed" -> bool
The Collector needs to be able to INSERT or UPDATE the span -> I have checked this and both Cassandra and Elasticsearch support UPSERT, i.e. the collector can do an update on the span and both don't care if it existed before or not -> For ElasticSearch the Writer would need to use the update command instead of the do command according to the example of the es client (https://olivere.github.io/elastic/) -> maybe suggestions like https://github.com/jaegertracing/jaeger/issues/439 could be incorporated in this change too for the elasticsearch storage (just thinking out loud)

(I overread, that you already mentioned it, damn me)

The UI should visualize the different states complete -> true/false
I assume that there are no changes necessary in the agent (didn't check it, though)

What do you guys think? Am I on the right track? I could start with the node.js client, since I don't know the best practice to handle this with the command queue in the java client. If you could point me in the right direction, I could start there aswell.

yurishkuro commented 7 years ago

That looks like the right plan, but it should really start on the backend, not in the clients.

phal0r commented 6 years ago

@yurishkuro Yeah, I expressed myself unclear. I will start with the backend, of course, but I will test it first with the node.js client, not the java client :)

ringerc commented 6 years ago

This would be immensely desirable for me, as I'm instrumenting PostgreSQL. Long-running transactions are not uncommon, and visibility into what they're doing is one of the most useful things Jaeger could offer.

I deal with multi- day transactions sometimes, performing millions of queries (or just a few big ones) and writing terabytes of data.

One challenge is recording details at a useful level, but that's quite a separate topic. An immense help would be to be able to see what's happening while it's happening not afterwards.

I'm using the C++ agent, but it seems mos of the work is on the backend, right?

vprithvi commented 6 years ago

but it seems mos of the work is on the backend, right?

There is work on the backend to ensure inserts are updated to be upserts, but I'd argue that the majority of work is going to be in the client.

phal0r commented 6 years ago

@ringerc This approach will send the span after a configurable timeout, if it is not finished until then and a second time when finished.

As I am making progressing, I describe would I did so far:

I have implemented the logic into the node client (java client will follow)
I have changed the data model with the completed flag as described above
I have changed the inmemory storage to check for spans with the same spanId in a trace and then replace not append the currently reported span
I have changed the elasticsearch writer to set traceId+spanId as document id so the indexer will automatically replace existing spans with the same spanId an a trace

I have reread @yurishkuro 's comment:

The backends (C and ES) can store spans with the same span ID, which comes in handy when accepting Zipkin spans. We'd just need to have a special marker that would tell the backend it's ok to merge start/finish segments of the same span.

The Cassandra Schema is doing the same: PRIMARY KEY (trace_id, span_id, span_hash) to define the primary key, but wouldn't it break when reporting spans with the same span id in one trace?

Next, I am checking, that Cassandra will work with this approach, too. So, if I am on the wrong way with the elasticsearch approach, let me know. I didn't check the storage integration tests yet, just the normal test suites.

black-adder commented 6 years ago

Currently in cassandra, when you write a span, we hash the entire span to get the span_hash. This means that if the span is different in any way, you will get a different span_hash thus writing two separate entries into cassandra for the same span. We can either fix the hash function that we're using so that the hash is always the same for the span (ignore completed flag, etc.) to get idempotent writes OR we can just write the same span twice with different hashes and when we query it, have the query service reconcile the two spans into one.

phal0r commented 6 years ago

@black-adder Thanks for the insights. What are the reason for the hashing approach? a) Are the generated spanIds of the clients not unique in a trace? b) A Trace can be comprised of zipkin and jaeger spans and therefor spanIds are not unique?

a -> Why not switch to uuid's? b -> Introduce a type property and use it for the primary key to distinguish between jaeger and zipkin spans

I would suggest, that a spanId or a composite key, like suggested in b, should be unique in a trace.

I bring this up, cause I changed the elasticsearch writer to replace spans with the same combination of spanId and traceId. When I get you right, this changes current behaviour. I don't think that hashing is a feasible solution here, since tags and logs can also change between the incomplete and complete incarnation of the span.

In your approach, how would the query service distinguish between a span that has a complete and incomplete incarnation and other spans with the same spanId? Imho that would just move the problem from writer to query service.

black-adder commented 6 years ago

Hi, sorry for the delay. We have to hash because of b since Jaeger accepts zipkin spans in which the server and client spans share the same spanid, they need different hashes to differentiate when we store them.

(I need to look through how the ES is doing writes because at the face of it, it seems like only one version of the span will be written unless ES appends new documents if the key is the same.)

In your approach, how would the query service distinguish between a span that has a complete and incomplete incarnation and other spans with the same spanId? Imho that would just move the problem from writer to query service.

Fair point. Initially I thought we could use the new thrift.completed flag to differentiate the initial span and the finished span. However, if we are to submit multiple intermittent spans, this would of course not work. We could use the duration (assuming duration will always be increasing with subsequent span reports) or add a new tag inside the span to differentiate it. I'm not the biggest fan of this approach.

The main issue we have is that we support writing zipkin spans. This in itself isn't a problem but the issue is that the zipkin "adjustment" happens during query time i.e. query service takes the zipkin server span id and generates a new one so that it fits the jaeger model. We can't do this in the collector because if we are to change the span id, we need to propagate this to all downstream child spans but at the time of writing the downstream child spans, we don't have this information available (unless we shard our collectors by traceid). This means we are stuck with writing zipkin spans with span hash as is.

b -> Introduce a type property and use it for the primary key to distinguish between jaeger and zipkin spans

I think this can work. We can augment the collector's jaeger domain model with a new flag is_zipkin and for these spans, we can use the existing span_hash function to hash the entire span. For jaeger spans, we can use just the trace_id and span_id to hash the span. This doesn't remove the span_hash key from the cassandra index since this would be a breaking change and we'd have to write a migration script for users however it does work around current constraints. (@vprithvi this might be beneficial for you)

My only concern is what happens if somehow the spans arrive at the collector out of order (ie an intermediate span arrives after the finished span). This means that the intermediate span will be the final span to be persisted since jaeger span writes are now idempotent. I don't see this realistically occurring however.

phal0r commented 6 years ago

Hi @black-adder , thanks again for the in-depth answer. To tackle these problems, we implemented the following:

model.Span has a string property Type, which is set in the ToDomainSpan functions to model.ZipkinSpanType and model.JaegerSpanType accordingly, which we introduced as constants
we enhanced model.Span.Hash like so
```
if s.Type == JaegerSpanType {
return enc.Encode(JaegerSpanHash{s.TraceID, s.SpanID})
} else {
return enc.Encode(s)
}
```
where JaegerSpanHash consists of the traceId and spanId, so no changes will be necessary in the cassandra data model.
we made a switch in the elasticsearch writer to set the document id for model.JaegerSpanType, but not for model.ZipkinSpanType

So this makes sense to me. The only thing we are fighting with is, where should we initialize the model.Span.Completed property with true, when no completed flag is present in the incoming span, since this must be the default value.

My only concern is what happens if somehow the spans arrive at the collector out of order (ie an intermediate span arrives after the finished span). This means that the intermediate span will be the final span to be persisted since jaeger span writes are now idempotent.

I share your concern, but I can't think of a way to check for these cases without getting the current state from the persistence layer to decide if it should be dropped or not.

However, if we are to submit multiple intermittent spans, this would of course not work.

Current implementation is done with one incomplete and one final span, so no intermittent spans are sent.

Initially I thought we could use the new thrift.completed flag to differentiate the initial span and the finished span.

You mean, you would differentiate zipkin and jaeger spans by this property to decide how it should be persisted?

vprithvi commented 6 years ago

I think we should address the backend discussion in https://github.com/jaegertracing/jaeger/issues/729

tsipo commented 5 years ago

Hi there, any update on this one? We would like to use opentracing/jaeger for various spans, which duration may vary from sub-second to many hours. The current implementation (send only of finished spans) introduces the following problems:

Having the span visible only once finished is not really an option for long spans.
Long/large spans can easily suffer from packet size (~64K for UDP agent, even the 1MB for HTTPSender can be an issue). If a span size exceeds the packet size, it is dropped by the client.
You also have a chance that the process which created the span may end, either an orderly shutdown or not. While debugging I've noticed that the relevant jaeger client thread is a daemon thread, so the JVM may exit without letting the sender send the span out.

For all of these reasons having the client capable of sending unfinished spans is a much-needed feature! Thank you.

jpkrohling commented 5 years ago

@tsipo would you mind describing your use case? Why do you need to keep the spans unfinished for so long? Wouldn't it be appropriate to have your long-running span to be a "follows-from" span from the originating process?

tsipo commented 5 years ago

@jpkrohling We have a microservice mesh where the root span can be created as a result of an HTTP request (either UI or REST API call) or a scheduled job. Some of these may trigger rather lengthy, heavy computations and/or data access, some of it parallel/asynchronous and some sequential/synchronous, and may involve multithreaded processing and/or gRPC calls to other microservices.

The FollowsFrom reference type is more appropriate for a fire-and-forget use case, IMHO, which is not our case. The root span operation - surely its duration and error state - clearly depends on the child spans, it is NOT fire-and-forget. The root span may also want to add logs, depending on results of child spans.

I could try to bypass it by creating a child trigger span (child of the root span), add to it the relevant tags of the root span, finish it right away (so that to make it visible), then make all of the child spans children of this trigger span instead of the (yet unfinished) root span. It does feel like an artificial placeholder, though.

jpkrohling commented 5 years ago

Thanks for the explanation, I think it does make sense to have incomplete spans persisted in your case.

tsipo commented 5 years ago

@jpkrohling Do you know if there's a timeline for this (and the backend related issues)? Thanks!

jpkrohling commented 5 years ago

There's no timeline for this issue (nor for https://github.com/jaegertracing/jaeger/issues/729). If you'd be open to contributing this feature to Jaeger, let me know and I can show you how to get started.

phxnsharp commented 2 years ago

@yurishkuro Is this closed because it has been fixed, or because it won't be implemented, or something else? Thanks :)

yurishkuro commented 2 years ago

won't be implemented, library is deprecated

jaegertracing / jaeger-client-java

Persisting incomplete spans #231