Add the ability to store/retrieve incomplete/partial spans

vprithvi commented 6 years ago

For the first version of this feature, we shall assume the following

clients can flush and report partial spans; but not span deltas
duration is monotonically increasing
the final span has the longest duration

For e.g., a client can do the following time = 1s Report Span[traceID=1, spanID=2, duration=10s, operationName="someOperation"]

time = 2s Report Span[traceID=1, spanID=2, duration=20s, operationName="someOperation"]

time = 3s Report Span[traceID=1, spanID=2, duration=40s, operationName="someOperation", tags=...]

To support this, the backend would need the ability to resolve merge conflicts on spans. For V1, this means simply selecting the longest span.

On Cassandra, the jaeger-collector uses the model spanhash, guaranteeing that all partial spans are stored: https://github.com/jaegertracing/jaeger/blob/412baf60db217d6f41d7fdca04358069362b29ca/plugin/storage/cassandra/spanstore/dbmodel/converter.go#L45

https://github.com/jaegertracing/jaeger/blob/fcbd21078ab7ed03682964c187452778242d3423/model/span.go#L63-L66

On ElasticSearch, we use the index api, which performs upserts. https://github.com/jaegertracing/jaeger/blob/e52ecffbf69a572593504ea9acc4ff65854a3e9a/plugin/storage/es/spanstore/writer.go#L179-L182

To make matters more interesting, note that Jaeger supports storing Zipkin spans that share the same spanID for client and server spans. Jaeger adjusts these spanIDs during query time as seen here: https://github.com/jaegertracing/jaeger/blob/412baf60db217d6f41d7fdca04358069362b29ca/model/adjuster/span_id_deduper.go#L23-L31

Another point to note is that jaeger-collector makes no guarantee that spans are stored in the order that they are received.

Enhancements

Communicate that jaeger-query is serving incomplete spans to the user. One approach is to do something similar to https://github.com/jaegertracing/jaeger-ui/issues/132. We might have to enhance the Span model to store this information
Allow for jaeger-clients to report that the span it is sending is the final span. This allows us to remove the assumption about monotonically increasing duration.
Add support for span deltas, which might allow for clients to reduce state maintained. (Might be at odds with the previous point)
Enhance Span model to store the lineage of a span (whether it was generated from a Zipkin or Jaeger span), this allows for more robust merging behavior

For more context on use cases and prior discussion see https://github.com/jaegertracing/jaeger-client-java/issues/231

Similar tickets:

yurishkuro commented 6 years ago

I am not sure why we need phase 1-2 approach and restrictions. Could we not just merge all spans into one? Rules would be:

tags and lots are merged as sets
operationName and flags are taken from span with largest duration
start time is the earliest of all
duration can be either the largest or calculated from the latest endTime(i)=startTime(i)+duration(i)

Enhance Span model to store the lineage of a span (whether it was generated from a Zipkin or Jaeger span), this allows for more robust merging behavior

I don't think lineage is really relevant. Zipkin style spans should be deduped using span.kind = client/server distinction.

vprithvi commented 6 years ago

I am not sure why we need phase 1-2 approach and restrictions.

We don't really, if we agree on the merge rules, it may be simpler to implement the rules in one go.

We might still need the restriction that the final span has the longest duration, (or some other method of denoting the last span sent by a client for a specific span id)

de-robat commented 6 years ago

By the proposal made in https://github.com/jaegertracing/jaeger-client-java/issues/231 where you've already referenced this discussion, the last span would be either denoted by a specific flag such as beeing complete or by having a duration at all. I don't know if that's possible but all in between spans would probably still report 0 duration because while still being running the duration information is redundant after all or isn't it? At least that's what i've understood form that discussion.

phal0r commented 6 years ago

I was discussing with @black-adder about the approach described in https://github.com/jaegertracing/jaeger-client-java/issues/231 and want to summarize our findings here so far:

One important requirement is not to change the cassandra schema in order to keep updates of jaeger simple. Therefore I want to split the enhancements necessary in "before persistence" and "after persistence".

To allow for easier distinction of partial and complete spans after persistence, I would suggest to change the assumptions to

a partial span always has a duration of 0
the final span always has a duration > 0

Before persistence

Clients can optionally send an "incomplete" flag to indicate if this span is partial or not
Clients should report a duration of 0 for partial spans and the correct duration for the final span
Thrift SpanModel gets attribute "incomplete"
DomainSpan gets the attributes "incomplete" and "type" (which can be 'jaeger' or 'zipkin')
JsonSpan gets attribute "incomplete", but it will be populated only during incoming spans requests, not when reading from storage

The type attribute for the domain model is necessary to enable cassandra and es writers to distinguish between replacing partial jaeger spans logic and zipkin server and client have same spanId logic. The type will not be persisted.

The writer needs to distinguish between both aforementioned logics, cause UPSERTs will only work with a stable identifier (cassandra primary key and es document id). Current writer logic would not allow for this, because

the es indexer will generate a random document id for every call
the cassandra spanhash does not take into account, that a final span may have additional logs/tags/changes

In my approach we enhanced both writers as follows:

es writer in writeSpan function

if jaegerSpanType == model.JaegerSpanType {
_, err = s.client.Index().Index(indexName).Type(spanType).Id(string(jsonSpan.TraceID) + string(jsonSpan.SpanID)).BodyJson(&elasticSpan).Do(s.ctx)
} else {
_, err = s.client.Index().Index(indexName).Type(spanType).BodyJson(&elasticSpan).Do(s.ctx)
}

The Id(stable identifier) call is necessary to allow for UPSERT, but we don't want to replace zipkin spans with the same spanId, because of client/server spans having the same spanId. Therefore the distinction.

cassandra writer, enhancing DomainSpan.Hash function

//with supporting incomplete spans the hash needs to be unique for a combination
//of spanId and traceId, so the final representation of the same span
//will overwrite the incomplete version in cassandra
if s.Type == JaegerSpanType {
return enc.Encode(JaegerSpanHash{s.TraceID, s.SpanID})
} else {
return enc.Encode(s)
}

The primary key comprising of spanId, traceId and a hashed representation of the spans remains, we just switch to only hashing traceId and spanId for spans received from jaeger clients, thus having a stable identifier regardless of additonal logs/tags or whatsoever.

After persistence

Since we don't want to change the schema, we won't have any additional information if a span is partial or not. We could stick to the assumption, that a partial span always has a duration of 0 and a completed span always has a duration > 0.

-Jaeger-UI could distinguish and mark spans as partial, when duration is 0

In this approach no changes would be necessary in the deduper, since partial spans will always be replaced with the next incoming incarnation of this particular span
The clock adjuster (https://github.com/jaegertracing/jaeger/blob/412baf60db217d6f41d7fdca04358069362b29ca/model/adjuster/clockskew.go#L148) needs to check for spans with a duration of 0 in the following cases

parentSpan is partial and childSpan is complete -> parentDuration < childDuration, therefore no adjustment will be done with the current logic
parentSpan and childSpan are partial -> this would basically lead to a delta parentStartTime - childStartTime and therefore adjusting the child span to the same startTime the parent has
parentSpan is complete and childSpan is partial -> childSpan will be adjusted to parentDuration/2 if the childStartTime is before parentStartTime -> if childStartTime is between parentStartTime and parentEndTime no adjustments will be made

So we probably need to incorporate changes for 2. and 3. -> for 2 we should just check that childStartTime is adjusted to be >= parentStartTime, checking for endTimes makes no sense, since they are not known at this point -> for 3 we should check that childStartTime is between parentStartTime and parentEndTime, but what would we want to adjust it too? Maybe parentDuration/2 is already a good approach?

I can provide a Pull Request with the changes described here (except the ClockAdjuster), so you guys can get a more detailled impression of the changes, I just need to fix one last failing test and can reference it here then :)

yurishkuro commented 6 years ago

Why is it necessary to ensure upsert? I think a simpler solution is to merge spans on read.

phal0r commented 6 years ago

I took this approach, because only the last span is relevant (at least this is my assumption) and it is much friendlier for storage not to keep all intermediate representations. Do you think storage size is not an issue?

yurishkuro commented 6 years ago

Hasn't been in the cases we encountered. But in some cases it's not even an option, it's a requirement to save all partial spans. E.g. we have a system that executes long running (on the order of days) workflows, and it persists the intermediate state in the database. The start and finish events of a span may not even be in the same process.

phal0r commented 6 years ago

Ok, that makes sense. Would have been good to know that before :)

So with this approach, the question is, how to model a final span, right? In https://github.com/jaegertracing/jaeger/pull/728 @black-adder already suggested to use a flag. In my understanding adding a new finalFlag constant like (https://github.com/jaegertracing/jaeger/blob/master/model/span.go#L27) would allow to detect final spans without altering the db schema of cassandra, because it's a uint32 https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/cassandra/spanstore/dbmodel/model.go#L28 That removes the necessity for any other incomplete attribute and aligns quite well with the current data model (at least if i understood the purpose of flags correctly). The clients could easily implement setting the flag in their span.finish() methods.

With a isFinal flag my suggested approach with a duration of 0 to distinguish between final and partial spans wouldn't be necessary aswell.

yurishkuro commented 6 years ago

My concern with using the flags field is in changing its semantic meaning and scope. Right now we're using it merely to store the value of spanContext.flags, i.e. the thing that is propagated over the wire. @phal0r 's proposal would change it to also carry information that is only meaningful for the backend, and meaningless in the propagation format.

I don't have strong objections to using the duration field. It does have a few drawbacks:

it is theoretically possible, on system with low precision clocks (like milliseconds) for the duration to be 0 even on the finished span.
it's a poor man's substitution for a proper field in the Span model, mixing different concerns
it does not allow full ordering of the span fragments

The downsides of introducing a dedicated field on the Span is mostly the implementation churn where we need to add it to the data model, update DB schemas everywhere, etc, I think internally we went with the approach based on duration - @vprithvi can you confirm?

phal0r commented 6 years ago

@yurishkuro @vprithvi I agree with you. As mentioned before, using 0 as indicator is not necessary and all suggestions I made are more or less a workaround for not changing the data model for cassandra.

Actually I do not favor any implementation or implementation detail. I want to help with my suggestions to push the feature of persisting incomplete/running spans. I care for semantical correctness and do understand, that the flags are not meant for that. So how can we go on from here?

The issue I see in the current concept is, that I don't see any way to detect if a span is complete or still running or am I missing something?

yurishkuro commented 6 years ago

My preference is to add a new field to the model, semantically it's the cleanest way.

phal0r commented 6 years ago

Yeah, we now have traversed every property, that is flexible enough for enhancements, but came to the point, that there is no semantically acceptable place for this. So did you guys already put some work internally to this or should we go on here with the discussion?

yurishkuro commented 6 years ago

What we did internally was not aimed at capturing partial spans, so I'd ignore that.

aviranco commented 6 years ago

Faced an issue when long span that is not finished is not writing logs at all until its finished. Is this the issue relevant for this behaviour which is planned to take care of?

mustafaakin commented 6 years ago

The issue is still relevant for our use case for long running jobs, is there any plan for this?

phal0r commented 6 years ago

@mustafaakin Yes, this is relevant for use, too. I had a first running implementation with Jaeger + ES and the node-client, but we upserted the traces, what is in conflict with this approach. So my next simpler iteration will be to introduce a incomplete flag for spans and a changed cassandra schema based on merger logic described in this issue. My rough plan is to finish this until the end of june. I don't know if anybody else is actively working on this.

mustafaakin commented 6 years ago

I would like to help if needed @phal0r

phal0r commented 6 years ago

@mustafaakin thx, I will send you a message, once I engage this issue.

phal0r commented 6 years ago

I have implemented the changes and checked it with cassandra and elasticsearch storage.

koliyo commented 5 years ago

Being able to inspect long running requests before they are complete is a really important functionality for us as well! We have tasks that take anywhere from sub-second up to many hours.

Jaeger/opentracing would be the perfect tool if we could trace progress during the execution of the tasks tasks.

phal0r commented 5 years ago

I agree. I am working on finishing this implementation. There is one more change for the current pull request required and the discussion in https://github.com/jaegertracing/jaeger/issues/1119 needs to be resolved aswell, since this feature changes the cassandra schema.

sobvan commented 5 years ago

My approach to the long-running process traces issue right now is that the span the starts a new trace should indeed be finished shortly after created. New spans for tasks spawned by the long-running process can be a child of the main span. Maybe the UI rendering is not that clean this way, but at least the span is persisted.

Storing pending spans would be, of course, better.

Do you guys need a helping hand with closing this issue?

phal0r commented 5 years ago

Szia @istvanszoboszlai ,

any help is appreciated. We are working in #979 to integrate this feature. There is some discussion going on about the approach and the impacts. @de-robat and me will look for the next time window to move this forward. So, if you want to help, we could discuss this together.

sobvan commented 5 years ago

Hallo @phal0r,

Sorry for the slow response. Yes, I am happy to help, I hope I can. I am not that familiar with cassandra, I am more into elastic. Let me know when we can discuss it further.

MfG :)

blueforesticarus commented 1 year ago

Is it correct there is still no way to do this? The effect seems to be that without workarounds/hacks

things that never finish and
things the program was in the middle of doing when it crashed

are not recorded.

For context, I have a mid size rust program using tracing -> opentelemetry -> jaeger, and seemingly because of this issue an error 30s into startup results in NO logs being recorded (since the program aborts before the spans are de-allocated).

jeme commented 1 year ago

Interesting - Back when I created a performance/trace logging framework for one of our solutions. This was only designed to work for a single application/process - so each node would log their own piece and there would be no correlation between service boundaries - it was also a pain to post process the data because you would often do this when you needed to look at problems, meaning there could be terabytes of data to process, and finally - as you may have already guessed, this was not a proactive model. The the idea that if you started a trace, it should be persisted even if it never finished or if the process crashed was a core principle of the design - so maybe it can be inspirational piece for me to share?

The general idea was that what we know here as Spans does not exist at write time, instead you have "points in time" with associated data, so that is basically just simple line by line logging, nothing special there - except that there was a format to it, so it would look something like:

2023-08-17T10:00:00.3707657  ba632c9c.66e51bca.ba632c9c > { "request": "https://hello.world" }
2023-08-17T10:00:01.3707657  ba632c9c.66e51bca.ba632c9c - { "status": 200 }
2023-08-17T10:00:02.3707657  9406329f.2a6a6130.2a6a6130 > { "request": "https://deep.hello.world"  }
2023-08-17T10:00:03.3707657  9406329f.45561834.2a6a6130 > { "fetch": "SELECT * FROM foo", "source": "SQL1" }
2023-08-17T10:00:04.3707657  9406329f.45561834.2a6a6130 - { "results": 42 }
2023-08-17T10:00:05.3707657  cd177bb2.5996db7f.cd177bb2 > { "request": "https://hello.world" }
2023-08-17T10:00:06.3707657  ba632c9c.66e51bca.ba632c9c < { }
2023-08-17T10:00:07.3707657  cd177bb2.5996db7f.cd177bb2 - { ...extra data... }
2023-08-17T10:00:08.3707657  cd177bb2.5996db7f.cd177bb2 - { ...extra data... }
2023-08-17T10:00:10.3707657  9406329f.45561834.2a6a6130 < { ...extra data... }
2023-08-17T10:00:11.3707657  cd177bb2.5996db7f.cd177bb2 < { ...extra data... }
2023-08-17T10:00:12.3707657  9406329f.2a6a6130.2a6a6130 < { ...extra data... }

Markers:

> was indicative of "STARTING"
- was indicative of "MORE INFO"
< was indicative of "COMPLETED"

In post processing that would be grouped to:

2023-08-17T10:00:00.3707657  ba632c9c.66e51bca.ba632c9c > { "request": "https://hello.world" }
2023-08-17T10:00:01.3707657  ba632c9c.66e51bca.ba632c9c - { "status": 200 }
2023-08-17T10:00:06.3707657  ba632c9c.66e51bca.ba632c9c < { }

2023-08-17T10:00:02.3707657  9406329f.2a6a6130.2a6a6130 > { "request": "https://deep.hello.world"  }
 -- 2023-08-17T10:00:03.3707657  9406329f.45561834.2a6a6130 > { "fetch": "SELECT * FROM foo", "source": "SQL1" }
 -- 2023-08-17T10:00:04.3707657  9406329f.45561834.2a6a6130 - { "results": 42 }
 -- 2023-08-17T10:00:10.3707657  9406329f.45561834.2a6a6130 < { ...extra data... }
2023-08-17T10:00:12.3707657  9406329f.2a6a6130.2a6a6130 < { ...extra data... }

2023-08-17T10:00:05.3707657  cd177bb2.5996db7f.cd177bb2 > { "request": "https://hello.world" }
2023-08-17T10:00:07.3707657  cd177bb2.5996db7f.cd177bb2 - { ...extra data... }
2023-08-17T10:00:08.3707657  cd177bb2.5996db7f.cd177bb2 - { ...extra data... }
2023-08-17T10:00:11.3707657  cd177bb2.5996db7f.cd177bb2 < { ...extra data... }

And then sort translated roughly to:

span: ba632c9c.66e51bca.ba632c9c
 - start: 2023-08-17T10:00:00.3707657
 - duration: 6 seconds
 - data: { ... } 

span: 9406329f.2a6a6130.2a6a6130
 - start: 2023-08-17T10:00:02.3707657
 - duration: 10 seconds
 - data: { ... }
 - childs:
   - span: 9406329f.45561834.2a6a6130
     - start: 2023-08-17T10:00:03.3707657
     - duration: 7 seconds
     - data: { ... }

span: cd177bb2.5996db7f.cd177bb2
 - start: 2023-08-17T10:00:00.3707657
 - duration: 6 seconds
 - data: { ... }

If a process failed midway the completing lines would be missing but we would still have to start and perhaps we would have a number of "children" as well, which can be as deeply nested as one could think of.

Relations was basically build out of: cd177bb2.5996db7f.cd177bb2 -> {correlation}.{id}.{parent} Since that is already part of what Jaeger does, it's just part of the example for understanding it as a different approach, I would not suggest this being changed in Jaeger.

I am not sure if that can provide any ideas/inspiration, but feel free to ask questions. I would love to see spans being persisted even if they never completed.

jaegertracing / jaeger