jaegertracing / jaeger-ui

Web UI for Jaeger
http://jaegertracing.io/
Apache License 2.0
1.15k stars 484 forks source link

Relative times for logs #686

Open nvx opened 3 years ago

nvx commented 3 years ago

Requirement - what kind of business use case are you trying to solve?

Viewing log times in a trace can reveal interesting timing information and is a very handy feature.

Problem - what in Jaeger blocks you from solving the requirement?

When a span starts long into a trace (eg, an hour later) which is common for async processes working from pub-sub queues, the times shown in logs are relative to the start of the root trace. Display options means that precision is lost and only "1h 44m" is displayed for all of the logs, losing the ability to determine timing information between the logs.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Allow logs times to be shown relative to the span rather than relative to the root trace start. This would allow more useful times in some situations. A button to toggle between the two options might be useful as seeing absolute times (from the start of the trace) likely has some merits in some cases as well.

roli25 commented 3 years ago

Can I work on this?

yurishkuro commented 3 years ago

@roli25 got for it

jkowall commented 3 years ago

The use of tracing for long-running transactions or processes is just not a good use case. It causes a lot of issues around sampling and analyzing the data when transactions span hours or even days. I would suggest using logging systems versus tracing systems for this type of use case. There were companies that tried to solve this issue for APM but they generally end up using a lot of aggregation of metrics on specific types of processes. ex: https://www.driven.io/apm-big-data-apps/ | https://www.unraveldata.com/technologies/hadoop/.

In Jaeger for these long transactions, I assume you have very long spans or a lot of "dead air" where a process is waiting for an event to occur. How do you even use that data to optimize the performance when you do not know wait times versus processing times? What about timeouts for things that fail?

nvx commented 3 years ago

In Jaeger for these long transactions, I assume you have very long spans or a lot of "dead air" where a process is waiting for an event to occur. How do you even use that data to optimize the performance when you do not know wait times versus processing times? What about timeouts for things that fail?

Honestly the main use is just filtering down to the specific operation and looking at that individual component rather than the trace as a whole.

It's useful to be able to see what triggered X to occur, so simply breaking the relationship in the trace loses that.

I haven't kept up with the OpenTelemetry spec, but in OpenTracing land there's both a ChildOf relationship, as well as a FollowsFrom relationship. Currently Jaeger treats both the same way for display purposes as far as I can tell but the usecase I'm describing is better described by the FollowsFrom relationship than ChildOf, perhaps an option to display FollowsFrom traces as if they were separate root traces and having a separate view to see the parent that caused them.

Given a trace X with a bunch of child spans, some of which relating with the FollowsFrom relationship, I can see a short bit of info saying "this span triggered a trace X with FollowsFrom" that links off to what otherwise looks like a root trace (that has a backlink) rather than showing all the FollowsFrom child spans combined with the ChildOf spans would solve this and a few other issues with this kind of async workflows.

As for how things appear at the moment, there's usually the root span at the left of the timeline, with a few children spans (which trigger one or more async processes), then some dead air with no spans covering the area then some interesting stuff again at the end. Sometimes one root span might trigger a dozen or more separate async processes, but these are independent of each other so seeing them all together doesn't provide a huge amount of value, the value is knowing what the upstream trigger was, and the relative timings of that operation.

yurishkuro commented 3 years ago

Jaeger UI already supports displaying Span References pointing to other traces, have you tried that? You can just download a JSON of a trace, edit it to point to another trace, and load back into the UI to see how it looks.

To my knowledge, the only effect of using follows-from instead of child-of in Jaeger is that the clock skew adjustment will not be performed on follows-from (although still might, since there is an implication that the 2nd span must at least start after the first, so that adjustment is still valid - you'd need to look at the source code in the backend).