grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

TraceQL: measure durations between multiple spans #2665

dburakovtechsolutions commented 1 year ago

Is your feature request related to a problem? Please describe.

We are working on asynchronous tracing for Robot Operating System (ROS). We have a back-end server that generates commands and a robot that executes them. We can generate traces for this arrangement; an example of such a trace is shown in the screenshots below.

The robot consists of several ROS Nodes that communicate in a Pub/Sub manner. You can think of a Node as an asynchronous worker that receives some data, processes it, and produces new data to be consumed by other workers.

Now we want to answer the question: "How much time does a command spend within ROS?". In the example below, the answer is node_c.start_time - node_a.start_time + node_c.duration = 10.98ms − 5.11ms + 0.189ms = 6.059ms.

The problem is that such a calculation does not seem to be possible in TraceQL.

Describe the solution you'd like

Since spans already contain Start Time and Duration fields, we would like to use them in something like this:

node_a = {name = "Node A"} && node_c = {name = "Node C"} |  node_c.start_time - node_a.start_time + node_c.duration > 10ms 

A similar concept was already described in the TraceQL design proposal document:

A trace sees network latency > 1s when passing between any two services: { parent.service.name != .service.name } | max(parent.duration - duration) > 1s

Describe alternatives you've considered

We can process raw trace data in the way I described above (see the sketch below), but TraceQL seems to be a more general solution.
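
As a rough illustration of that alternative, a minimal Python sketch could look like the following. It assumes OTLP-style span fields (startTimeUnixNano / endTimeUnixNano), and the span names and timestamps are made-up inputs mirroring the numbers in the example above, not real data:

```python
# Minimal sketch of post-processing raw trace data outside of TraceQL.
# Span field names follow OTLP JSON conventions (an assumption); the timestamps
# below are illustrative offsets matching the example, not real clock values.

def time_in_ros_ms(spans, first="Node A", last="Node C"):
    """spans: iterable of dicts with 'name', 'startTimeUnixNano', 'endTimeUnixNano'."""
    starts = {s["name"]: int(s["startTimeUnixNano"]) for s in spans}
    ends = {s["name"]: int(s["endTimeUnixNano"]) for s in spans}
    # node_c.start_time - node_a.start_time + node_c.duration
    # simplifies to node_c.end_time - node_a.start_time
    return (ends[last] - starts[first]) / 1e6

spans = [
    {"name": "Node A", "startTimeUnixNano": 5_110_000, "endTimeUnixNano": 5_200_000},
    {"name": "Node B", "startTimeUnixNano": 7_000_000, "endTimeUnixNano": 9_500_000},
    {"name": "Node C", "startTimeUnixNano": 10_980_000, "endTimeUnixNano": 11_169_000},
]
print(time_in_ros_ms(spans))  # 6.059 (ms), matching the calculation above
```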

Additional context

Example of a trace: trace_example

Example of a trace with span fields expanded: trace_example_expanded

mdisibio commented 1 year ago

Hi, thanks for writing this up; the detail and screenshots are extremely helpful. Timestamp analysis is an interesting area, and there are others with similar needs: determining "uninstrumented time", but also isolating/resolving clock skew between spans. We can talk about short-term and long-term ideas.

Short term - Is it possible to record a parent span that covers all of the work, beginning with Node A and ending with Node C? This is more of a work-around using the current feature set, but if possible it provides a lot of immediate wins: you can search/view this new span's duration with TraceQL, and the metrics-generator provides metrics over time (graph the p50/p99).
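
For illustration only, a rough OpenTelemetry Python sketch of that work-around. The handle_command and run_node_* names are invented, and it assumes a single coordinator can wrap the whole command, which may not exist in an asynchronous pipeline:

```python
# Hypothetical sketch of the short-term work-around: one parent span that covers
# the whole command, so its duration is directly searchable and the
# metrics-generator can track it over time. All names below are invented.
from opentelemetry import trace

tracer = trace.get_tracer("ros_command_tracker")

def handle_command(command):
    with tracer.start_as_current_span("ros_command") as span:
        span.set_attribute("command.id", command.id)  # hypothetical attribute
        run_node_a(command)  # each node still creates its child spans as today
        run_node_b(command)
        run_node_c(command)

# The new span is then queryable, e.g. { name = "ros_command" && duration > 10ms }
```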

Long term - We definitely want to do something in this area and are internally speccing out new language elements. I've included this, but it will take some time. Your query introduces variables, which isn't something on our radar, but it seems to fit well here. However, it would need to work a bit differently: because node_a = {name = "Node A"} can match multiple spans, the next expression would need a way to choose one, perhaps node_a[0].start_time. parent.duration would not work well here because nodes A and C are split, and the current block format would require loading all spans to find the parent (quite inefficient and something we hope to solve). TraceQL can do math between aggregates like min(duration) - max(duration), and if we added a way to address the span start time it could come close, although with a query structure that isn't well tested. Something like ({nodeC} | min(starttime)) - ({nodeA} | min(starttime)) + ({nodeC} | max(duration)) > 10ms. There are a lot of approaches to consider.

dburakovtechsolutions commented 1 year ago

Hello, Martin. Thank you for the response!

Regarding the short-term suggestion: it seems that we are not able to record a common parent span, since the processing is asynchronous.

For example, as soon as Node A finishes its computation, it exits the current scope, consequently closing the span. Node A does not know anything about the existence of any other node and does not know how many messages are on the way to be processed. It is similar to having multiple asynchronous services.
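
To make the constraint concrete, here is a rough sketch of the shape of our pub/sub flow. The publish/process helpers and topic names are placeholders rather than real ROS APIs, and context propagation via message metadata is just one possible way to keep the spans in the same trace:

```python
# Sketch of why no single scope covers the end-to-end work: Node A's span closes
# as soon as it publishes, before Node C has even started. The publish/process
# helpers and topic names are placeholders, not real ROS APIs.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("ros_nodes")

def node_a_callback(msg):
    with tracer.start_as_current_span("Node A"):
        result = process(msg)          # placeholder for Node A's computation
        carrier = {}
        inject(carrier)                # carry trace context in the message metadata
        publish("topic_b", result, metadata=carrier)
    # Node A's span is already closed here; downstream processing hasn't started.

def node_c_callback(msg, metadata):
    ctx = extract(metadata)            # continue the same trace in Node C
    with tracer.start_as_current_span("Node C", context=ctx):
        process(msg)
```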

As for the clock skew issue, in our case all of the nodes are running on the same machine and can use timestamps from the same clock source. However, next steps could include running nodes on multiple machines, which would reintroduce the issue.

For the long-term solution, maybe we can consider trace-level searching.

For example:

"Find all traces, such that they contain spans with the names NodeA and NodeC" | trace.span["NodeC"].start_time - trace.span["NodeA"].start_time + trace.span["NodeC"]. duration > 10ms

Can we help with implementing some of these approaches?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.