grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Support nodes of the message queue middleware type in the service graph #3348

Open fyuan1316 opened 7 months ago

fyuan1316 commented 7 months ago

Is your feature request related to a problem? Please describe.

In the current service graph, I find that the way relationships are represented when services make asynchronous calls via message middleware is not intuitive, and I would prefer to be able to visualize the message middleware nodes themselves.

Describe the solution you'd like

In the scenario where services produce and consume messages through message queue middleware, the final graph only shows the invocation relationship between the producer and the consumer and hides the message middleware node. Visually this is no different from the use case where a client requests a server directly.

I think there may be the following benefits if we show message middleware type nodes:

In conclusion, I would suggest displaying the message middleware directly on the service graph. As far as the technical implementation is concerned, an approach similar to the way DB requests are currently handled seems like it would help us achieve this.

What do you guys think?


(To give an example with rabbitmq, below is a graph of the current state vs. the state after the requirement is implemented.)

Now

[image: current service graph, producer and consumer connected directly]

Then

[image: proposed service graph with a rabbitmq node between producer and consumer]

Describe alternatives you've considered

Additional context

joe-elliott commented 7 months ago

Currently I believe we add the database name to the server label and set the connection_type label equal to database. In your example we set the client/server names to the producer/consumer and set the connection type to messaging_system.

Do you think we should split the metric in two? Add a new label?

fyuan1316 commented 7 months ago

@joe-elliott Thanks for the quick reply!

> Currently I believe we add the database name to the server label and set the connection_type label equal to database. In your example we set the client/server names to the producer/consumer and set the connection type to messaging_system.

Yes, my idea was to split the raw metric in two, but with the server name taken from the messaging.system span attribute (rabbitmq in this case). The two metrics would represent producer->rabbitmq and consumer->rabbitmq.

Now

| client | server | connection_type |
| --- | --- | --- |
| otel-demo-consumer | otel-demo-provider | messaging_system |

Then

| client | server | connection_type | desc |
| --- | --- | --- | --- |
| otel-demo-consumer | rabbitmq | messaging_system | consumer reads message (considering that this is a control flow and not a data flow, we should set the consumer to be the client) |
| otel-demo-provider | rabbitmq | messaging_system | producer writes message |
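For illustration, a minimal sketch of the split described in the table above, assuming the intermediate node name is read from the messaging.system span attribute; the struct and function names are hypothetical, not Tempo's actual processor code:

```go
package main

import "fmt"

// edge is a simplified stand-in for a service graph edge; the field and
// function names here are hypothetical, not Tempo's actual processor code.
type edge struct {
	Client         string
	Server         string
	ConnectionType string
}

// splitMessagingEdge turns one producer/consumer edge into two edges that meet
// at the messaging system, using the span attribute messaging.system
// (e.g. "rabbitmq") as the name of the intermediate server node.
func splitMessagingEdge(e edge, messagingSystem string) []edge {
	if e.ConnectionType != "messaging_system" || messagingSystem == "" {
		return []edge{e}
	}
	return []edge{
		{Client: e.Client, Server: messagingSystem, ConnectionType: "messaging_system"}, // consumer reads message
		{Client: e.Server, Server: messagingSystem, ConnectionType: "messaging_system"}, // producer writes message
	}
}

func main() {
	now := edge{Client: "otel-demo-consumer", Server: "otel-demo-provider", ConnectionType: "messaging_system"}
	for _, e := range splitMessagingEdge(now, "rabbitmq") {
		fmt.Printf("client=%s server=%s connection_type=%s\n", e.Client, e.Server, e.ConnectionType)
	}
}
```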
joe-elliott commented 7 months ago

I like this idea a lot.

@grafana/observability-traces-and-profiling will the service graph have any issues with the proposed change above? Is there any special logic using connection_type=messaging_system that we need to be aware of before making this change?

aocenas commented 7 months ago

It does not seem we use connection_type in any way (though maybe it would be nice to use it somehow). So if you have 2 metrics instead of one, we will create 1 more node in between. We basically just treat each client-server combination as an edge in the graph, creating nodes at its ends, and that is it.
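As a rough sketch of that behaviour (names are illustrative only, not Grafana's actual node graph code), splitting one metric into two client/server pairs is enough to make a third node appear in between:

```go
package main

import "fmt"

// buildGraph derives a node graph purely from client/server pairs, as
// described above: every unique service name becomes a node and every
// client-server combination becomes an edge. Illustrative only.
func buildGraph(pairs [][2]string) (nodes map[string]struct{}, edges [][2]string) {
	nodes = map[string]struct{}{}
	for _, p := range pairs {
		nodes[p[0]] = struct{}{}
		nodes[p[1]] = struct{}{}
		edges = append(edges, p)
	}
	return nodes, edges
}

func main() {
	// Splitting one metric into two client/server pairs yields a third node
	// (rabbitmq) in between, without any special handling of connection_type.
	nodes, edges := buildGraph([][2]string{
		{"otel-demo-consumer", "rabbitmq"},
		{"otel-demo-provider", "rabbitmq"},
	})
	fmt.Printf("%d nodes, %d edges\n", len(nodes), len(edges))
}
```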

joe-elliott commented 7 months ago

Thx @aocenas!

In that case we can split the metric to create a queue node. @fyuan1316 how do you think we should split the duration histogram?

fyuan1316 commented 7 months ago

Thanks @joe-elliott !

I'm glad you like the proposal and are helping to confirm it and move it forward. Yes, I lean toward splitting it in two to represent the latency of production and consumption.

joe-elliott commented 7 months ago

I was mistaken about metrics generation when I asked that question. My question is this: when we create the service graph metrics we create two histograms, one for the client-side duration and one for the server-side duration. If we split this metric in two to create a "fake" queue node, then we'll be missing the client- and server-side durations for the queue.

What values should we set these to? 0?

fyuan1316 commented 7 months ago

Sorry, I may have misunderstood. Going by the database implementation, it looks like it sets ServerLatencySec to the same value as ClientLatencySec. Do you think this is ok?
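For clarity, a small sketch of that approach as I read it; the type below is a simplified stand-in, not Tempo's actual edge type:

```go
package main

import "fmt"

// edgeLatencies is a simplified stand-in for the latency fields on a service
// graph edge; it is not Tempo's actual edge type.
type edgeLatencies struct {
	ClientLatencySec float64
	ServerLatencySec float64
}

// newQueueEdgeLatencies mirrors the database handling referenced above:
// reuse the client-side latency as the server-side latency so that both
// histograms get a value, rather than recording zero.
func newQueueEdgeLatencies(clientLatencySec float64) edgeLatencies {
	return edgeLatencies{
		ClientLatencySec: clientLatencySec,
		ServerLatencySec: clientLatencySec,
	}
}

func main() {
	fmt.Printf("%+v\n", newQueueEdgeLatencies(0.042))
}
```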

joe-elliott commented 5 months ago

Apologies that I've let this sit for so long. I've had it open as a tab for a month now trying to get back to it.

Currently the Grafana service graph looks for the server side service graph histogram to create a connection like:

A -> B

If we make a second server side histogram to make Grafana render:

A -> queue -> B

what values do we place in it?

Having thought about this over the past month :) I'm actually leaning toward asking Grafana to just render this middle node without Tempo changes. They have all the information necessary to do so, and it would keep the metrics-generator series count down.

Related issues since we've been chatting about this: https://github.com/grafana/tempo/issues/3461 https://github.com/grafana/tempo/pull/3453

fyuan1316 commented 5 months ago

Thanks for your reply! I agree with your approach. It's a great idea to use existing information to render the middle node of the asynchronous call relationship. I also noticed that another proposal (#3453) provides a more complete solution for asynchronous latency in messaging systems. The handling of metrics might be better placed there.

joe-elliott commented 5 months ago

Sounds good! Would you like to create a Grafana issue and reference this one? I would be happy to chime in with support.

fyuan1316 commented 5 months ago

Great! I have created a Grafana issue and referenced this one. Thank you for your support!

kvrhdn commented 5 months ago

Hi, just wanted to share some additional details / make sure we are aligned.

When we track edges between services we try to determine the connection type based on the span kind and the attributes that are present. This is added to the metric as a connection_type label. Possible values are empty (the default), messaging_system, database and virtual_node.
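A minimal sketch of that kind of decision, assuming only the span kind and one well-known semantic-convention attribute are consulted; this does not mirror Tempo's exact rules, and virtual_node is omitted for brevity:

```go
package main

import "fmt"

// connectionType picks a connection_type label value from the span kind and
// the attributes that are present. Illustrative only; the real rules in the
// metrics-generator may differ.
func connectionType(spanKind string, attrs map[string]string) string {
	switch {
	case spanKind == "SPAN_KIND_PRODUCER" || spanKind == "SPAN_KIND_CONSUMER":
		return "messaging_system"
	case attrs["db.system"] != "":
		return "database"
	default:
		return "" // the default: a plain client/server connection
	}
}

func main() {
	fmt.Println(connectionType("SPAN_KIND_PRODUCER", nil))
	fmt.Println(connectionType("SPAN_KIND_CLIENT", map[string]string{"db.system": "postgresql"}))
}
```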

As an example, in the following topology service A sends a request to B through a messaging system Queue:

A ------> [ Queue ] ------> B

Tempo will create metrics with the labels client="A", server="B" and connection_type="messaging_system".

(We don't have any labels that contain the name or ID of the messaging system. In hindsight this seems like a huge oversight.)

Luckily we already capture the latencies (1) and (2) in separate metrics:

A ------> [ Queue ] ------> B
^---(1)---^       ^---(2)---^

1) traces_service_graph_request_client_seconds = latency of requests between A and Queue
2) traces_service_graph_request_server_seconds = latency of requests between Queue and B
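For example, the two latencies can already be queried separately. Here is a sketch using the Prometheus Go client, assuming a Prometheus instance scraping the metrics-generator is reachable at a placeholder address and that the label values are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumption: a Prometheus instance scraping the metrics-generator is
	// reachable at this placeholder address.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	queries := map[string]string{
		// (1) latency between A and the queue (client side of the edge)
		"A -> Queue": `histogram_quantile(0.95, sum(rate(traces_service_graph_request_client_seconds_bucket{client="A", server="B", connection_type="messaging_system"}[5m])) by (le))`,
		// (2) latency between the queue and B (server side of the edge)
		"Queue -> B": `histogram_quantile(0.95, sum(rate(traces_service_graph_request_server_seconds_bucket{client="A", server="B", connection_type="messaging_system"}[5m])) by (le))`,
	}

	for name, q := range queries {
		result, _, err := promAPI.Query(context.Background(), q, time.Now())
		if err != nil {
			panic(err)
		}
		fmt.Println(name, result)
	}
}
```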

As an aside, https://github.com/grafana/tempo/pull/3453 is introducing a new metric traces_service_graph_request_messaging_system_seconds that measures the processing latency within the messaging system (3):

A ------> [ Queue ] ------> B
          ^--(3)--^

Current situation

The best Grafana can do with the current data is to mark a connection between two nodes as an async messaging system. When rendering the nodes it would check connection_type; if it's messaging_system, it would replace the --> with something that has a block on it, indicating there is a messaging system in between.

A -----> B                becomes                A ------> [ <Messaging system> ] ------> B

It can separately show the latencies between each node and the messaging system but the count will be shared.

Going forward: adding the name

As @fyuan1316 mentioned here (https://github.com/grafana/tempo/issues/3348#issuecomment-1916999679) we really should include the name of the messaging system in the node graph. We can grab this from the attribute messaging.system (part of the semconv).

I'd suggest adding a label messaging_system="<value of messaging.system>" to the existing metrics. Grafana can then use this value (if present) inside the box.

This is backwards compatible: if the value is missing Grafana can default to a placeholder name.
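A sketch of what that could look like on the Tempo side, assuming the label is only added when the messaging.system span attribute is set; the function and label handling here are simplified and hypothetical:

```go
package main

import "fmt"

// seriesLabels copies the span attribute messaging.system onto the service
// graph series as a messaging_system label when it is present. Simplified and
// hypothetical; this is not Tempo's actual metrics-generator code.
func seriesLabels(client, server string, spanAttrs map[string]string) map[string]string {
	labels := map[string]string{
		"client":          client,
		"server":          server,
		"connection_type": "messaging_system",
	}
	if sys, ok := spanAttrs["messaging.system"]; ok && sys != "" {
		labels["messaging_system"] = sys
	}
	// If the attribute is absent the label is simply omitted, and a consumer
	// such as Grafana can fall back to a placeholder name for the box.
	return labels
}

func main() {
	fmt.Println(seriesLabels("A", "B", map[string]string{"messaging.system": "rabbitmq"}))
	fmt.Println(seriesLabels("A", "B", nil))
}
```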

fyuan1316 commented 5 months ago

Based on the current implementation, I agree with @kvrhdn that this is the best Grafana could do. Adding a label identifying the specific messaging system implementation would be even better. What do you think, @joe-elliott?

aocenas commented 5 months ago

I agree with @kvrhdn that marking edges in some way to show the request went through a messaging_system makes sense to me. Just want to make sure this is the final version though: I am not sure whether the "Going forward: adding the name" section should aim at actually producing the result that the issue originally requested.

I feel the original request is doable with changes, but I'm not sure it's actually beneficial. If I understand it correctly it would do something like this:

[image: tempo_queue]

But the problem IMHO is that we would have to merge edges and lose some granularity of the data, i.e. C1-S1 and C1-S2 would have to be aggregated into C1-queue, and on the other side C1-S1 and C2-S1 would have to be aggregated into queue-S1. I feel like such a transformation would not actually help us better understand the latencies or errors in the system, as we would lose the per-client-server-pair granularity of the data.
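To make the granularity concern concrete, here is a small sketch with hypothetical request counts showing what merging edges onto a queue node does to the per-pair breakdown:

```go
package main

import "fmt"

func main() {
	// Hypothetical per-pair request counts before merging.
	perPair := map[[2]string]int{
		{"C1", "S1"}: 10,
		{"C1", "S2"}: 5,
		{"C2", "S1"}: 7,
	}

	// Merging onto a queue node: every server becomes "queue" on the client
	// side and every client becomes "queue" on the server side, so the
	// per-client-server breakdown can no longer be recovered.
	clientToQueue := map[string]int{}
	queueToServer := map[string]int{}
	for pair, n := range perPair {
		clientToQueue[pair[0]] += n
		queueToServer[pair[1]] += n
	}

	fmt.Println("client -> queue:", clientToQueue) // C1: 15, C2: 7
	fmt.Println("queue -> server:", queueToServer) // S1: 17, S2: 5
}
```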

kvrhdn commented 5 months ago

I think a small thing we could do with A is to modify the arrow depending on the connection type, for example by making it a dashed line or giving it a colour. If we do this we have to consider that we can have multiple arrows between two nodes (both direct and through a queue).

I agree B would drop some information by collapsing arrows together. Maybe we can use interactions to bring out that info again? Say if you click on a node it highlights the nodes that it connects to and only shows the related latencies. If for example C1 sends requests to S1, clicking on C1 would highlight C1, queue and S1. If you hover over the queue -> S1 arrow it would show latencies related to C1 -> S1.

fyuan1316 commented 5 months ago

One small question: do we want the edge to represent the direction of the instruction flow or the data flow? If it represents the instruction flow, I think the consumer may need to be the initiator.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.