fyuan1316 opened this issue 9 months ago (status: Open)
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I think this is a very meaningful enhancement.
@mapno, what do you think?
Agree this would be a nice addition. It's been discussed in the past, but never implemented.
This requires changes in the connector and in the visualisations (AFAIK, Grafana). For the connector, an attribute that contains the messaging system's name or other identifier could be easily added. The OTel spec has semantic conventions for these attributes. Then, it's a matter of representing this new node in the Graph in Grafana.
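For reference, here is a rough illustration of the messaging attributes those semantic conventions define on producer/consumer spans (the attribute names are from the semconv; the values and the YAML shape are only illustrative, not a connector config):

```yaml
# Illustrative only: attributes a producer or consumer span would typically carry.
span_attributes:
  messaging.system: rabbitmq            # identifies the broker technology
  messaging.destination.name: orders    # the queue/topic the message was sent to
  messaging.operation: publish          # "publish" on the producer, "receive"/"process" on the consumer
```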
I have been testing with JMS messaging between two services and am facing the same issue.
There is a slightly complicating factor: https://opentelemetry.io/docs/languages/java/automatic/configuration/#capturing-consumer-message-receive-telemetry-in-messaging-instrumentations
So both the producer and the consumer have their own trace; there is no parent-child relation. At first, all the spans for this were dropped.
Then I changed the config to use virtual nodes:
```yaml
connectors:
  servicegraph:
    store:
      ttl: 10s
      max_items: 10000
    virtual_node_peer_attributes: [messaging.destination.name, db.name, net.sock.peer.addr, net.peer.name, rpc.service, net.sock.peer.name, net.peer.name, http.url, http.target]
```
messaging.destination.name is an attribute that can be used in the same way as db.name etc.: https://opentelemetry.io/docs/specs/semconv/messaging/messaging-spans/#messaging-attributes
That works for the producer side. But virtual nodes are not implemented for the consumer side; otherwise the service graph would have looked as described, with a virtual node carrying the queue name in between.
With JMS you can implement fire-and-forget. Then it has the same problem as a database: the corresponding span will never come. If the message is picked up later (or never), the corresponding 'child' span will never arrive. In the code there is an exception for databases, since it is foreseen that a database does not create a span: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/b01fd364d01962e666dc347eb13421053ea93bac/connector/servicegraphconnector/connector.go#L290 For messaging systems the situation is the same, so it feels a bit odd that only databases have this special place.
I think the situation of a database and a messaging system can be quite similar.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
I agree this can be a good addition; I can see what can be done. Please assign.
@fyuan1316 I've been following related issues and reading your feedback on the additional context linked. As I understand it, this is where we're at currently:
- messaging.system, or any other messaging.* dimensions, can already be added as labels with client_ / server_ prefixes (see the config sketch below)
- connection_type will be set as messaging_system
However, none of that will create a new node in the service graph as you initially suggested, and I see the conversation here shifted away from this component for that feature.
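A minimal sketch of that existing behaviour, assuming the connector's dimensions option (the chosen attributes are only an example):

```yaml
# Sketch: expose messaging.* span attributes as extra labels on the service graph
# metrics. The labels get client_/server_ prefixes depending on which side of the
# edge the attribute was found on; no new node is added to the graph.
connectors:
  servicegraph:
    dimensions: [messaging.system, messaging.destination.name]
```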
Are there any other metric generation suggestions that should be taken into consideration, here or there?
Hi @t00mas,
This PR, https://github.com/grafana/tempo/pull/3453, introduces a new metric, traces_service_graph_request_messaging_system_seconds, for messaging systems, and it has recently been merged. I wonder if this will help?
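If it helps, here is a rough sketch of the series such a histogram would produce, assuming it is labelled per client/server pair like the connector's other histograms (that assumption and the label values are mine, not confirmed by the PR):

```
traces_service_graph_request_messaging_system_seconds_bucket{client="producer-svc", server="consumer-svc", le="0.1"}
traces_service_graph_request_messaging_system_seconds_sum{client="producer-svc", server="consumer-svc"}
traces_service_graph_request_messaging_system_seconds_count{client="producer-svc", server="consumer-svc"}
```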
I think this is doable, let me see if I can draft a PR for this to sit behind a feature gate.
Component(s)
connector/servicegraph
Is your feature request related to a problem? Please describe.
In the current implementation of the servicegraph connector, I found that when requests go through messaging middleware, the generated metrics ignore the middleware (the client and server labels in the metrics only identify the producer and consumer services). This isn't an intuitive representation, and I'd prefer that the generated metrics support visualizing the messaging middleware as nodes in the graph.
Describe the solution you'd like
In the scenario where services produce and consume through message queue middleware, the final graph only shows the invocation relationship between the producer and the consumer; it hides the middleware node. Visually, this is no different from the case where a client calls a server directly.
I think there would be several benefits to showing messaging middleware nodes.
In conclusion, I would suggest displaying the message middleware directly on the service graph; as far as the technical implementation is concerned, an approach similar to how DB requests are currently handled seems like it would achieve this.
What do you guys think?
(To give an example with RabbitMQ, below is a graph of the current state vs. the state after the feature is implemented.)
Now: [screenshot of the current graph, with the producer and consumer services connected directly]
Then: [screenshot of the proposed graph, with a RabbitMQ node between the producer and the consumer]
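To make the difference concrete at the metric level, here is a hypothetical sketch of the series the connector would emit (the metric name is the connector's standard request counter; the service and broker names are made up):

```
# Now: the broker is invisible, only one edge between producer and consumer:
traces_service_graph_request_total{client="producer-svc", server="consumer-svc"}

# Then: the broker appears as its own node, yielding two edges:
traces_service_graph_request_total{client="producer-svc", server="rabbitmq"}
traces_service_graph_request_total{client="rabbitmq", server="consumer-svc"}
```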
Describe alternatives you've considered
No response
Additional context
As this component is based on Grafana Tempo's service graph processor, see also the related Tempo issue: https://github.com/grafana/tempo/issues/3348