open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[connector/servicegraph] The servicegraph supports nodes of the message queue middleware type #30856

Open fyuan1316 opened 7 months ago

fyuan1316 commented 7 months ago

Component(s)

connector/servicegraph

Is your feature request related to a problem? Please describe.

In the current implementation of the servicegraph connector, I found that when requests go through message-queue middleware, the generated metrics ignore the middleware itself (the client and server in the metrics only mark the producer and consumer services), which isn't an intuitive representation. I'd prefer that the generated metrics support visualizing message-middleware nodes in the graph.

Describe the solution you'd like

In scenarios where messages are produced and consumed through message-queue middleware, the final graph only shows the invocation relationship between the producer and the consumer and hides the message-middleware node. Intuitively, this looks no different from the use case where the client requests the server directly.

I think there could be the following benefits if we show message-middleware-type nodes:

In conclusion, I would suggest displaying the message middleware directly on the service graph. As far as the technical implementation is concerned, an approach similar to the way current DB requests are handled seems like it would help us achieve this.

What do you guys think?


(To give an example with RabbitMQ, below is a graph of the current state vs. the state after the requirement is implemented.)

Now

[image: current service graph, with the producer calling the consumer directly and the RabbitMQ node hidden]

Then

[image: proposed service graph, with the RabbitMQ node shown between the producer and the consumer]

Describe alternatives you've considered

No response

Additional context

As this component is based on Grafana Tempo's service graph processor, a corresponding issue has also been opened in Tempo: https://github.com/grafana/tempo/issues/3348

github-actions[bot] commented 7 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

JaredTan95 commented 7 months ago

I think this is a very meaningful enhancement.

jpkrohling commented 7 months ago

@mapno , what do you think?

mapno commented 6 months ago

Agree this would be a nice addition. It's been discussed in the past, but never implemented.

This requires changes in the connector and in the visualisations (AFAIK, Grafana). For the connector, an attribute that contains the messaging system's name or other identifier could be easily added. The OTel spec has semantic conventions for these attributes. Then, it's a matter of representing this new node in the Graph in Grafana.
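
For illustration, a minimal sketch of what this could look like using the connector's existing dimensions option, assuming the producer and consumer spans carry the messaging attributes from the OTel semantic conventions (how Grafana would then render the extra node is a separate question):

connectors:
  servicegraph:
    # Sketch: expose the messaging system and destination as extra metric
    # dimensions so a visualisation could draw a middleware node between
    # the client and server services.
    dimensions:
      - messaging.system            # e.g. "rabbitmq" or "kafka"
      - messaging.destination.name  # the queue or topic name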

cbos commented 6 months ago

I have been testing with JMS messaging between 2 services and am facing the same issue.

There is a slightly 'complicating' factor: https://opentelemetry.io/docs/languages/java/automatic/configuration/#capturing-consumer-message-receive-telemetry-in-messaging-instrumentations

So the producer and the consumer each have their own trace; there is no parent-child relation. At first, all the spans for this were dropped.

Then I changed the config to use virtual nodes.

connectors:
  servicegraph:
    store:
      ttl: 10s
      max_items: 10000
    virtual_node_peer_attributes:
      - messaging.destination.name
      - db.name
      - net.sock.peer.addr
      - net.peer.name
      - rpc.service
      - net.sock.peer.name
      - http.url
      - http.target

messaging.destination.name is an attribute that can be used in the same way as db.name, etc.: https://opentelemetry.io/docs/specs/semconv/messaging/messaging-spans/#messaging-attributes

That works for the producer side.
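
For reference, an illustrative sketch of the producer-side span data that such a configuration would match; the queue name and value below are made up, but the attribute names come from the OTel messaging semantic conventions:

span:
  kind: PRODUCER
  attributes:
    messaging.system: rabbitmq
    messaging.destination.name: orders   # becomes the virtual peer node's name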

But virtual nodes are not implemented for the consumer side. Otherwise the service graph would look as described above, with a virtual node carrying the queue name in between.

With JMS you can implement fire-and-forget. Then it has the same problem as a database: the corresponding span will never come. If the message is picked up later (or never), the corresponding 'child' span will never arrive. In the code there is an exception for databases, since it is foreseen that a database does not create a span: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/b01fd364d01962e666dc347eb13421053ea93bac/connector/servicegraphconnector/connector.go#L290 For messaging systems the situation is the same, so it feels a bit weird that only databases have this special place.

I think the situation of the database and a messaging system can be quite similar.

github-actions[bot] commented 4 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

t00mas commented 4 months ago

I agree this can be a good addition; I'll see what can be done. Please assign this to me.

t00mas commented 4 months ago

@fyuan1316 I've been following the related issues and reading your feedback on the additional context linked. I understand this is where we're at currently:

However, none of that will create a new node in the service graph as you initially suggested, and I see the conversation here shifted away from this component for that feature.

Are there any other metric generation suggestions that should be taken into consideration, here or there?

fyuan1316 commented 4 months ago

Hi @t00mas, this recently merged PR, https://github.com/grafana/tempo/pull/3453, introduces a new metric, traces_service_graph_request_messaging_system_seconds, for messaging systems. I wonder if this will help?
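
If the connector adopted the same metric, a Prometheus scrape could look roughly like the sample below; the service names are invented, and the exact label set is an assumption based on the connector's existing client/server labels:

traces_service_graph_request_messaging_system_seconds_bucket{client="order-producer", server="order-consumer", le="0.1"} 42
traces_service_graph_request_messaging_system_seconds_sum{client="order-producer", server="order-consumer"} 3.5
traces_service_graph_request_messaging_system_seconds_count{client="order-producer", server="order-consumer"} 42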

t00mas commented 4 months ago

I think this is doable, let me see if I can draft a PR for this to sit behind a feature gate.
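
As a rough idea of how that might be enabled once implemented (the gate name below is purely hypothetical; only the --feature-gates flag itself is an existing collector mechanism):

otelcol --config=config.yaml --feature-gates=connector.servicegraph.messagingSystemLatencyHistogram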

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.