grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Trace data is being generated even though there are no requests from service A to service B #3689

Closed finda-yeongjo closed 2 months ago

finda-yeongjo commented 4 months ago

Describe the bug

When monitoring Java applications with OpenTelemetry Java auto-instrumentation, the trace data incorrectly shows service A calling service B (A -> B), even though there is no actual call between A and B. Following microservice principles, A and B produce and consume data through a Kafka topic and stay loosely coupled. The issue is visible in the traces_service_graph_request_total metric and in the Zipkin trace data, both of which suggest a relationship that does not exist.

[Screenshot 2024-05-20, 10:44:48 AM]

If I enable the following two options in the OpenTelemetry instrumentation, service A is shown as "user" instead, but the connection is still reported.

      - name: OTEL_INSTRUMENTATION_MESSAGING_RECEIVE_TELEMETRY_ENABLED
        value: "true"
      - name: OTEL_INSTRUMENTATION_MESSAGING_SEND_TELEMETRY_ENABLED
        value: "true"

[Screenshot 2024-05-20, 10:41:45 AM]

I initially raised this issue with the OpenTelemetry team, but their response suggested raising the issue with Grafana and Tempo instead. Ref. https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/11348

To Reproduce

Steps to reproduce the behavior:

  1. Deploy two Java Spring services, A and B, in Kubernetes using the OpenTelemetry Java auto-instrumentation agent with the following configuration:
    apiVersion: opentelemetry.io/v1alpha1
    kind: Instrumentation
    metadata:
      name: sample-instrumentation
      namespace: test
    spec:
      propagators:
        - b3
      sampler:
        type: parentbased_traceidratio
        argument: "0.1"
      java:
        image: "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.29.0"
        env:
          - name: OTEL_METRICS_EXPORTER
            value: "prometheus"
          - name: OTEL_METRICS_EXEMPLAR_FILTER
            value: "trace_based"
          - name: OTEL_TRACES_EXPORTER
            value: "zipkin"
          - name: OTEL_EXPORTER_ZIPKIN_ENDPOINT
            value: "http://SOME_TEST_OTELCOLLECTOR_ENDPOINT:9411/api/v2/spans"
          - name: OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_REQUEST_HEADERS
            value: "content-type"
          - name: OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_RESPONSE_HEADERS
            value: "content-type"
  2. Observe the trace data in Tempo (with Grafana).
  3. Notice that the trace data incorrectly indicates that service A is making requests to service B.
  4. Check A->B in Grafana using the service graph or the traces_service_graph_request_total data.
  5. For a more accurate reproduction, create a Kafka topic that one service produces to and the other consumes from, and test with it.

Expected behavior

The trace data and metrics should accurately reflect the interactions between services. Specifically, no traces or metrics should suggest a direct interaction between service A and service B when there is none.

Environment:

Additional Context

All services are deployed as pods in EKS. The issue persists even after verifying that there are no overlapping or contaminated headers and that trace IDs are unique and correctly configured. The OpenTelemetry instrumentation is configured to export to Prometheus and Zipkin and to capture content-type headers for HTTP requests and responses.

Service A: JDK: Amazon Corretto 17; Spring: 2.7.1; OS: Amazon Linux (EKS)

Service B: JDK: Amazon Corretto 17; Spring: 3.0.5; OS: Amazon Linux (EKS)

mapno commented 4 months ago

Hi! Service graphs have a number of ways of identifying communication between services. For Tempo, they're described in the docs. Connections do not necessarily represent HTTP requests.

* A request across a messaging system where the outgoing and the incoming span must have `span.kind` set to `producer` and `consumer`, respectively.

This is what's identifying a connection between the two services.
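To illustrate the rule quoted above, here is a simplified Python model of how a producer span and a consumer span in the same trace could be paired into a service graph edge. It is only a sketch and not Tempo's actual implementation: the `Span` class, the `messaging_edges` function, the service names, and the direct parent/child matching rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, minimal span model (not Tempo's or OpenTelemetry's types)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str  # "producer", "consumer", "client", "server", ...


def messaging_edges(spans):
    """Pair each consumer span with a producer parent span in the same trace.

    Returns a set of (client, server, connection_type) tuples, loosely
    mirroring the labels on traces_service_graph_request_total.
    """
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = set()
    for span in spans:
        if span.kind != "consumer" or span.parent_id is None:
            continue
        parent = by_id.get((span.trace_id, span.parent_id))
        if parent is not None and parent.kind == "producer":
            edges.add((parent.service, span.service, "messaging_system"))
    return edges


# Service A publishes to a topic; service B consumes the message and
# continues the same trace because the context travels with the message.
spans = [
    Span("t1", "s1", None, "service-a", "producer"),
    Span("t1", "s2", "s1", "service-b", "consumer"),
]
print(messaging_edges(spans))
```

Under this model, an A -> B edge appears as soon as B's consumer span is a child of A's producer span, even though A never sends an HTTP request to B.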

finda-yeongjo commented 4 months ago

Hey @mapno Your answer was fantastic. I have perfectly removed the problematic parts from the dashboard and various graphs using Tempo as a data source. I blame myself for not carefully reading the docs.

May I ask one more question? When I filter by span_kind, no data is returned for any value (span_kind_consumer, producer, server, client, or unspecified). Is any additional configuration needed? Simply setting connection_type=messaging_system shows all the servers communicating through MSK.

[Screenshot 2024-05-20, 6:24:14 PM]

I am using auto-instrumentation because I cannot enforce span conventions on all technical teams, which makes it difficult for me to directly control headers, span kinds, IDs, etc.

mapno commented 4 months ago

Hey! span_kind is not a label of service graph metrics (it is set on span-metrics, though). I'm not sure it would make sense to add it in the first place, since it is implied by the connection type: if connection_type is messaging_system, the spans must have had kind consumer and producer.
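As a rough sketch of how the connection_type label can be used to hide messaging edges in a dashboard, the helper below builds a PromQL expression over traces_service_graph_request_total. The function name `service_graph_rate`, the 5m window, and the grouping by the `client` and `server` labels are assumptions for illustration, so check the label names against your own metrics before using the query.

```python
def service_graph_rate(exclude_connection_type=None, window="5m"):
    """Build a PromQL expression over traces_service_graph_request_total.

    When exclude_connection_type is given, series carrying that
    connection_type label value are filtered out.
    """
    selector = ""
    if exclude_connection_type:
        selector = f'{{connection_type!="{exclude_connection_type}"}}'
    return (
        "sum by (client, server) "
        f"(rate(traces_service_graph_request_total{selector}[{window}]))"
    )


# Request rate per edge, without edges derived from producer/consumer spans.
print(service_graph_rate("messaging_system"))
```

In PromQL, a `!=` matcher also keeps series that do not carry the label at all, so edges without a connection_type label remain in the result.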

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.