jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

[Bug]: Empty metrics in the monitoring tab #4035

Closed marianobrc closed 11 months ago

marianobrc commented 2 years ago

What happened?

I'm trying out the experimental SPM feature, using the all-in-one Docker image and docker-compose with Prometheus, as described in the docs. I want to see latency and error-rate metrics for my services, but I only see empty metrics.

Steps to reproduce

I'm sending traces both with the simulator and from my own services instrumented with OpenTelemetry, and both are collected and shown properly. In the Monitor tab, however, I can only see metrics for the data generated by the simulator, not for my services. My services appear in the dropdown, but when I select one it says no data is available.

I checked in Prometheus and the metrics are there, and I can also visualize them in Grafana, but I still can't see them in the Jaeger Monitor tab. When I call the HTTP API, or inspect the requests made by the frontend, I can see that the metrics API returns an empty list for my services.
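For reference, the call in question can be reproduced from the command line. This is only a sketch: it assumes the compose file's 17686 port mapping for jaeger-query, the documented metrics query parameters, and a placeholder service name.

```shell
# Assumptions: jaeger-query reachable on the host port mapped in the compose
# file (17686), and a service name as it appears in the UI dropdown.
JAEGER_QUERY="http://localhost:17686"
SERVICE="my-python-service"

# The metrics API expects millisecond timestamps and durations.
END_TS="$(($(date +%s) * 1000))"
URL="${JAEGER_QUERY}/api/metrics/calls?service=${SERVICE}&endTs=${END_TS}&lookback=3600000&step=60000&ratePer=600000"
echo "$URL"

# Uncomment to query a live instance; an empty "metrics" list in the response
# matches what the Monitor tab renders as "no data available".
# curl -s "$URL"
```

The same path shape works for `/api/metrics/latencies` (with a `quantile` parameter) and `/api/metrics/errors`.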

Expected behavior

I'm wondering if there is a minimum number of traces required before metrics start appearing, or if I'm missing something?

Relevant log output

No response

Screenshot

No response

Additional context

No response

Jaeger backend version

v1.39.0

SDK

OpenTelemetry Python SDK using OTLPSpanExporter

Pipeline

OTLPSpanExporter -> OTel Collector -> Prometheus -> Jaeger All-in-one

Storage backend

Prometheus

Operating system

Ubuntu

Deployment model

docker-compose

Deployment configs

version: "3.5"
services:
  jaeger:
    networks:
      - backend
    image: jaegertracing/all-in-one:1.39
    volumes:
      - "./jaeger-ui.json:/etc/jaeger/jaeger-ui.json"
    command: --query.ui-config /etc/jaeger/jaeger-ui.json
    environment:
      - METRICS_STORAGE_TYPE=prometheus
      - PROMETHEUS_SERVER_URL=http://prometheus:9090
    ports:
      - "17686:16686"
      - "17687:16687"
  otel_collector:
    networks:
      - backend
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - "./otel-collector-config.yml:/etc/otelcol/otel-collector-config.yml"
    command: --config /etc/otelcol/otel-collector-config.yml
    ports:
      - "4417:4317"
      - "4418:4318"
    depends_on:
      - jaeger
  microsim:
    networks:
      - backend
    image: yurishkuro/microsim:0.2.0
    command: "-j http://otel_collector:14278/api/traces -d 24h -s 500ms"
    depends_on:
      - otel_collector
  prometheus:
    networks:
      - backend
    image: prom/prometheus:latest
    volumes:
      - "./prometheus.yml:/etc/prometheus/prometheus.yml"
    ports:
      - "9090:9090"
  grafana:
    networks:
      - backend
    image: grafana/grafana:latest
    volumes:
      - ./grafana.ini:/etc/grafana/grafana.ini
      - ./datasource.yml:/etc/grafana/provisioning/datasources/datasource.yaml
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
    ports:
      - 3000:3000

networks:
  backend:
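(The otel-collector-config.yml referenced above was not included. As a point of reference, a minimal sketch of what it might contain for this pipeline, modeled on the SPM documentation's example, is below; the spanmetrics processor is what generates the R.E.D. metrics that Jaeger reads back from Prometheus. All names, ports, and endpoints here are assumptions, not the reporter's actual config.)

```yaml
# Sketch only: modeled on the SPM docs' example collector config.
receivers:
  otlp:
    protocols:
      grpc:
      http:
  jaeger:                           # microsim posts Jaeger-thrift to :14278
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"

processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus    # must match an exporter in a metrics pipeline

exporters:
  jaeger:
    endpoint: "jaeger:14250"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"        # Prometheus scrapes this target

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    metrics/spanmetrics:
      receivers: [otlp]
      exporters: [prometheus]
```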

albertteoh commented 2 years ago

Thanks for the details @marianobrc.

if there is a minimal amount of traces required to start getting metrics

No, there shouldn't be; just a couple of traces are enough to start showing some metrics in the Monitor tab.

What span kind is your python application emitting? If it's anything but the server kind, then that could explain why you're not able to view metrics from your application. See also: https://github.com/jaegertracing/jaeger/pull/3898.
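To illustrate (assuming the spanmetrics processor's default metric names and labels, which may differ by version): the generated Prometheus series are keyed by a span-kind label, and the Monitor tab queries only the server kind, so producer/consumer series are never selected even though they exist.

```promql
# Both series may be present in Prometheus, but only the first is queried
# by the Monitor tab (service and label values are placeholders):
calls_total{service_name="my-service", span_kind="SPAN_KIND_SERVER"}
calls_total{service_name="my-service", span_kind="SPAN_KIND_PRODUCER"}
```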

There's an otlp_exporter_example.py script in the demo environment that should produce metrics in the Monitor tab for you (you might need to wait a few seconds after sending a trace). Maybe you can use it as a reference when comparing against your own Python instrumentation.

marianobrc commented 2 years ago

Thanks for your prompt response @albertteoh.

My application uses spans of kind producer and consumer, not server, indeed. I'm tracing distributed transactions in an event-driven architecture that uses Kafka as the message broker for inter-service communication. The goal is to get metrics like the total latency (duration) of the transaction trace. I can see the total duration of each individual trace for one of these transactions in the trace details, and I was hoping to see some latency metric derived from that trace duration in the Monitor tab.

I tried changing the span kind to server and I do see metrics now, but it doesn't feel like the right solution: it doesn't make sense to force a span within a producer or consumer to be of server kind just to get the metrics. Also, I see that the latency metric is calculated at the span level instead of at the trace level, so at most I can see the latency of some span within one service. I guess that makes sense, as this is for "Service Monitoring", so my use case isn't supported.

albertteoh commented 2 years ago

From what I gather, I think there are two problems that you're currently facing:

  1. Limitation on span kind.
  2. View trace of a transaction from an event-driven architecture.

1. Limitation on span kind

I agree that forcing the server span kind into your producer/consumer applications isn't the right approach.

In this case, I think it's just a limitation of the Jaeger UI at the moment, which currently hardcodes the kind to server (for expediency at the time of development). The Jaeger Query API itself supports querying for any span kind. I'm not sure, however, if it's a good idea to aggregate across all span kinds as that would "double" count the metrics if, say, a service is both a server and a producer.

Would https://github.com/jaegertracing/jaeger/pull/3898#issuecomment-1305157851 address your need to view metrics from producer/consumer spans (i.e. a dropdown in Jaeger UI's Monitor tab to select the span kind, and perhaps cache that selection per service)?

2. View trace of a transaction from an event-driven architecture

SPM was indeed designed to provide service-level monitoring, and so is completely oblivious to the concept of traces, which has the added benefit of simplifying the design.

If I understand your requirement correctly, you'd like to view the latency of handling a single transaction in an event-driven architecture.

I don't have any experience in this space and, in SPM, I would usually suggest going to the "root" service to find this information. However, my naïve understanding is that this may not be possible for event-driven architectures: the producer simply returns after sending the message payload to Kafka, and the request is consumed asynchronously, so the "root" span doesn't encompass the entire handling of the transaction. What you'd essentially want is a way to aggregate the latencies across all spans in a trace.
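To make that last idea concrete, the aggregation being described is roughly "earliest span start to latest span end" across a whole trace, which spans the asynchronous gap between producer and consumer. A minimal sketch, using hypothetical span records rather than any Jaeger API:

```python
def trace_duration_ms(spans):
    """Wall-clock duration of a trace: earliest start to latest end across all spans."""
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    return end - start

# The producer returns quickly; the consumer finishes the work asynchronously,
# so the producer's "root" span alone badly underestimates the transaction time.
spans = [
    {"name": "producer publish", "start_ms": 0,   "duration_ms": 5},
    {"name": "consumer handle",  "start_ms": 120, "duration_ms": 40},
]
print(trace_duration_ms(spans))  # → 160, vs. only 5 for the producer span
```

Note this is a per-trace computation, so it doesn't fit SPM's trace-oblivious, service-level aggregation model described above.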

If the above is correct, I agree that SPM would not support the use case of measuring the latency of a single transaction. However, I'm definitely open to ideas/suggestions!