grafana / beyla

eBPF-based autoinstrumentation of web applications and network metrics
https://grafana.com/oss/beyla-ebpf/
Apache License 2.0

Tempo service graphs don't work correctly on our example #239

Closed · grcevski closed this issue 7 months ago

grcevski commented 1 year ago

The NGINX example with 3 downstream services produces an incorrect Tempo service graph. Instead of NGINX sitting between the user and the 3 downstream services, it is generated as their sibling.

[screenshot: Tempo service graph showing NGINX rendered as a sibling of the three downstream services]

grcevski commented 1 year ago

After some digging, this particular issue is caused by how Tempo parses client and server spans to generate the graphs. In the absence of distributed tracing, the server and client spans are not going to match, so we have to rely on the PeerService field in the traces to connect the graph.

The problem is that the server spans never get a ClientService field set in the Tempo graphs, so Tempo assigns "user" to the client, making the graph look like the above.

The only way to work around the problem is to create a fake span that is not a server span, which makes Tempo skip the "user" assignment logic.
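
For reference, Tempo's metrics-generator can be told which span attributes to treat as the peer when it can't match client and server spans by trace context. If I remember the option name correctly it is peer_attributes; roughly (double-check the exact keys against the Tempo docs for your version):

metrics_generator:
  processor:
    service_graphs:
      # attributes Tempo falls back to for naming the remote side of an edge
      # when there is no matching client/server span pair
      peer_attributes:
        - peer.service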

grcevski commented 7 months ago

This has to do with the lack of context propagation in some cases; the focus should be on increasing our ability to propagate context.

sergeij commented 7 months ago

Testing Beyla on Kubernetes and seeing the same issue. Is there a workaround for this?

Beyla Docker image: 1.5.2
OpenTelemetry Collector: 0.99.0
Grafana Tempo Distributed: 2.4.1

Beyla config:

  beyla-config.yml: |
    log_level: INFO
    print_traces: false
    attributes:
      kubernetes:
        enable: true
    routes:
      unmatched: heuristic
    prometheus_export:
      port: 8889
      path: /metrics
    internal_metrics:
      port: 8889
    ebpf:
      bpf_debug: true
    grafana:
      otlp:
        submit: ["metrics", "traces"]
    otel_traces_export:
      sampler:
        name: "parentbased_always_on"
    discovery:
      services:
      - exe_path: (apache2)|(node)

grcevski commented 7 months ago

At the moment we can propagate context between services well for Go, and between services on the same node if a single Beyla is monitoring all services on that node. We'll be improving this in the next couple of months.

brunocascio commented 1 month ago

@grcevski any updates on this?

I'm running Beyla as a DaemonSet, but I'm not able to see the downstream services at all, only the upstream ones.

sergeij commented 1 month ago

@brunocascio if you are using the OpenTelemetry Collector, one workaround is to set peer.service to server.address. Hackish, not perfect, but it works.

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318

  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024
    transform:
      trace_statements:       
        - context: span
          statements:
            - set(attributes["peer.service"], attributes["server.address"]) where attributes["peer.service"] == nil
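
Note that the transform processor only takes effect once it is listed in the traces pipeline; something like this, where the otlp exporter name is just a placeholder for whatever exporter you already use:

config:
  service:
    pipelines:
      traces:
        receivers: [otlp]
        # run the transform before batching so the rewritten attribute is exported
        processors: [transform, batch]
        exporters: [otlp]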
brunocascio commented 1 month ago

> @brunocascio if you are using the OpenTelemetry Collector, one workaround is to set peer.service to server.address. Hackish, not perfect, but it works.

Thanks @sergeij! I'm using the OTel Collector, but through Alloy, so I'll try this and come back soon!
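
For Alloy I guess the equivalent is an otelcol.processor.transform block, something like this (untested; the component label and the exporter it forwards to are just placeholders from my setup):

otelcol.processor.transform "peer_service" {
  error_mode = "ignore"

  trace_statements {
    context = "span"
    statements = [
      // copy server.address into peer.service when Beyla didn't set one
      `set(attributes["peer.service"], attributes["server.address"]) where attributes["peer.service"] == nil`,
    ]
  }

  output {
    traces = [otelcol.exporter.otlp.default.input]
  }
}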

Thanks!

grcevski commented 1 month ago

> @grcevski any updates on this?
>
> I'm running Beyla as a DaemonSet, but I'm not able to see the downstream services at all, only the upstream ones.

We are actively working on making this happen at the moment. We have some changes in main, but they are not enabled yet. More will land in the next couple of weeks.