grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

OTel Collector compatibility of the metrics-generator #2970

Open schewara opened 1 year ago

schewara commented 1 year ago

Is your feature request related to a problem? Please describe.

While trying to test the integration of the Span Metrics Connector, I found some compatibility issues between the OTel spanmetricsconnector and the Tempo spanmetrics processor (I didn't have a look at the Grafana Agent).

  1. The metric names in Tempo seem to differ from the "OTel Semantic Conventions" (v1.21.0): the connector emits [namespace_]duration_milliseconds_bucket, whereas Tempo emits traces_spanmetrics_latency_bucket. From inspecting the Semantic Conventions for HTTP Metrics as well as other metrics (e.g. Promtail, Loki, OTel Automatic Instrumentations, ...), duration seems to be the correct and most commonly used name.

  2. The OTel spanmetricsconnector allows defining a namespace for the generated metrics. Grafana and Tempo use a hardcoded traces_spanmetrics namespace/prefix, which, it seems, cannot be changed.

Describe the solution you'd like

It would be really great if the OTel Connectors would also work seamlessly with Grafana and Tempo (including the OTel Service Graph Connector), to allow frictionless migration between different components, based on the individual use-case.

Namespace support would also be nice, but could be addressed alternatively with a default namespace in the Connector, or a note in the documentation.

Describe alternatives you've considered

To have Span Metrics and Service Graphs working with Grafana, the only viable options seem to be to use Tempo's metrics-generator, or to use a processor in the Collector that renames the metrics to match what Grafana expects.

Additional context

joe-elliott commented 1 year ago

Thanks for the issue!

(I didn't have a look at the Grafana Agent)

Grafana Agent vendors OTel components directly, so it will be in line with the OTel Collector's behavior.

It would be really great if the OTel Connectors would also work seamlessly with Grafana and Tempo

Yes!

Currently we do have a lot of configuration options that allow the user to control the shape of the output metrics. I think a good path here is to make sure that Tempo span metrics can be configured to look like OTel Collector metrics by adding any required configuration. Then we can provide some example configurations to make the two equivalent.

It is unfortunately not simple to just change the default metric names since operators have built dashboards/alerts/etc. on top of them. This could be a very costly breaking change to some of our users.

Another issue in play is that Grafana has some custom experiences built around these metrics in the Tempo Explore pane. So even if the user were able to adjust their config so that Tempo metrics looked like OTel metrics, there would still be a gap where some functionality in Grafana would break. Going to cc @grafana/observability-traces-and-profiling for thoughts.

Also, @rlankfo has done some work in this area on our side, and I would love to have his input.

aocenas commented 1 year ago

Yeah, we sort of assume the names of the metrics in all the queries that use them on the Tempo side. We could make that configurable, which is not hard, although it adds another layer of configuration and makes things harder for the user.

I wonder if we could somehow autodetect the naming. If there is a reasonable pattern with just a few options, like (namespace_)?(duration_milliseconds_bucket|traces_spanmetrics_latency_bucket), maybe we can just run some discovery query during configuration to set this up automatically.
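The autodetection idea can be sketched in a few lines. This is only an illustration, not how Grafana does or would implement it: the function name detect_naming is hypothetical, and the list of metric names is assumed to come from some discovery query against the data source that is not shown here. Only the latency histogram names mentioned in this thread are covered.

```python
import re

# Pattern from the comment above: an optional namespace prefix followed by
# either the OTel connector name or the Tempo name of the latency histogram.
NAMING = re.compile(
    r"^(?:(?P<namespace>.+)_)?"
    r"(?P<name>duration_milliseconds_bucket|traces_spanmetrics_latency_bucket)$"
)


def detect_naming(metric_names):
    """Return (convention, namespace) for the first recognized metric name.

    `convention` is "otel" or "tempo"; `namespace` is None when there is no
    prefix. Returns None if no name matches. Hypothetical helper.
    """
    for metric in metric_names:
        match = NAMING.match(metric)
        if match is None:
            continue
        if match.group("name") == "duration_milliseconds_bucket":
            convention = "otel"
        else:
            convention = "tempo"
        return convention, match.group("namespace")
    return None


# Example input; in practice these names would come from the data source.
print(detect_naming(["up", "traces_spanmetrics_latency_bucket"]))
print(detect_naming(["myapp_duration_milliseconds_bucket"]))
```

A real implementation would also have to decide what to do when both conventions are present in the same data source, which this sketch sidesteps by returning the first match.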

rlankfo commented 1 year ago

When generating metrics in Tempo, it's possible to use relabeling during remote write. This would allow you to do things like rename metrics, drop labels, etc. You should technically be able to align your metric names with semantic conventions in this way.

Here's an example of a rename and label drop:

metrics_generator:
  registry:
    external_labels:
      source: "tempo"
  storage:
    path: "/tmp/tempo/generator/wal"
    remote_write:
      - url: "${MIMIR_URL}/api/v1/push"
        send_exemplars: true
        write_relabel_configs:
          - source_labels: ["__name__", "connection_type"]
            target_label: "__name__"
            separator: "@"
            regex: "traces_service_graph_request_client_(.*)@database"
            replacement: 'db_client_duration_$1'
          - regex: "connection_type"
            action: "labeldrop"

In this example, I'm renaming traces_service_graph_request_client_(.*) metrics to db_client_duration_$1 if the connection_type is database. Additionally, the connection_type label is dropped.
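To make the mechanics of the two rules easier to follow, here is a small Python simulation of what they do to a single series. It is a sketch of the relabeling semantics, not Prometheus or Tempo code: the helper names relabel_replace and labeldrop and the sample label values are made up for illustration, and Python's re module stands in for the RE2 engine Prometheus uses.

```python
import re


def relabel_replace(labels, source_labels, separator, regex, target_label, replacement):
    """Simulate one 'replace' relabel rule on a dict of labels (hypothetical helper)."""
    # Source label values are joined with the separator; missing labels count as "".
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # rule does not apply, series is unchanged
    # Translate $1-style references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


def labeldrop(labels, regex):
    """Simulate a 'labeldrop' rule: remove every label whose name matches the regex."""
    return {k: v for k, v in labels.items() if re.fullmatch(regex, k) is None}


# Sample series; the label values are illustrative only.
series = {
    "__name__": "traces_service_graph_request_client_seconds_bucket",
    "connection_type": "database",
    "client": "checkout",
}

renamed = relabel_replace(
    series,
    source_labels=["__name__", "connection_type"],
    separator="@",
    regex="traces_service_graph_request_client_(.*)@database",
    target_label="__name__",
    replacement="db_client_duration_$1",
)
final = labeldrop(renamed, "connection_type")
print(final)
```

A series whose connection_type is not database fails the anchored match, so it keeps its original name and only loses the connection_type label.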

This is a good article on how relabeling in Prometheus works: https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/. Here's the official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write

I hope this helps!

joe-elliott commented 1 year ago

I wonder if we could somehow autodetect the naming. If there is a reasonable pattern with just a few options, like (namespace_)?(duration_milliseconds_bucket|traces_spanmetrics_latency_bucket), maybe we can just run some discovery query during configuration to set this up automatically.

@aocenas This might be a nice feature to add to the list. If we can support Tempo and OTel then we'd also get Grafana Agent (since it creates OTel metrics). Honestly, you might not even need to autodetect anything. We might be able to write some clever PromQL queries that sum both values up.

Thanks for the example @rlankfo !

schewara commented 1 year ago

After some further testing, I came across some more findings. They are not strictly about compatibility, as they also affect the OTel Connectors themselves, but I wanted to add them here for more context.

For these reasons I am currently testing an alternative approach: creating a Node Graph Panel in Grafana based on the existing client metrics with some PromQL and Grafana transformation magic, but I still have to wrap my head around it.