open-telemetry / opentelemetry.io

The OpenTelemetry website and documentation
https://opentelemetry.io
Creative Commons Attribution 4.0 International

Document custom Kafka client metrics on otel.io #4063

Open lmolkova opened 11 months ago

lmolkova commented 11 months ago

Context

Effectively, the Kafka project does not plan to follow OTel semconv (@AndrewJSchofield to confirm).

OTel provides several kafka instrumentation components:

  1. Some of them are based on monkey-patching/byte-code rewriting and can emit otel-compatible metrics and traces
  2. Others (such as the collector kafkareceiver or the .NET Aspire Kafka integration library) scrape metrics that the broker provides

The problem:

Group 1 (monkey-patched instrumentations) might still want to emit Kafka-specific metrics/traces. We'll need to keep them in the otel-semconv repo so they are consistent across languages/clients.

Group 2 (instrumentations that report what's available) has a harder problem: there are multiple ways to scrape different sets of metrics from Kafka:

  1. Java uses Kafka JMX metrics
  2. collector uses Kafka client library APIs to get stats
  3. .NET Aspire component uses metrics available through underlying librdkafka
  4. Once KIP-714 is implemented in different languages, there will be yet another way

In most cases these metrics can't be converted to OTel ones (they use different instruments, don't support histograms, don't report the same attributes, etc.).

As a result, we're going to end up with each language SIG (plus external components) defining their own set of custom metrics for Kafka based on what they have.
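A minimal sketch of why pre-aggregated statistics cannot be turned back into OTel histograms. The sample values, bucket boundaries, and helper names below are hypothetical and chosen only for illustration. Two different sets of latency samples produce the same average and maximum, yet fall into different histogram buckets, so a source that exposes only avg/max does not carry enough information to rebuild a histogram.

```python
from bisect import bisect_left


def summarize(samples):
    """Pre-aggregated view, similar in spirit to avg/max style client metrics."""
    return {"avg": sum(samples) / len(samples), "max": max(samples)}


def bucket_counts(samples, bounds):
    """Histogram view: count samples per bucket, upper bounds inclusive."""
    counts = [0] * (len(bounds) + 1)
    for value in samples:
        counts[bisect_left(bounds, value)] += 1
    return counts


# Hypothetical latency samples (milliseconds) and bucket boundaries.
samples_a = [10, 10, 10, 90]
samples_b = [5, 5, 20, 90]
bounds = [10, 50]

# Same summary statistics ...
same_summary = summarize(samples_a) == summarize(samples_b)
# ... but different distributions across the buckets.
different_buckets = bucket_counts(samples_a, bounds) != bucket_counts(samples_b, bounds)
```

Both `same_summary` and `different_buckets` evaluate to `True` for these inputs, which is the mismatch described above: the summary alone cannot distinguish the two distributions.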

What we can do on otel semconv side:

lmolkova commented 11 months ago

Related: https://github.com/open-telemetry/semantic-conventions/pull/338

pyohannes commented 11 months ago
  • recommend a default way to scrape metrics from kafka

For broker metrics, this could be the receiver implemented for the collector, which already maintains a list of supported metrics.

For client metrics, Kafka takes an approach similar to Kubernetes:

I don't think OTel should start to tackle the problems that arise from this, even more so as it's not fully implemented and working yet, and many details are still unclear.

As we now have generic messaging metrics defined (albeit experimental), we should rather build on those where possible. That means seeing whether we can map to those metrics and defining Kafka-specific extensions where needed.

lmolkova commented 11 months ago

@pyohannes the suggestion here is not to solve a big problem but to reduce inconsistency across the non-standard sets of metrics, so that different instrumentations emit similar things.

For example, just document them once like Java does for the kafka library - https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/bbfe950ad0ace8123a5e6817fb3767e27a1a2cee/instrumentation/kafka/kafka-clients/kafka-clients-2.6/library/README.md. This documentation can be an informative one and live on opentelemetry.io.

jack-berg commented 11 months ago

Some of them are based on monkey-patching/byte-code rewriting and can emit otel-compatible metrics and traces

Just a small clarification. The Java Kafka client metrics solution works by bridging metrics from the Kafka client library using API hooks provided by the library. No monkey patching / bytecode rewriting is required. The code performs a minimal generic mapping between the instrument types, metric names, and attribute names. It does not cherry-pick metrics from Kafka and try to conform to any particular conventions - that approach was ruled out because it was too brittle and too time-consuming given the sheer number of instruments exposed (> 200 IIRC).
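To make the idea of a "minimal generic mapping" concrete, here is a rough sketch. It is not the actual Java instrumentation code: the naming scheme, the `-total` heuristic for picking an instrument kind, and all type and function names are assumptions made up for illustration. It only shows the general shape of translating a Kafka metric identity (group, name, tags) mechanically, without a hand-maintained list of metrics.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KafkaMetricId:
    """Hypothetical stand-in for a Kafka client metric identity."""
    group: str
    name: str
    tags: dict = field(default_factory=dict)


def to_otel_descriptor(metric):
    """Mechanically derive an OTel-style name, instrument kind, and attributes.

    Assumed rules (illustrative only): drop a trailing "-metrics" from the
    group, replace dashes with underscores, and treat names ending in
    "-total" as counters and everything else as gauges.
    """
    group = metric.group
    if group.endswith("-metrics"):
        group = group[: -len("-metrics")]
    otel_name = "kafka.{}.{}".format(
        group.replace("-", "_"), metric.name.replace("-", "_")
    )
    kind = "counter" if metric.name.endswith("-total") else "gauge"
    attributes = {key.replace("-", "_"): value for key, value in metric.tags.items()}
    return {"name": otel_name, "kind": kind, "attributes": attributes}


lag = KafkaMetricId(
    group="consumer-fetch-manager-metrics",
    name="records-lag-max",
    tags={"client-id": "consumer-1"},
)
descriptor = to_otel_descriptor(lag)
```

Because the rules are purely mechanical, the same function handles every metric the client exposes, which is the trade-off described above: broad coverage with little maintenance, at the cost of not conforming to any particular convention.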

pyohannes commented 11 months ago

For example just document them once like Java does for kafka library - [...]

That's fine with me, as long as what we document doesn't conflict with the generic messaging metrics that we have.

lmolkova commented 11 months ago

Discussed at the Semconv WG meeting on 12/4.

  1. Kafka owns 'official' metrics emitted by Kafka clients whatever they are
  2. OTel instrumentation libraries/components that emit other metrics should strive for consistency with each other whenever possible.

Next steps:

joaopgrassi commented 8 months ago

Given we removed them from conventions here https://github.com/open-telemetry/semantic-conventions/pull/338, do we really need to do anything here?

lmolkova commented 8 months ago

Given we removed them from conventions here open-telemetry/semantic-conventions#338, do we really need to do anything here?

@joaopgrassi we still want to document them, and the semconv WG decision was to have an informative section on opentelemetry.io, so I transferred the issue.

svrnm commented 8 months ago

Thanks for transferring this issue, @lmolkova. Since this is the first instance of something like this being documented, we need to figure out where and how to put it within the docs. To be honest, right now I am not sure what the best place would be. Any suggestions?

lmolkova commented 8 months ago

@svrnm I wonder if we can add a page under Semantic Conventions, something like "External conventions", where we would be able to document the non-OTel-authored/non-compliant signals that the OTel collector and instrumentation libraries emit.

E.g.

Semantic Conventions
    External Conventions
        Kafka

I believe there are more candidates for that folder: looking into collector receivers, there are plenty of scrapers (Redis, RabbitMQ, ...) that don't always document metrics. Ideally, we want them to at least add a link to external documentation.

As an alternative, we could consider adding a section under "Collector", since most of these external conventions will come through it. Then, in the rare cases where they are needed outside of the collector (like in java-instrumentation), we could just link to the section in the Collector docs.

WDYT?

austinlparker commented 8 months ago

Why aren't they following semconv?

cartermp commented 8 months ago

I would prefer this:

add a page under Semantic Conventions, something like "External conventions"

Since it's consistent with where we keep naming for common components.

lmolkova commented 8 months ago

Why aren't they following semconv?

The Kafka-specific ones we want to find a home for are legacy ones from the pre-OTel world (which the Kafka owners, AFAIK, want to preserve for the time being).

austinlparker commented 8 months ago

Why aren't they following semconv?

The Kafka-specific ones we want to find a home for are legacy ones from the pre-OTel world (which the Kafka owners, AFAIK, want to preserve for the time being).

Ah, ok.

svrnm commented 8 months ago

The Kafka-specific ones we want to find a home for are legacy ones from the pre-OTel world (which the Kafka owners, AFAIK, want to preserve for the time being).

Is there a discussion that we can reference for that? Or, asked differently: have we (the OpenTelemetry community) actively engaged in a conversation with them (the Kafka community) about whether this is the right way forward? Not that we can tell them what to do, but we can at least help (if wanted) them make an informed decision.