carlosalberto opened this issue 1 year ago
Do you intend to just introduce a semantic convention for this, or would this be added to the SDK specification (in https://github.com/open-telemetry/opentelemetry-specification) as well to ensure a consistent implementation? The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.
+1 on semconv; this also walks into the "namespaced attributes" debate.
> The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.

I don't think "stable" is that restrictive, but I think this would be best made optional anyway.
This is exceptionally useful. We added hooks to enable metrics capture in the Ruby SDK a couple of years ago: https://github.com/open-telemetry/opentelemetry-ruby/pull/510. The metrics we defined include:
- otel.otlp_exporter.request_duration
- otel.otlp_exporter.failure ("soft" failure - request will be retried)
- otel.bsp.buffer_utilization (a snapshot of the "fullness" of the BSP buffer)
- otel.bsp.export.success
- otel.bsp.export.failure (hard failure - request will not be retried)
- otel.bsp.exported_spans
- otel.bsp.dropped_spans
At Shopify, we find these metrics very useful for monitoring the health of our trace collection pipeline. We have added these metrics in various hacky ways to other language SDKs (e.g. Go). It would be great to standardize them across SDK implementations.
The Ruby SDK also reports the compressed and uncompressed sizes of each batch before exporting. We have found this to be a better indicator of load on our collection infrastructure than span volume alone, and we often feel the pain of it being missing from the SDK implementations where we have not hacked it in.
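As a rough illustration only (not the actual Ruby implementation, which is in the PR linked above), recording a couple of these metrics through the OpenTelemetry metrics API could look something like the following Java sketch. The class, method, and hook names are made up for the example; only the metric names come from the list above.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Hypothetical sketch: how an exporter might record its own request duration
// and retryable failures. The surrounding exporter code is assumed, not real.
class OtlpExporterMetrics {
  private static final Meter METER = GlobalOpenTelemetry.getMeter("otel-sdk-self-metrics");
  private static final AttributeKey<String> REASON = AttributeKey.stringKey("reason");

  private static final DoubleHistogram REQUEST_DURATION =
      METER.histogramBuilder("otel.otlp_exporter.request_duration")
          .setUnit("ms")
          .setDescription("Duration of OTLP export requests")
          .build();

  private static final LongCounter SOFT_FAILURES =
      METER.counterBuilder("otel.otlp_exporter.failure")
          .setDescription("Export requests that failed and will be retried")
          .build();

  // Called by the exporter once a request completes (hypothetical hook).
  void onExportFinished(long startNanos, boolean retryableFailure, String reason) {
    REQUEST_DURATION.record((System.nanoTime() - startNanos) / 1_000_000.0);
    if (retryableFailure) {
      SOFT_FAILURES.add(1, Attributes.of(REASON, reason));
    }
  }
}
```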
It would be nice if the BSP exported the following metrics:
- otel.bsp.queue.capacity - Maximum size of the queue (Gauge)
- otel.bsp.queue.size - Number of items in the queue (Gauge)
- otel.bsp.queue.max_batch_size - Maximum size of a batch (Gauge)
- otel.bsp.queue.timeout - Timeout after which a batch is exported regardless of size (Gauge)
- otel.bsp.queue.exports - With labels reason=size|timeout (Counter)
With these it's possible to build dashboards and alerts that detect problematic applications easily: you can compare queue size against capacity, and see whether exports are triggered mostly by timeouts or by size limits.
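For illustration, a minimal sketch of how these could be wired up with the OpenTelemetry Java metrics API. The class, the `queue`/`maxQueueSize`/`maxBatchSize` parameters, and the `onExport` hook are placeholders standing in for the batch processor's internals, not actual SDK code.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.Queue;

// Hypothetical sketch of the proposed BSP queue metrics.
class BspQueueMetrics {
  private static final AttributeKey<String> REASON = AttributeKey.stringKey("reason");

  private final LongCounter exports;

  BspQueueMetrics(Queue<?> queue, long maxQueueSize, long maxBatchSize) {
    Meter meter = GlobalOpenTelemetry.getMeter("otel-sdk-self-metrics");

    // Gauges observed via callback each time metrics are collected.
    meter.gaugeBuilder("otel.bsp.queue.capacity").ofLongs()
        .buildWithCallback(m -> m.record(maxQueueSize));
    meter.gaugeBuilder("otel.bsp.queue.size").ofLongs()
        .buildWithCallback(m -> m.record(queue.size()));
    meter.gaugeBuilder("otel.bsp.queue.max_batch_size").ofLongs()
        .buildWithCallback(m -> m.record(maxBatchSize));

    exports = meter.counterBuilder("otel.bsp.queue.exports")
        .setDescription("Batches exported, by trigger")
        .build();
  }

  // Called by the processor when a batch is flushed; reason is "size" or "timeout".
  void onExport(String reason) {
    exports.add(1, Attributes.of(REASON, reason));
  }
}
```

The capacity, size, and batch-size values are reported as asynchronous gauges so they are only read at collection time, while the export count is a synchronous counter incremented at the flush point.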
Opening this issue mainly to get the ball rolling, as I have had users asking for metrics around processed/dropped/exported data (starting with traces, but following up with metrics/logs). I'd like to initially add the following metrics (some inspiration taken from the current metrics in the Java SDK):
- otel.exporter.exported, counter, with attributes:
  - success = true|false
  - type = span|metric|log
  - exporterType = <exporter type, e.g. GrpcSpanExporter>
- otel.processor.processed, counter, with attributes:
  - dropped = true|false (buffer overflow)
  - type = span|metric|log
  - processorType = <processor type, e.g. BatchSpanProcessor>

Although this is mostly targeted at SDKs, the Collector could use this as well, in which case we may want to add a component or pipeline.component attribute (or similar) to signal whether this is an SDK or a Collector.
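To make the proposal concrete, here is a hedged sketch of how an exporter or processor might record these counters via the OpenTelemetry Java metrics API. The metric and attribute names follow the proposal above; the wrapper class and its methods are hypothetical.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Hypothetical sketch of the proposed self-observability counters.
class SelfObservabilityCounters {
  private static final AttributeKey<Boolean> SUCCESS = AttributeKey.booleanKey("success");
  private static final AttributeKey<Boolean> DROPPED = AttributeKey.booleanKey("dropped");
  private static final AttributeKey<String> TYPE = AttributeKey.stringKey("type");
  private static final AttributeKey<String> EXPORTER_TYPE = AttributeKey.stringKey("exporterType");
  private static final AttributeKey<String> PROCESSOR_TYPE = AttributeKey.stringKey("processorType");

  private final LongCounter exported;
  private final LongCounter processed;

  SelfObservabilityCounters() {
    Meter meter = GlobalOpenTelemetry.getMeter("otel-sdk-self-metrics");
    exported = meter.counterBuilder("otel.exporter.exported").build();
    processed = meter.counterBuilder("otel.processor.processed").build();
  }

  // e.g. recordExported(batch.size(), true, "span", "GrpcSpanExporter")
  void recordExported(long count, boolean success, String signal, String exporterType) {
    exported.add(count, Attributes.of(SUCCESS, success, TYPE, signal, EXPORTER_TYPE, exporterType));
  }

  // e.g. recordProcessed(dropCount, true, "span", "BatchSpanProcessor")
  void recordProcessed(long count, boolean dropped, String signal, String processorType) {
    processed.add(count, Attributes.of(DROPPED, dropped, TYPE, signal, PROCESSOR_TYPE, processorType));
  }
}
```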