carlosalberto opened this issue 1 year ago
Do you intend to just introduce a semantic convention for this, or would this be added to the SDK specification (in https://github.com/open-telemetry/opentelemetry-specification) as well to ensure a consistent implementation? The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.
+1 on semconv; this also walks into the "namespaced attributes" debate.
> The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.

I don't think "stable" is that restrictive, but I think this would be best made optional anyway.
This is exceptionally useful. We added hooks to enable metrics capture in the Ruby SDK a couple of years ago: https://github.com/open-telemetry/opentelemetry-ruby/pull/510. The metrics we defined include:
- otel.otlp_exporter.request_duration
- otel.otlp_exporter.failure ("soft" failure - request will be retried)
- otel.bsp.buffer_utilization (a snapshot of the "fullness" of the BSP buffer)
- otel.bsp.export.success
- otel.bsp.export.failure (hard failure - request will not be retried)
- otel.bsp.exported_spans
- otel.bsp.dropped_spans
At Shopify, we find these metrics very useful for monitoring the health of our trace collection pipeline. We have added these metrics in various hacky ways to other language SDKs (e.g. Go). It would be great to standardize them across SDK implementations.
The Ruby SDK also reports the compressed and uncompressed sizes of each batch before exporting. We have found this to be a better indicator of load on our collection infrastructure than span volume alone, and we often feel the pain of it being missing from the SDK implementations where we have not hacked it in.
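As a rough illustration only (not the actual Ruby implementation, which is in the PR linked above), recording a couple of these metrics through the OpenTelemetry metrics API could look something like the following Java sketch. The class, method, and hook names are made up for the example; only the metric names come from the list above.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Hypothetical sketch: how an exporter might record its own request duration
// and retryable failures. The surrounding exporter code is assumed, not real.
class OtlpExporterMetrics {
  private static final Meter METER = GlobalOpenTelemetry.getMeter("otel-sdk-self-metrics");
  private static final AttributeKey<String> REASON = AttributeKey.stringKey("reason");

  private static final DoubleHistogram REQUEST_DURATION =
      METER.histogramBuilder("otel.otlp_exporter.request_duration")
          .setUnit("ms")
          .setDescription("Duration of OTLP export requests")
          .build();

  private static final LongCounter SOFT_FAILURES =
      METER.counterBuilder("otel.otlp_exporter.failure")
          .setDescription("Export requests that failed and will be retried")
          .build();

  // Called by the exporter once a request completes (hypothetical hook).
  void onExportFinished(long startNanos, boolean retryableFailure, String reason) {
    REQUEST_DURATION.record((System.nanoTime() - startNanos) / 1_000_000.0);
    if (retryableFailure) {
      SOFT_FAILURES.add(1, Attributes.of(REASON, reason));
    }
  }
}
```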
It would be nice if the BSP exported the following metrics:
- otel.bsp.queue.capacity - Maximum size of the queue (Gauge)
- otel.bsp.queue.size - Number of items in the queue (Gauge)
- otel.bsp.queue.max_batch_size - Maximum size of a batch (Gauge)
- otel.bsp.queue.timeout - Timeout after which a batch is exported regardless of size (Gauge)
- otel.bsp.queue.exports - With labels reason=size|timeout (Counter)
With these it's possible to build dashboards and alerts that detect problematic applications easily: you can compare queue size against capacity, and see whether exports are triggered mostly by timeouts or by size limits.
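For illustration, a minimal sketch of how these could be wired up with the OpenTelemetry Java metrics API. The class, the `queue`/`maxQueueSize`/`maxBatchSize` parameters, and the `onExport` hook are placeholders standing in for the batch processor's internals, not actual SDK code.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.Queue;

// Hypothetical sketch of the proposed BSP queue metrics.
class BspQueueMetrics {
  private static final AttributeKey<String> REASON = AttributeKey.stringKey("reason");

  private final LongCounter exports;

  BspQueueMetrics(Queue<?> queue, long maxQueueSize, long maxBatchSize) {
    Meter meter = GlobalOpenTelemetry.getMeter("otel-sdk-self-metrics");

    // Gauges observed via callback each time metrics are collected.
    meter.gaugeBuilder("otel.bsp.queue.capacity").ofLongs()
        .buildWithCallback(m -> m.record(maxQueueSize));
    meter.gaugeBuilder("otel.bsp.queue.size").ofLongs()
        .buildWithCallback(m -> m.record(queue.size()));
    meter.gaugeBuilder("otel.bsp.queue.max_batch_size").ofLongs()
        .buildWithCallback(m -> m.record(maxBatchSize));

    exports = meter.counterBuilder("otel.bsp.queue.exports")
        .setDescription("Batches exported, by trigger")
        .build();
  }

  // Called by the processor when a batch is flushed; reason is "size" or "timeout".
  void onExport(String reason) {
    exports.add(1, Attributes.of(REASON, reason));
  }
}
```

The capacity, size, and batch-size values are reported as asynchronous gauges so they are only read at collection time, while the export count is a synchronous counter incremented at the flush point.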
Opening this issue mainly to get the ball rolling, as I have had users asking for metrics around processed/dropped/exported data (starting with traces, but following up with metrics/logs). I'd like to initially add the following metrics (some inspiration taken from the current metrics in the Java SDK):
- otel.exporter.exported, counter, with attributes:
  - success = true|false
  - type = span|metric|log
  - exporterType = <exporter type, e.g. GrpcSpanExporter>
- otel.processor.processed, counter, with attributes:
  - dropped = true|false (buffer overflow)
  - type = span|metric|log
  - processorType = <processor type, e.g. BatchSpanProcessor>

Although this is mostly targeted at SDKs, the Collector could use this as well, in which case we may want to add a component or pipeline.component attribute (or similar) to signal whether this is an SDK or a Collector.
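To make the proposal concrete, here is a hedged sketch of how an exporter or processor might record these counters via the OpenTelemetry Java metrics API. The metric and attribute names follow the proposal above; the wrapper class and its methods are hypothetical.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Hypothetical sketch of the proposed self-observability counters.
class SelfObservabilityCounters {
  private static final AttributeKey<Boolean> SUCCESS = AttributeKey.booleanKey("success");
  private static final AttributeKey<Boolean> DROPPED = AttributeKey.booleanKey("dropped");
  private static final AttributeKey<String> TYPE = AttributeKey.stringKey("type");
  private static final AttributeKey<String> EXPORTER_TYPE = AttributeKey.stringKey("exporterType");
  private static final AttributeKey<String> PROCESSOR_TYPE = AttributeKey.stringKey("processorType");

  private final LongCounter exported;
  private final LongCounter processed;

  SelfObservabilityCounters() {
    Meter meter = GlobalOpenTelemetry.getMeter("otel-sdk-self-metrics");
    exported = meter.counterBuilder("otel.exporter.exported").build();
    processed = meter.counterBuilder("otel.processor.processed").build();
  }

  // e.g. recordExported(batch.size(), true, "span", "GrpcSpanExporter")
  void recordExported(long count, boolean success, String signal, String exporterType) {
    exported.add(count, Attributes.of(SUCCESS, success, TYPE, signal, EXPORTER_TYPE, exporterType));
  }

  // e.g. recordProcessed(dropCount, true, "span", "BatchSpanProcessor")
  void recordProcessed(long count, boolean dropped, String signal, String processorType) {
    processed.add(count, Attributes.of(DROPPED, dropped, TYPE, signal, PROCESSOR_TYPE, processorType));
  }
}
```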