A couple of initial comments:
1) Do we want source_format to be more specific, e.g. jaeger-grpc, jaeger-thrift-...?
2) Receiver and exporter metrics don't seem to support the service and source_format labels, only the batches/spans received (associated with the processor) - is that an issue?
Some naming issues need to be sorted out - e.g. the otelcol_otelcol_receiver_... metric name and the otelsvc_receiver="jaeger-collector" tag (i.e. consistent use of receiver vs source_format?).
I would prefer to create a Google spreadsheet listing all metrics available in Jaeger components and show how they map to OTel metrics. A GitHub ticket is not the best format for that analysis.
The general answer to the two questions above is yes - we want to keep the expressiveness of Jaeger metrics. All existing dimensions were added for a reason; in particular, being able to quantify the different sources and formats of inbound traffic is important for operating a production cluster.
Needs more work, but initial mapping is here.
Many of the mappings are not clear at the moment, so will need to dig into the code a bit to see what they actually represent.
First draft of metrics comparison is now complete with comments that need discussion: https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing
There are various issues with the metrics, so I want to tackle one specific set of metrics first - specifically the jaeger_agent_reporter_(batches|spans)_(submitted|failures) metrics.
The closest equivalent metrics currently produced by OTC, otelcol_(success|fail)_send, are associated with the queued_retry processor. I don't think this is an issue, as we would want the OTC exporter (when used in place of the agent reporter) to be backed by a retry/queuing mechanism.
Assuming that is not a problem, the issues are:
1) service as a dimension
2) protocol - I was thinking that the queued_retry processor associated with the pipeline could be named to include the protocol, e.g. processor="queued_retry/jaeger_grpc" (so this would be set in the OTel collector config if the end user wanted to differentiate the metrics by protocol; see the config sketch below) - however this may be redundant if the only Jaeger exporter protocol is grpc :)
cc @jaegertracing/jaeger-maintainers
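A minimal sketch of how such a named processor instance could be declared in the collector config - only the relevant sections are shown, and the pipeline name and component wiring are assumptions for illustration, not taken from this thread:
processors:
  # hypothetical per-protocol instance of queued_retry; its metrics would
  # then carry processor="queued_retry/jaeger_grpc"
  queued_retry/jaeger_grpc:
service:
  pipelines:
    traces/jaeger_grpc:
      receivers: [jaeger]
      processors: [queued_retry/jaeger_grpc]
      exporters: [jaeger]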
Doesn’t the protocol label in jaeger refer to the inbound span format?
@yurishkuro No, the reporter protocol was extracted from the metric name, to be a label, in 1.9.0.
Yes, I was thinking of the receiver transport; that should be a different metric anyway.
@yurishkuro If those metrics seem ok for the agent reporter, I'll create some issues on the OTC repo to deal with the problems outlined?
@objectiser so there are a bunch of red cells in your spreadsheet. Some of them are specific to jaeger client/agent integrations, what are your thoughts on those? I assume we can keep them out of scope, since OTel SDKs may not even have the same mechanisms.
For clear misses, yes let's file tickets in OTel.
@yurishkuro The collector metrics I was going to deal with in a separate comment (probably next week) - I wanted to start with the agent reporter ones. I may also raise an issue in the OTC repo about an equivalent metric for jaeger_agent_reporter_batch_size, which would complete the set.
Regarding the jaeger_thrift_udp.... metrics - I wasn't sure about them; if they (or some of them) are relevant, then OTel equivalents could be added to the jaeger receiver?
Reported agent related metrics here: https://github.com/open-telemetry/opentelemetry-collector/issues/662
Adding example metrics recorded by Jaeger with hotrod:
And OTEL metrics receiving data via Jaeger thrift receiver and sending to Jaeger collector (agent mode): https://pastebin.com/X4n9uSJ8
OTEL metrics with --legacy-metrics=false --new-metrics=true: https://pastebin.com/HRqGJDva
OTEL metrics with --legacy-metrics=true --new-metrics=true: https://pastebin.com/ebfZ6YV9
Here is the set of new OTEL metrics:
Receiver metrics: accepted/refused
# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3
# HELP otelcol_receiver_refused_spans Number of spans that could not be pushed into the pipeline.
otelcol_receiver_refused_spans{receiver="jaeger",transport="agent"} 0
Exporter metrics: failed/sent
# HELP otelcol_exporter_send_failed_spans Number of spans in failed attempts to send to destination.
otelcol_exporter_send_failed_spans{exporter="jaeger"} 10
# HELP otelcol_exporter_sent_spans Number of spans successfully sent to destination.
otelcol_exporter_sent_spans{exporter="jaeger"} 3
Processor metrics: accepted spans/batches, dropped spans/batches, refused spans, queue length and latency, send fail, send latency, retry send
# HELP otelcol_processor_accepted_spans Number of spans successfully pushed into the next component in the pipeline.
otelcol_processor_accepted_spans{processor="queued_retry"} 3
# HELP otelcol_processor_batches_received The number of span batches received.
otelcol_processor_batches_received{processor="queued_retry"} 3
# HELP otelcol_processor_dropped_spans Number of spans that were dropped.
otelcol_processor_dropped_spans{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_fail_send The number of failed send operations performed by queued_retry processor
otelcol_processor_queued_retry_fail_send{processor="queued_retry"} 10
# HELP otelcol_processor_queued_retry_queue_latency The "in queue" latency of the successful send operations.
otelcol_processor_queued_retry_queue_latency_bucket{processor="queued_retry",le="10"} 2
# HELP otelcol_processor_queued_retry_queue_length Current number of batches in the queue
otelcol_processor_queued_retry_queue_length{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_send_latency The latency of the successful send operations.
otelcol_processor_queued_retry_send_latency_bucket{processor="queued_retry",le="10"} 3
# HELP otelcol_processor_queued_retry_success_send The number of successful send operations performed by queued_retry processor
otelcol_processor_queued_retry_success_send{processor="queued_retry"} 3
# HELP otelcol_processor_refused_spans Number of spans that were rejected by the next component in the pipeline.
otelcol_processor_refused_spans{processor="queued_retry"} 0
# HELP otelcol_processor_spans_dropped The number of spans dropped.
otelcol_processor_spans_dropped{processor="queued_retry"} 0
# HELP otelcol_processor_spans_received The number of spans received.
otelcol_processor_spans_received{processor="queued_retry"} 3
# HELP otelcol_processor_trace_batches_dropped The number of span batches dropped.
otelcol_processor_trace_batches_dropped{processor="queued_retry"} 0
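These otelcol_* metrics are exposed by the collector's own telemetry endpoint (by default on port 8888). As a small illustration, a Prometheus scrape job for them could look like the following sketch - the job name and target host are assumptions:
scrape_configs:
  - job_name: 'otel-collector-internal'   # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8888']  # assumed hostname; 8888 is the collector's default metrics port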
I have added a second tab to @objectiser's doc - https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing. It contains a similar comparison, probably with more detail.
Here are my findings: the OTEL metrics look good, and there is good coverage for all components. However, Jaeger provides better visibility into which services are reporting spans; this is completely missing in OTEL.
We should address these things:
- spans_received split by debug, format, service, transport. The transport we already have, the format is not needed as we use only a single format per transport, but we need service and maybe debug? cc @yurishkuro https://github.com/open-telemetry/opentelemetry-collector/issues/857
- spans_saved_by_svc split by debug, service, result.
- the transport="agent" value - this could be changed to agent_compact and agent_binary. open-telemetry/opentelemetry-collector#859
I was looking at the agent transport item. I could not find a way to distinguish between binary and compact in the agent's EmitBatch method.
Jaeger exposes spans_received split by debug,format,service,transport. The transport we already have, the format is not needed as we use only a single format per transport.
But the OTel collector accepts even more formats than Jaeger - why is format not needed?
The receiver metrics are split by receiver type and transport. The idea here is that each transport supports only a single format.
# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3
transport="agent"
is weird, we have udp vs. grpc, the actual transports
I think I might have a way to split it into two values. What about udp_thrift_compact, udp_thrift_binary?
that would be good & sufficient.
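For reference, the proposed split maps naturally onto the Jaeger receiver's protocol configuration. A sketch of the receiver section is below - the protocol keys follow the collector's Jaeger receiver config, and the transport label values shown as comments are the ones proposed above, not necessarily what was finally implemented:
receivers:
  jaeger:
    protocols:
      grpc:             # gRPC transport
      thrift_http:      # HTTP transport
      thrift_compact:   # UDP - proposed label value udp_thrift_compact
      thrift_binary:    # UDP - proposed label value udp_thrift_binary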
The Zipkin receiver has the same problem - the dimension is only http, but it can be http_json_v1, http_json_v2, http_thrift_v1, http_proto.
PR to fix the Zipkin metrics https://github.com/open-telemetry/opentelemetry-collector/pull/867
+1
The remaining items here are:
Using the following OpenTelemetry collector config (with image built from master):
and using the business-application.yaml to create some test requests, it resulted in the following metrics:
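The actual config and the resulting metric output are not reproduced here. Purely as an illustration, a minimal collector configuration of the kind described might look like the sketch below - the jaeger receiver/exporter names match the metrics shown earlier in this thread, while the enabled protocols and the endpoint value are assumptions:
receivers:
  jaeger:
    protocols:
      grpc:
      thrift_compact:
processors:
  batch:
  queued_retry:
exporters:
  jaeger:
    endpoint: jaeger-collector:14250   # assumed Jaeger collector gRPC address
service:
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [batch, queued_retry]
      exporters: [jaeger]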