jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Are the current set of collector metrics adequate? #2165

Closed: objectiser closed this issue 4 years ago

objectiser commented 4 years ago

Using the following OpenTelemetry collector config (with image built from master):

    receivers:
      jaeger:
        protocols:
          grpc:
            endpoint: "localhost:14250"

    processors:
      queued_retry:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: [queued_retry]
          exporters: [logging]

and using business-application.yaml to generate some test requests, the collector produced the following metrics:

# HELP otelcol_batches_dropped The number of span batches dropped.
# TYPE otelcol_batches_dropped counter
otelcol_batches_dropped{processor="",service="",source_format=""} 0
# HELP otelcol_batches_received The number of span batches received.
# TYPE otelcol_batches_received counter
otelcol_batches_received{processor="queued_retry",service="inventory",source_format="jaeger"} 9
otelcol_batches_received{processor="queued_retry",service="order",source_format="jaeger"} 9
# HELP otelcol_oc_io_process_cpu_seconds CPU seconds for this process
# TYPE otelcol_oc_io_process_cpu_seconds gauge
otelcol_oc_io_process_cpu_seconds 0
# HELP otelcol_oc_io_process_memory_alloc Number of bytes currently allocated in use
# TYPE otelcol_oc_io_process_memory_alloc gauge
otelcol_oc_io_process_memory_alloc 4.582904e+06
# HELP otelcol_oc_io_process_sys_memory_alloc Number of bytes given to the process to use in total
# TYPE otelcol_oc_io_process_sys_memory_alloc gauge
otelcol_oc_io_process_sys_memory_alloc 7.25486e+07
# HELP otelcol_oc_io_process_total_memory_alloc Number of allocations in total
# TYPE otelcol_oc_io_process_total_memory_alloc gauge
otelcol_oc_io_process_total_memory_alloc 6.415736e+06
# HELP otelcol_otelcol_exporter_dropped_spans Counts the number of spans received by the exporter
# TYPE otelcol_otelcol_exporter_dropped_spans counter
otelcol_otelcol_exporter_dropped_spans{otelsvc_exporter="logging",otelsvc_receiver=""} 0
# HELP otelcol_otelcol_exporter_received_spans Counts the number of spans received by the exporter
# TYPE otelcol_otelcol_exporter_received_spans counter
otelcol_otelcol_exporter_received_spans{otelsvc_exporter="logging",otelsvc_receiver=""} 252
# HELP otelcol_otelcol_receiver_dropped_spans Counts the number of spans dropped by the receiver
# TYPE otelcol_otelcol_receiver_dropped_spans counter
otelcol_otelcol_receiver_dropped_spans{otelsvc_receiver="jaeger-collector"} 0
# HELP otelcol_otelcol_receiver_received_spans Counts the number of spans received by the receiver
# TYPE otelcol_otelcol_receiver_received_spans counter
otelcol_otelcol_receiver_received_spans{otelsvc_receiver="jaeger-collector"} 252
# HELP otelcol_queue_latency The "in queue" latency of the successful send operations.
# TYPE otelcol_queue_latency histogram
otelcol_queue_latency_bucket{processor="queued_retry",le="10"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="25"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="50"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="75"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="100"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="250"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="500"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="750"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="1000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="2000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="3000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="4000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="5000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="10000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="20000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="30000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="50000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="+Inf"} 18
otelcol_queue_latency_sum{processor="queued_retry"} 0
otelcol_queue_latency_count{processor="queued_retry"} 18
# HELP otelcol_queue_length Current number of batches in the queued exporter
# TYPE otelcol_queue_length gauge
otelcol_queue_length{processor="queued_retry"} 0
# HELP otelcol_send_latency The latency of the successful send operations.
# TYPE otelcol_send_latency histogram
otelcol_send_latency_bucket{processor="queued_retry",le="10"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="25"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="50"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="75"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="100"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="250"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="500"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="750"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="1000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="2000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="3000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="4000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="5000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="10000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="20000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="30000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="50000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="+Inf"} 18
otelcol_send_latency_sum{processor="queued_retry"} 0
otelcol_send_latency_count{processor="queued_retry"} 18
# HELP otelcol_spans_dropped The number of spans dropped.
# TYPE otelcol_spans_dropped counter
otelcol_spans_dropped{processor="",service="",source_format=""} 0
# HELP otelcol_spans_received The number of spans received.
# TYPE otelcol_spans_received counter
otelcol_spans_received{processor="queued_retry",service="inventory",source_format="jaeger"} 112
otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger"} 140
# HELP otelcol_success_send The number of successful send operations performed by queued exporter
# TYPE otelcol_success_send counter
otelcol_success_send{processor="queued_retry",service="inventory",source_format="jaeger"} 9
otelcol_success_send{processor="queued_retry",service="order",source_format="jaeger"} 9
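
For reference, here is a minimal Prometheus scrape-job sketch for collecting these collector self-metrics. It assumes the collector's default internal telemetry endpoint on port 8888; the job name and target address are illustrative, not part of the setup above.

    # Sketch of a Prometheus scrape job for the collector's own metrics.
    # Assumes the default internal telemetry endpoint at :8888/metrics;
    # adjust the target to match the actual deployment.
    scrape_configs:
      - job_name: otel-collector
        scrape_interval: 15s
        static_configs:
          - targets: ["localhost:8888"]
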
objectiser commented 4 years ago

A couple of initial comments: 1) Do we want source_format to be more specific, e.g. jaeger-grpc, jaeger-thrift-...? 2) The receiver and exporter metrics don't seem to support the service and source_format labels; only the batches/spans received metrics (associated with the processor) do. Is that an issue?

Some naming issues need to be sorted out as well, e.g. the duplicated prefix in the otelcol_otelcol_receiver_... metric name and the otelsvc_receiver="jaeger-collector" tag (i.e. consistent use of receiver vs source_format?).

yurishkuro commented 4 years ago

I would prefer to create a Google spreadsheet listing all metrics available in Jaeger components and showing how they map to OTel metrics. A GitHub ticket is not the best format for that analysis.

The general answer to the two questions above is yes, we want to keep the expressiveness of Jaeger metrics. All existing dimensions were added for a reason; in particular, being able to quantify the different sources and formats of inbound traffic is important for operating a prod cluster.

objectiser commented 4 years ago

Needs more work, but an initial mapping is here.

Many of the mappings are not clear at the moment, so I will need to dig into the code a bit to see what they actually represent.

objectiser commented 4 years ago

First draft of metrics comparison is now complete with comments that need discussion: https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing

objectiser commented 4 years ago

There are various issues with the metrics, so I want to tackle one specific set first: the jaeger_agent_reporter_(batches|spans)_(submitted|failures) metrics.

The closest equivalent metrics currently produced by OTC, otelcol_(success|fail)_send, are associated with the queued_retry processor. I don't think this is an issue, as we would want the OTC exporter (when used in place of the agent reporter) to be backed by a retry/queuing mechanism.

Assuming that is not a problem, the issues are:

cc @jaegertracing/jaeger-maintainers
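
For orientation, the rough correspondence described above can be sketched as follows. This is an informal mapping only; the batch-level pairing is the one discussed above, and the span-level rows are left open because no direct OTC equivalent is identified in this thread.

    # Informal mapping sketch (Jaeger agent reporter metric -> closest OTC metric).
    jaeger_agent_reporter_batches_submitted: otelcol_success_send  # per queued_retry processor
    jaeger_agent_reporter_batches_failures: otelcol_fail_send      # per queued_retry processor
    jaeger_agent_reporter_spans_submitted: null                    # no direct equivalent identified here
    jaeger_agent_reporter_spans_failures: null                     # no direct equivalent identified here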

yurishkuro commented 4 years ago

Doesn’t the protocol label in jaeger refer to the inbound span format?

objectiser commented 4 years ago

@yurishkuro No, the reporter protocol was extracted from the metric name into a label in 1.9.0.

yurishkuro commented 4 years ago

Yes, I was thinking of the receiver transport; that should be a different metric anyway.

objectiser commented 4 years ago

@yurishkuro If those metrics seem OK for the agent reporter, shall I create some issues on the OTC repo to deal with the problems outlined?

yurishkuro commented 4 years ago

@objectiser So there are a bunch of red cells in your spreadsheet. Some of them are specific to Jaeger client/agent integrations; what are your thoughts on those? I assume we can keep them out of scope, since OTel SDKs may not even have the same mechanisms.

For clear misses, yes let's file tickets in OTel.

objectiser commented 4 years ago

@yurishkuro I was going to deal with the collector metrics in a separate comment (probably next week); I wanted to start with the agent reporter ones. I may also raise an issue in the OTC repo about an equivalent metric for jaeger_agent_reporter_batch_size, which would complete the set.

Regarding the jaeger_thrift_udp.... metrics, I wasn't sure about them; if some of them are relevant, could OTel equivalents be added to the Jaeger receiver?

objectiser commented 4 years ago

Reported the agent-related metrics here: https://github.com/open-telemetry/opentelemetry-collector/issues/662

pavolloffay commented 4 years ago

Adding example metrics recorded by Jaeger with hotrod:

And the OTEL metrics when receiving data via the Jaeger Thrift receiver and sending to the Jaeger collector (agent mode): https://pastebin.com/X4n9uSJ8

OTEL metrics with --legacy-metrics=false --new-metrics=true: https://pastebin.com/HRqGJDva
OTEL metrics with --legacy-metrics=true --new-metrics=true: https://pastebin.com/ebfZ6YV9

pavolloffay commented 4 years ago

Here is the set of new OTEL metrics:

Receiver metrics: accepted/refused

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3
# HELP otelcol_receiver_refused_spans Number of spans that could not be pushed into the pipeline.
otelcol_receiver_refused_spans{receiver="jaeger",transport="agent"} 0

Exporter metrics: failed/sent

# HELP otelcol_exporter_send_failed_spans Number of spans in failed attempts to send to destination.
otelcol_exporter_send_failed_spans{exporter="jaeger"} 10
# HELP otelcol_exporter_sent_spans Number of spans successfully sent to destination.
otelcol_exporter_sent_spans{exporter="jaeger"} 3

Processor metrics: accepted spans/batches, dropped spans/batches, refused spans, queue length and latency, send fail, send latency, retry send

# HELP otelcol_processor_accepted_spans Number of spans successfully pushed into the next component in the pipeline.
otelcol_processor_accepted_spans{processor="queued_retry"} 3
# HELP otelcol_processor_batches_received The number of span batches received.
otelcol_processor_batches_received{processor="queued_retry"} 3
# HELP otelcol_processor_dropped_spans Number of spans that were dropped.
otelcol_processor_dropped_spans{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_fail_send The number of failed send operations performed by queued_retry processor
otelcol_processor_queued_retry_fail_send{processor="queued_retry"} 10
# HELP otelcol_processor_queued_retry_queue_latency The "in queue" latency of the successful send operations.
otelcol_processor_queued_retry_queue_latency_bucket{processor="queued_retry",le="10"} 2
# HELP otelcol_processor_queued_retry_queue_length Current number of batches in the queue
otelcol_processor_queued_retry_queue_length{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_send_latency The latency of the successful send operations.
otelcol_processor_queued_retry_send_latency_bucket{processor="queued_retry",le="10"} 3
# HELP otelcol_processor_queued_retry_success_send The number of successful send operations performed by queued_retry processor
otelcol_processor_queued_retry_success_send{processor="queued_retry"} 3
# HELP otelcol_processor_refused_spans Number of spans that were rejected by the next component in the pipeline.
otelcol_processor_refused_spans{processor="queued_retry"} 0
# HELP otelcol_processor_spans_dropped The number of spans dropped.
otelcol_processor_spans_dropped{processor="queued_retry"} 0
# HELP otelcol_processor_spans_received The number of spans received.
otelcol_processor_spans_received{processor="queued_retry"} 3
# HELP otelcol_processor_trace_batches_dropped The number of span batches dropped.
otelcol_processor_trace_batches_dropped{processor="queued_retry"} 0
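
As an illustration of how this new metric set could be consumed, here is a hedged Prometheus alerting-rule sketch; the rule names, thresholds, and durations are examples, not part of any proposal in this thread.

    groups:
      - name: otel-collector-pipeline  # example group name
        rules:
          - alert: OtelExporterSendFailures
            # Fires if any exporter reports failed span sends for 10 minutes.
            expr: 'sum(rate(otelcol_exporter_send_failed_spans[5m])) by (exporter) > 0'
            for: 10m
          - alert: OtelReceiverRefusedSpans
            # Fires if a receiver is refusing spans (pipeline backpressure).
            expr: 'sum(rate(otelcol_receiver_refused_spans[5m])) by (receiver, transport) > 0'
            for: 10m
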
pavolloffay commented 4 years ago

I have added a second tab to @objectiser's doc: https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing. It contains a similar comparison, probably with more details.

Here are my findings: the OTEL metrics look good, and there is good coverage for all components. However, Jaeger provides better visibility into which services are reporting spans; this is completely missing in OTEL.

We should address these things:

  1. Split receiver metrics by service. Jaeger exposes spans_received split by debug, format, service, transport. We already have the transport; the format is not needed, as we use only a single format per transport, but we need service and maybe debug? (See the recording-rule sketch after this list.) cc @yurishkuro https://github.com/open-telemetry/opentelemetry-collector/issues/857
  2. Split storage metrics by service. Jaeger exposes spans_saved_by_svc split by debug, service, result.
  3. Average span size, exposed in the receiver and also at the exporter (storage), because the size can change. https://github.com/open-telemetry/opentelemetry-collector/issues/856
  4. Make the transport in receiver metrics more precise. For instance, our agent exposes two endpoints, but it is always labeled with agent; this could be changed to agent_compact and agent_binary. open-telemetry/opentelemetry-collector#859
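
The recording-rule sketch below (referenced from item 1) shows what can already be derived from the current receiver labels; the rule and group names are illustrative. The per-service breakdown itself is exactly what is missing and what issue 857 asks for.

    groups:
      - name: span-intake  # example recording rules
        rules:
          # Per-receiver/transport intake rate is derivable from today's labels.
          - record: otelcol:receiver_accepted_spans:rate5m
            expr: 'sum(rate(otelcol_receiver_accepted_spans[5m])) by (receiver, transport)'
          # A per-service breakdown (item 1 above) is not possible until the
          # receiver metrics gain a service label.
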
pavolloffay commented 4 years ago

I was looking at item 4. I could not find a way to distinguish between binary and compact in the agent's EmitBatch method:

https://github.com/open-telemetry/opentelemetry-collector/blob/9f0f8e4b4ea368f68458e11a0cae2450a971e8d2/receiver/jaegerreceiver/trace_receiver.go#L317

https://github.com/open-telemetry/opentelemetry-collector/blob/9f0f8e4b4ea368f68458e11a0cae2450a971e8d2/receiver/jaegerreceiver/trace_receiver.go#L109

yurishkuro commented 4 years ago

"Jaeger exposes spans_received split by debug, format, service, transport. The transport we already have, the format is not needed as we use only a single format per transport."

But the OTel collector accepts even more formats than Jaeger; why is format not needed?

pavolloffay commented 4 years ago

The receiver metrics are split by receiver type and transport. The idea here is that each transport supports only a single format.

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3
yurishkuro commented 4 years ago

transport="agent" is weird; we have udp vs. grpc, the actual transports.

pavolloffay commented 4 years ago

I think I might have a way to split it into two values. What about udp_thrift_compact and udp_thrift_binary?

yurishkuro commented 4 years ago

that would be good & sufficient.

pavolloffay commented 4 years ago

Here is the PR https://github.com/open-telemetry/opentelemetry-collector/pull/859

pavolloffay commented 4 years ago

The Zipkin receiver has the same problem: the dimension is only http, but it could be http_json_v1, http_json_v2, http_thrift_v1, or http_proto.

PR to fix the Zipkin metrics https://github.com/open-telemetry/opentelemetry-collector/pull/867

yurishkuro commented 4 years ago

+1

pavolloffay commented 4 years ago

The remaining items here are: