open-telemetry / opentelemetry-collector

OpenTelemetry Collector

Pipeline Metrics: End-to-End processing latency #542

Open · rghetia opened this issue 4 years ago

rghetia commented 4 years ago

This issue specifically covers the end-to-end latency metric, which is part of the overall pipeline metrics (#484).

There are a few things to consider here:

- Should there be one data point for each metric/span, or one per batch?
- What happens when multiple batches are combined?
- What if batches are split?
- What labels should it have?
- What if an element is dropped/filtered?
- What should the type of the metric be?
- What about the latency of each individual stage in the pipeline?

tigrannajaryan commented 4 years ago

Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

rghetia commented 4 years ago

> Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

My intention was to record end-to-end latency, but a breakdown by individual pipeline element would help with debugging if there is an issue. I'll update the description to include that.

x13n commented 4 years ago

Why not add n data points to the metric whenever a batch is exported? Within a single batch, metrics may have different timestamps if they were buffered, so that approach would make the measurement independent of how they were batched, solving the issue of splitting/merging batches during processing.

dashpole commented 3 years ago

I'd like to work on this, if that's alright.

I did an initial investigation, but I'm still learning about the structure of the collector. One thing I ran into is that since the pipeline isn't synchronous, I can't just wrap a function call with start/stop functions. One thought I had was to propagate the receive time to the exporters using context, but that seems like it could be fragile (e.g. if a processor doesn't correctly propagate the context). Does anyone have other suggestions?
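
For illustration, here is a minimal sketch of the context idea; the package, key, and helper names are hypothetical, not existing collector or obsreport API:

    package obsreportsketch

    import (
        "context"
        "time"
    )

    // receiveTimeKey is an unexported context key for the batch receive time.
    type receiveTimeKey struct{}

    // WithReceiveTime is what a receiver could call when a batch enters the pipeline.
    func WithReceiveTime(ctx context.Context, t time.Time) context.Context {
        return context.WithValue(ctx, receiveTimeKey{}, t)
    }

    // PipelineLatency is what an exporter could call once the batch has been sent.
    // ok is false when a processor dropped the value by not propagating the
    // incoming context, which is the fragility mentioned above.
    func PipelineLatency(ctx context.Context) (latency time.Duration, ok bool) {
        t, ok := ctx.Value(receiveTimeKey{}).(time.Time)
        if !ok {
            return 0, false
        }
        return time.Since(t), true
    }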

dashpole commented 3 years ago

Comments from bogdan on 4/28:

x13n commented 3 years ago

The idea I initially had for this was to re-use metric timestamps and record the delta between metric timestamp and Now() when the metric is successfully exported. That wouldn't be just processing time though - it would be measuring collection (possibly outside of the Collector) as well.

x13n commented 3 years ago

(In which case it should probably be called metric freshness or something.)
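
As a rough illustration of that idea (using a placeholder data point type rather than the collector's pdata API):

    package sketch

    import "time"

    // dataPoint stands in for a metric data point; only its timestamp matters here.
    type dataPoint struct {
        timestamp time.Time
    }

    // freshness is the delta between the data point's timestamp and now, recorded
    // when the point is successfully exported. It includes time spent before the
    // Collector ever saw the point, so it measures staleness rather than pure
    // pipeline processing time.
    func freshness(dp dataPoint, now time.Time) time.Duration {
        return now.Sub(dp.timestamp)
    }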

dashpole commented 3 years ago

I have implemented prototypes of a few approaches:

  1. Pipeline latency with startTime propagated via context. Measures from obsreport.StartXXReceiveOp to obsreport.EndXXExportOp: changes
  2. Pipeline latency with startTime embedded in pdata types. Measures from pdata.NewXX to obsreport.RecordPipelineDuration (basically at obsreport.EndXXExportOp): changes
  3. Component latency with startTime propagated via context. Measures from obsreport.StartXXYYOp to obsreport.EndXXYYOp: changes

dashpole commented 3 years ago

A few thoughts:

3 is definitely the simplest, and it aligns with the other obsreport telemetry we generate (per-component successes vs. failures, per-component traces). Individual component processing durations aren't as good a representation of the end-user experience as e2e latency, but they are probably good enough to at least detect high-latency problems.

If we want to implement a full solution to this issue by adding true end-to-end duration metrics, I tend to prefer approach 1. Approach 2 is a little tricky to get right, at least in my experience implementing it, because components are free to create pdata.Metrics whenever they need to. For example, the batch processor creates new ones to hold batches. We would either have to prevent components from creating new pdata.Metrics (for example, by making them merge or split them with our helper functions), or expose the detail of passing the startTime to all components and make them handle it correctly. With context, all of that can be more invisible, as we can just provide something like obsreport.MergeContext(primary, appended context.Context) to handle cases like merging batches.
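
To make the merging case concrete, here is a rough sketch of what such a helper could look like, keeping one start time per original batch so a merged batch still produces one observation per batch it absorbed (hypothetical code, not the actual draft implementation):

    package obsreportsketch

    import (
        "context"
        "time"
    )

    // startTimesKey is an unexported context key for the receive times.
    type startTimesKey struct{}

    // StartTimes returns the receive times carried by the context, one per
    // original batch that contributed to the current payload.
    func StartTimes(ctx context.Context) []time.Time {
        ts, _ := ctx.Value(startTimesKey{}).([]time.Time)
        return ts
    }

    // MergeContext returns the primary context, additionally carrying the start
    // times from the appended context, so a processor can merge batches without
    // losing any of their receive times.
    func MergeContext(primary, appended context.Context) context.Context {
        merged := make([]time.Time, 0, len(StartTimes(primary))+len(StartTimes(appended)))
        merged = append(merged, StartTimes(primary)...)
        merged = append(merged, StartTimes(appended)...)
        return context.WithValue(primary, startTimesKey{}, merged)
    }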

dashpole commented 3 years ago

Notes from 5/12:

dashpole commented 3 years ago

I have a working draft for context-based e2e metrics that works for the batch processor: https://github.com/open-telemetry/opentelemetry-collector/pull/3183.

Here is the description of the metric from the draft, which explains how we handle batching and multiple exporters:

    // mPipelineProcessingMetrics records an observation for each
    // received batch each time it is exported. For example, a pipeline with
    // one receiver and two exporters will record two observations for each
    // batch of telemetry received: one observation for the time taken to be
    // exported by each exporter.
    // It records an observation for each batch received, not for each batch
    // sent. For example, if five received batches of telemetry are merged
    // together by the "batch" processor and exported together, it will still
    // result in five observations: one observation starting from the time each
    // of the five original batches was received.
    mPipelineProcessingMetrics = stats.Float64(
        "pipeline_processing_duration_seconds",
        "Duration between when a batch of telemetry in the pipeline was received, and when it was sent by an exporter.",
        stats.UnitSeconds)

Feel free to provide feedback on the draft. If the general approach is acceptable, I'll finish it up and send it out for review.
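
For reference, recording the observations from an exporter boils down to something like the sketch below. The measure mirrors the one quoted above; the startTimes argument assumes receive times propagated via context, and the helper name is hypothetical:

    package obsreportsketch

    import (
        "context"
        "time"

        "go.opencensus.io/stats"
    )

    // mPipelineProcessingMetrics mirrors the measure defined in the draft above.
    var mPipelineProcessingMetrics = stats.Float64(
        "pipeline_processing_duration_seconds",
        "Duration between when a batch of telemetry in the pipeline was received, and when it was sent by an exporter.",
        stats.UnitSeconds)

    // recordPipelineDuration records one observation per original received batch
    // once the exporter has finished sending the (possibly merged) batch.
    func recordPipelineDuration(ctx context.Context, startTimes []time.Time) {
        for _, start := range startTimes {
            stats.Record(ctx, mPipelineProcessingMetrics.M(time.Since(start).Seconds()))
        }
    }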

alolita commented 3 years ago

@bogdandrutu @tigrannajaryan following up on this issue: could you please provide feedback on @dashpole's suggested approaches?

tigrannajaryan commented 3 years ago

@alolita I provided the feedback here https://github.com/open-telemetry/opentelemetry-collector/pull/3183#issuecomment-880101600

peachchen0716 commented 1 year ago

Hi @bogdandrutu, I am looking for the e2e metrics too, and the attempt to add this metric seems to be blocked. Do you mind taking a look and sharing some context on the blocker? Thanks!

peachchen0716 commented 1 year ago

Hi @codeboten and @dmitryax, can I get some context/status on PdataContext mentioned in the closed PR for this issue? Is it still the recommended solution? Thanks!