open-telemetry / opentelemetry-collector

OpenTelemetry Collector

Pipeline Metrics: End-to-End processing latency #542

Open · rghetia opened this issue 4 years ago

rghetia commented 4 years ago

This issue specifically covers the end-to-end latency metric, which is part of the overall pipeline metrics (#484).

There are a few things to consider here:

- Should there be one data point for each metric/span, or one per batch?
- What happens when multiple batches are combined?
- What if batches are split?
- What labels should it have?
- What if an element is dropped/filtered?
- What should the type of the metric be?
- What about the latency of each individual stage in the pipeline?

tigrannajaryan commented 4 years ago

Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

rghetia commented 4 years ago

> Do you intend to record end-to-end latency only or also latencies broken down by individual pipeline elements?

My intention was to record end-to-end latency, but a breakdown by individual pipeline element would help with debugging if there is an issue. I'll update the description to include that.

x13n commented 4 years ago

Why not add n data points to the metric whenever a batch is exported? Within a single batch, metrics may have different timestamps if they were buffered, so that approach would make the measurement independent of how they were batched, solving the issue of splitting/merging batches during processing.

dashpole commented 3 years ago

I'd like to work on this, if that's alright.

I did an initial investigation, but I'm still learning about the structure of the collector. One thing I ran into is that since the pipeline isn't synchronous, I can't just wrap a function call with start/stop functions. One thought I had was to propagate the receive time to the exporters using context, but that seems like it could be fragile (e.g. if a processor doesn't correctly propagate the context). Does anyone have other suggestions?
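
For illustration, here is a minimal sketch of the context idea; the package, key, and helper names are hypothetical, not existing collector or obsreport API:

    package obsreportsketch

    import (
        "context"
        "time"
    )

    // receiveTimeKey is an unexported context key for the batch receive time.
    type receiveTimeKey struct{}

    // WithReceiveTime is what a receiver could call when a batch enters the pipeline.
    func WithReceiveTime(ctx context.Context, t time.Time) context.Context {
        return context.WithValue(ctx, receiveTimeKey{}, t)
    }

    // PipelineLatency is what an exporter could call once the batch has been sent.
    // ok is false when a processor dropped the value by not propagating the
    // incoming context, which is the fragility mentioned above.
    func PipelineLatency(ctx context.Context) (latency time.Duration, ok bool) {
        t, ok := ctx.Value(receiveTimeKey{}).(time.Time)
        if !ok {
            return 0, false
        }
        return time.Since(t), true
    }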

dashpole commented 3 years ago

Comments from bogdan on 4/28:

x13n commented 3 years ago

The idea I initially had for this was to re-use metric timestamps and record the delta between metric timestamp and Now() when the metric is successfully exported. That wouldn't be just processing time though - it would be measuring collection (possibly outside of the Collector) as well.

x13n commented 3 years ago

(In which case it should probably be called metric freshness or something.)
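
As a rough illustration of that idea (using a placeholder data point type rather than the collector's pdata API):

    package sketch

    import "time"

    // dataPoint stands in for a metric data point; only its timestamp matters here.
    type dataPoint struct {
        timestamp time.Time
    }

    // freshness is the delta between the data point's timestamp and now, recorded
    // when the point is successfully exported. It includes time spent before the
    // Collector ever saw the point, so it measures staleness rather than pure
    // pipeline processing time.
    func freshness(dp dataPoint, now time.Time) time.Duration {
        return now.Sub(dp.timestamp)
    }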

dashpole commented 3 years ago

I have implemented prototypes of a few approaches:

  1. Pipeline latency with startTime propagated via context. Measures from obsreport.StartXXReceiveOp to obsreport.EndXXExportOp: changes
  2. Pipeline latency with startTime embedded in pdata types. Measures from pdata.NewXX to obsreport.RecordPipelineDuration (basically at obsreport.EndXXExportOp): changes
  3. Component latency with startTime propagated via context. Measures from obsreport.StartXXYYOp to obsreport.EndXXYYOp: changes

dashpole commented 3 years ago

A few thoughts:

3 is definitely the simplest, and it aligns with the other obsreport telemetry we generate (per-component successes vs. failures, per-component traces). Individual component processing durations aren't as good a representation of the end-user experience as e2e latency, but they are probably good enough to at least detect high-latency problems.

If we want to implement a full solution to this issue by adding true end-to-end duration metrics, I tend to prefer approach 1. Approach 2 is a little tricky to get right, at least in my experience implementing it, because components are free to create pdata.Metrics whenever they need to. For example, the batch processor creates new ones to hold batches. We would either have to prevent components from creating new pdata.Metrics (for example, by making them merge or split them with our helper functions), or expose the detail of passing the startTime to all components and make them handle it correctly. With context, all of that can be more invisible, as we can just provide something like obsreport.MergeContext(primary, appended context.Context) to handle cases like merging batches.
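
To make the merging case concrete, here is a rough sketch of what such a helper could look like, keeping one start time per original batch so a merged batch still produces one observation per batch it absorbed (hypothetical code, not the actual draft implementation):

    package obsreportsketch

    import (
        "context"
        "time"
    )

    // startTimesKey is an unexported context key for the receive times.
    type startTimesKey struct{}

    // StartTimes returns the receive times carried by the context, one per
    // original batch that contributed to the current payload.
    func StartTimes(ctx context.Context) []time.Time {
        ts, _ := ctx.Value(startTimesKey{}).([]time.Time)
        return ts
    }

    // MergeContext returns the primary context, additionally carrying the start
    // times from the appended context, so a processor can merge batches without
    // losing any of their receive times.
    func MergeContext(primary, appended context.Context) context.Context {
        merged := make([]time.Time, 0, len(StartTimes(primary))+len(StartTimes(appended)))
        merged = append(merged, StartTimes(primary)...)
        merged = append(merged, StartTimes(appended)...)
        return context.WithValue(primary, startTimesKey{}, merged)
    }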

dashpole commented 3 years ago

Notes from 5/12:

dashpole commented 3 years ago

I have a working draft for context-based e2e metrics that works for the batch processor: https://github.com/open-telemetry/opentelemetry-collector/pull/3183.

Here is the description of the metric from the draft, which explains how we handle batching and multiple exporters:

    // mPipelineProcessingMetrics records an observation for each
    // received batch each time it is exported. For example, a pipeline with
    // one receiver and two exporters will record two observations for each
    // batch of telemetry received: one observation for the time taken to be
    // exported by each exporter.
    // It records an observation for each batch received, not for each batch
    // sent. For example, if five received batches of telemetry are merged
    // together by the "batch" processor and exported together, it will still
    // result in five observations: one observation starting from the time each
    // of the five original batches was received.
    mPipelineProcessingMetrics = stats.Float64(
        "pipeline_processing_duration_seconds",
        "Duration between when a batch of telemetry in the pipeline was received, and when it was sent by an exporter.",
        stats.UnitSeconds)

Feel free to provide feedback on the draft. If the general approach is acceptable, I'll finish it up and send it out for review.
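
For reference, recording the observations from an exporter boils down to something like the sketch below. The measure mirrors the one quoted above; the startTimes argument assumes receive times propagated via context, and the helper name is hypothetical:

    package obsreportsketch

    import (
        "context"
        "time"

        "go.opencensus.io/stats"
    )

    // mPipelineProcessingMetrics mirrors the measure defined in the draft above.
    var mPipelineProcessingMetrics = stats.Float64(
        "pipeline_processing_duration_seconds",
        "Duration between when a batch of telemetry in the pipeline was received, and when it was sent by an exporter.",
        stats.UnitSeconds)

    // recordPipelineDuration records one observation per original received batch
    // once the exporter has finished sending the (possibly merged) batch.
    func recordPipelineDuration(ctx context.Context, startTimes []time.Time) {
        for _, start := range startTimes {
            stats.Record(ctx, mPipelineProcessingMetrics.M(time.Since(start).Seconds()))
        }
    }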

alolita commented 3 years ago

@bogdandrutu @tigrannajaryan following up on this issue: could you please provide feedback on @dashpole's suggested approaches?

tigrannajaryan commented 3 years ago

@alolita I provided the feedback here https://github.com/open-telemetry/opentelemetry-collector/pull/3183#issuecomment-880101600

peachchen0716 commented 1 year ago

Hi @bogdandrutu, I am looking for the e2e metrics too, and the attempt to add this metric seems to be blocked. Do you mind taking a look and sharing some context on the blocker? Thanks!

peachchen0716 commented 1 year ago

Hi @codeboten and @dmitryax, can I get some context/status on PdataContext mentioned in the closed PR for this issue? Is it still the recommended solution? Thanks!