
Simple processor metrics #10708

Closed: djaglowski closed this issue 2 weeks ago

djaglowski commented 3 months ago

Is your feature request related to a problem? Please describe.

We currently define a set of metrics for processors which cannot be automatically recorded by the collector framework, and as far as I can tell are not actually recorded by any processors in core or contrib. In short, the metrics describe the number of items accepted, dropped, inserted, and refused. (I believe these metrics are loosely inspired by proposals such as https://github.com/open-telemetry/oteps/pull/259, but no such design has actually landed.)

We also have proposals such as https://github.com/open-telemetry/opentelemetry-collector/issues/10355, which would measure the size of data according to the same categories, but again, it is not clear that there is any path to automatically recording these values. In the case of size metrics, it would additionally be very difficult for processor developers to accurately capture such measurements because it's unclear how to attribute inserted, dropped, or refused sizes as a portion of an overall payload.

Describe the solution you'd like

Looking at this from a collector implementation perspective, what we can clearly observe automatically is the data being passed in and out of each processor. It would be difficult to determine precisely which parts of that data were inserted or deleted, but we can certainly count the number of items and measure the size of the payload (according to a standard such as ProtoMarshaler.Sizer()).
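
For illustration, here is a minimal sketch of what the framework can observe for a logs payload using the pdata API. The payload construction and variable names are only examples, not part of this proposal.

    package main

    import (
        "fmt"

        "go.opentelemetry.io/collector/pdata/plog"
    )

    func main() {
        // Build a tiny example payload with a single log record.
        ld := plog.NewLogs()
        ld.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty().LogRecords().AppendEmpty().Body().SetStr("hello")

        // Item count: number of log records in the payload.
        count := ld.LogRecordCount()

        // Payload size in bytes, per the proto marshaler's Sizer.
        sizer := &plog.ProtoMarshaler{}
        size := sizer.LogsSize(ld)

        fmt.Printf("records=%d size=%dB\n", count, size)
    }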

Therefore, I propose that we should define new metrics in processorhelper which are automatically recorded. If another more nuanced proposal eventually lands, we can add those later.

Importantly, in order to handle error propagation consistently, these metrics would be recorded after the processor successfully passes data to the next consumer. This can be accomplished within processorhelper by looking at data before and after the processing function, and taking into consideration the result of passing the data along to the next consumer.
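
As a rough sketch (not the actual processorhelper code, and with a hypothetical wrapper type and recording helpers), the wrapping could look something like this:

    package sketch

    import (
        "context"

        "go.opentelemetry.io/collector/consumer"
        "go.opentelemetry.io/collector/pdata/plog"
    )

    // wrappedLogsProcessor is hypothetical; it only illustrates where the
    // measurements could be taken relative to the call to the next consumer.
    type wrappedLogsProcessor struct {
        processFunc    func(context.Context, plog.Logs) (plog.Logs, error)
        nextConsumer   consumer.Logs
        recordIncoming func(ctx context.Context, count int)
        recordOutgoing func(ctx context.Context, count int, err error)
    }

    func (p *wrappedLogsProcessor) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
        incoming := ld.LogRecordCount()

        out, err := p.processFunc(ctx, ld)
        if err != nil {
            // The data never left the processor.
            p.recordIncoming(ctx, incoming)
            return err
        }
        outgoing := out.LogRecordCount()

        // Pass the data along first, then record, so the result from the
        // next consumer can be taken into account.
        err = p.nextConsumer.ConsumeLogs(ctx, out)
        p.recordIncoming(ctx, incoming)
        p.recordOutgoing(ctx, outgoing, err)
        return err
    }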

Specifically, I propose the following metrics per data type. I'm showing only logs but assume there would be equivalents for metrics and traces.

Incoming and outgoing counts.

    otelcol_processor_incoming_log_records: # or processor_consumed_log_records
      enabled: true
      description: Number of log records passed to the processor.
      unit: "{records}"
      sum:
        value_type: int
        monotonic: true
    otelcol_processor_outgoing_log_records: # or processor_emitted_log_records
      enabled: true
      description: Number of log records emitted from the processor.
      unit: "{records}"
      sum:
        value_type: int
        monotonic: true

Incoming and outgoing sizes. These would be measured only if enabled.

    otelcol_processor_incoming_log_size: # or processor_consumed_log_size
      enabled: false # by default
      description: Total size of log records passed to the processor.
      unit: "By"
      sum:
        value_type: int
        monotonic: true
    otelcol_processor_outgoing_log_size: # or processor_emitted_log_size
      enabled: false # by default
      description: Total size of log records emitted from the processor.
      unit: "By"
      sum:
        value_type: int
        monotonic: true
yurishkuro commented 3 months ago

Therefore, I propose that we should define new metrics in processorhelper which are automatically recorded. If another more nuanced proposal eventually lands, we can add those later.

I would caution against automatically recording size; it's not that cheap to compute, and the value of such a metric is lower than that of plain counts.

djaglowski commented 3 months ago

I would caution against automatically recording size; it's not that cheap to compute, and the value of such a metric is lower than that of plain counts.

I agree. To clarify, I would expect that size metrics are disabled by default and the cost should only be incurred if enabled.

Edit: What I mean by "automatically" here is that the individual components do not need to be concerned with recording them. They would be handled in a standard way by the collector's processorhelper package.
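
To illustrate the cost point above, the size computation could be gated so it is only performed when the size metrics are enabled. The flag and helper below are hypothetical:

    package sketch

    import (
        "context"

        "go.opentelemetry.io/collector/pdata/plog"
    )

    // maybeRecordSize is hypothetical: the proto size computation is only
    // performed when the size metrics are enabled, so the cost is only paid
    // when someone opts in.
    func maybeRecordSize(ctx context.Context, enabled bool, ld plog.Logs, recordSize func(context.Context, int)) {
        if !enabled {
            return
        }
        sizer := &plog.ProtoMarshaler{}
        recordSize(ctx, sizer.LogsSize(ld))
    }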

atoulme commented 3 months ago

I have been working on offering guidance for metrics exposed by the collector here: https://github.com/open-telemetry/opentelemetry-collector/pull/10365

From that guidance, my question would be: what attributes do you want to set on those metrics, and what cardinality will they have? It looks like you have none defined in your code sample, which would make identifying the processor used difficult.

@codeboten has an RFC out for internal telemetry: https://github.com/open-telemetry/opentelemetry-collector/pull/10695/files

From that RFC, should we consider prefixing metrics with otelcol_?

djaglowski commented 3 months ago

what attributes do you want to set on those metrics, and what cardinality will they have? It looks like you have none defined in your code sample, which would make identifying the processor used difficult.

Generally, I would expect that the attributes should identify the specific instance of the processor within the collector. Today, that can be established with the following info:

  1. Component ID as specified in the config. e.g. batch or batch/1
  2. Pipeline ID in which the processor is used.

That said, the pipeline ID may be unnecessary if we implement https://github.com/open-telemetry/opentelemetry-collector/pull/10664. The reason is that if we instantiate processors by data type instead of by pipeline, then the only identifying information in the pipeline ID is the data type itself, and all proposed processor metrics are already differentiated by data type. Therefore, I do not believe we'd need anything besides the component ID, which we already have in processorhelper.ObsReport.
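
For illustration only, recording one of the proposed counters with a single attribute identifying the processor instance could look roughly like this. The attribute key "processor" is an assumption, not a settled name:

    package sketch

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    // recordOutgoing is hypothetical; componentID would be the ID from the
    // config, e.g. "batch" or "batch/1".
    func recordOutgoing(ctx context.Context, counter metric.Int64Counter, componentID string, count int) {
        counter.Add(ctx, int64(count),
            metric.WithAttributes(attribute.String("processor", componentID)))
    }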

Thanks for pointing out the RFC - I will take a look at it and consider how this proposal might fit.

djaglowski commented 1 month ago

Discussing with @codeboten and @mx-psi, we think it would make sense to add an outcome dimension to the outgoing metrics. In the short term, this would have two possible values, e.g. success or failure, purely based on whether an error is returned by the next consumer. Longer term, we may add additional granularity to failures based on the types of errors returned by the next consumer. In any case, it's important that all details are captured automatically so that we have consistency. Someone may also propose additional metrics at a later time which are manually recorded by component authors.
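
As a sketch of that outcome dimension, assuming just two values derived purely from whether the next consumer returned an error (the attribute key and values are illustrative):

    package sketch

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    // recordWithOutcome is hypothetical: the outcome is derived purely from
    // whether the next consumer returned an error.
    func recordWithOutcome(ctx context.Context, counter metric.Int64Counter, count int, consumeErr error) {
        outcome := "success"
        if consumeErr != nil {
            outcome = "failure"
        }
        counter.Add(ctx, int64(count),
            metric.WithAttributes(attribute.String("outcome", outcome)))
    }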

danelson commented 4 weeks ago

Not sure this is the right place to ask this, but is there a plan to provide parity with refused/dropped as it relates to the memory limiter processor? I might be missing something, but I don't think I can reason about back pressure and data loss in the same way I was able to previously.

djaglowski commented 3 weeks ago

Not sure this is the right place to ask this, but is there a plan to provide parity with refused/dropped as it relates to the memory limiter processor? I might be missing something, but I don't think I can reason about back pressure and data loss in the same way I was able to previously.

It probably requires a separate proposal but I doubt there would be resistance to the idea. This proposal isn't arguing against that. I am just advocating for doing what can be done for all processors so that we have some uniform metrics.

Generally speaking, we should add useful metrics to individual components, so I think you could propose re-adding refused/dropped. I also think refused/dropped may be universally relevant, even if they cannot be automatically recorded. Therefore, perhaps the end state is that we formalize the semantics of those metrics but leave implementation to component authors. Personally, I think we need to get the basics in place before there is bandwidth to deal with the formal semantics, but having refused/dropped where they already were seems reasonable to me.

djaglowski commented 2 weeks ago

Closing in favor of #11343