open-telemetry / opentelemetry-proto

OpenTelemetry protocol (OTLP) specification and Protobuf definitions
https://opentelemetry.io/docs/specs/otlp/
Apache License 2.0
545 stars 243 forks source link

Why is AggregationTemporality redefined pprofextended.proto #547

Open jack-berg opened 2 months ago

jack-berg commented 2 months ago

Why is AggregationTemporality redefined in this file instead of reusing the definition in metrics.proto?

_Originally posted by @jack-berg in https://github.com/open-telemetry/opentelemetry-proto/pull/534#discussion_r1552403726_

aalexand commented 1 month ago

I think this profiler's proto enum was copied from the metrics proto because we expected that it would need some customization. That customization hasn't happened yet.

TBH the current DELTA and CUMULATIVE values in the num are not 100% match, I think they still (especially the comments) carry a bit of monitoring specifics, such as server restarts.

For profiling, common metric types are:

1) Metric accumulated over the profiling duration. Example - CPU time profiler which runs for certain period of time (e.g. 10 seconds) and reports CPU*seconds consumed over that duration. I guess DELTA is more or less a fit here but the comment needs to be updated to be more relatable to profiling and include some examples (like CPU profiling that I mentioned). 2) Gauge metrics that represent instantaneous value. Example - RAM / heap usage reported by a durationless profiler that simply takes a snapshot of the heap at a point in time. I don't think either DELTA or CUMULATIVE are a fit here.

There is also a flavor of 1) where the profiler returns a snapshot of the data, but the data is actually cumulative since the process start. I think this still fits the DELTA type and it's just a matter of capturing in the profile the right time window to express over what time interval the data was recorded.

aalexand commented 1 week ago

There is also a flavor of 1) where the profiler returns a snapshot of the data, but the data is actually cumulative since the process start. I think this still fits the DELTA type and it's just a matter of capturing in the profile the right time window to express over what time interval the data was recorded.

Thinking more about this, I think we do need to distinguish "delta over the profiling duration" (example: CPU time over 10 seconds in a CPU profile), "cumulative since process start" (example: cumulative alloc bytes in Go heap profiles) and "gauge aka instantaneous value" (example: current heap live bytes in a C++ or Go heap profile's "in use" metrics).

I also wonder if the "aggregation temporality" name could be clearer.