Open jmacd opened 3 years ago
Another reason for this semantic convention worth mentioning:
There is an interest in translating Prometheus Remote Write streams into OTLP streams, where data points with Cumulative temporality SHOULD have a start time. Traditional Prometheus reporting does not include this information, so consumers rely on a reset heuristic to detect when a cumulative series has been reset. When a process.start_time resource is present, Prometheus Remote Write streams can be converted to OTLP streams with correct start times (note: this also requires the recently added Prometheus Remote Write metadata support).
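To make the conversion concrete, here is a minimal sketch of how a translator might fill in start times. The Point class and assign_start_times function are hypothetical names for illustration, not part of any OpenTelemetry or Prometheus API, and the fallback shown is only one simple form of reset heuristic (a drop in value is treated as a reset).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Point:
    """Hypothetical stand-in for a cumulative data point (times in Unix nanoseconds)."""
    time_unix_nano: int
    value: float
    start_time_unix_nano: Optional[int] = None


def assign_start_times(points: List[Point],
                       process_start_unix_nano: Optional[int]) -> List[Point]:
    """Fill in start times for a time-ordered cumulative series.

    If the process start time is known, every point gets it as its start time.
    Otherwise fall back to a heuristic: a decrease in value is assumed to mark
    a reset, and the new sequence is assumed to begin at the previous point.
    """
    out = []
    current_start = process_start_unix_nano
    prev = None
    for p in points:
        if process_start_unix_nano is None:
            if prev is None:
                # First observation: the true start is unknown, use its own time.
                current_start = p.time_unix_nano
            elif p.value < prev.value:
                # Value dropped: assume the series was reset after the previous point.
                current_start = prev.time_unix_nano
        out.append(Point(p.time_unix_nano, p.value, current_start))
        prev = p
    return out
```

With a known start time the heuristic branch is never taken, which is the point of carrying process.start_time alongside the stream.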
@jmacd we discussed this briefly during today's SIG Spec meeting. Is my understanding correct that today you would go with the "more traditional" approach of a process.uptime metric vs. the process.start_time attribute?
Maybe we can add both the process.uptime metric name and the process.start_time attribute to the convention? After all, this does not mean that OpenTelemetry must emit those; it just specifies the canonical name for these things. Whether or not these will be used by an OT component is not a concern here. Is this correct? Example: assuming there's a metrics generation engine that generates the "process uptime" metric (e.g. Telegraf) and a user wants to collect metrics from that engine with OT, that would help them define the OT name for it. Same with the "process start time" attribute. Does it make sense?
Yes. I agree that both specifications are good to have.
process.uptime: defined as a non-monotonic counter to signal that reset is not meaningfully permitted
process.start_time: an attribute with a start timestamp (in a specified format)
It would be nice to establish a semantic connection between these; that is the suggestion made in this issue originally. If you are holding a Span object with a process.start_time resource, you may infer semantically that the process had an uptime of Span.start_time - Resource[process.start_time] when the span started and Span.end_time - Resource[process.start_time] when it finished.
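The inference above amounts to two subtractions. A minimal sketch, assuming all three timestamps are Unix nanoseconds on the same clock; the function name uptime_at_span is hypothetical, and rejecting a span that starts before the process is an assumption of this sketch, not something the convention specifies.

```python
from typing import Tuple


def uptime_at_span(process_start_unix_nano: int,
                   span_start_unix_nano: int,
                   span_end_unix_nano: int) -> Tuple[int, int]:
    """Return (uptime when the span started, uptime when it finished), in nanoseconds."""
    if span_start_unix_nano < process_start_unix_nano:
        # Assumption of this sketch: such a span indicates clock skew or a wrong resource.
        raise ValueError("span starts before process.start_time")
    if span_end_unix_nano < span_start_unix_nano:
        raise ValueError("span ends before it starts")
    return (span_start_unix_nano - process_start_unix_nano,
            span_end_unix_nano - process_start_unix_nano)
```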
I've just noticed that the Elastic Common Schema defines this as process.start with a value of e.g. 2016-05-23T08:05:34.853Z, i.e. a UTC, ISO-formatted timestamp (with millisecond precision, or perhaps with an undefined precision?). Perhaps this is the way to go, what do you think @jmacd?
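For reference, a timestamp in that style is easy to consume with the Python standard library. This is only a sketch: parse_start_time and to_unix_millis are hypothetical helper names, and requiring an explicit UTC offset is an assumption here, since the format for process.start_time has not been decided in this thread.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_start_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2016-05-23T08:05:34.853Z into an aware UTC datetime."""
    # datetime.fromisoformat only accepts a trailing "Z" on newer Python versions,
    # so rewrite it as an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption of this sketch: a timestamp without an offset is rejected.
        raise ValueError("timestamp must carry a UTC offset")
    return parsed.astimezone(timezone.utc)


def to_unix_millis(moment: datetime) -> int:
    """Convert an aware datetime to whole milliseconds since the Unix epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)
```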
I like process.start_time as the more readable option. We also have a similar resource attribute, k8s.pod.start_time, in the collector's k8s processor that should be defined in the spec as well.
What are you trying to achieve?
There has been some discussion about an Uptime metric. For example, the OpenTelemetry-Go runtime instrumentation includes one: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/d1534b84593e617bff9a848454a992a7af49385c/instrumentation/runtime/runtime.go#L122
There is a related request for an up metric, meaning something like "was able to produce metrics", in #1078. The uptime metric is different and can be used for monitoring process longevity, for example. There is a question of whether we should standardize a semantic-conventional metric name for uptime.
However, note that when we know the process start time, we are able to deduce the uptime provided we know that a process was up. Logically, the up metric and a process.start_time resource combine so that we can synthesize a process.uptime metric.
I've encountered a reason to prefer the use of a start_time resource and an up metric as opposed to a process.uptime metric, stated as follows.
An UpDownSumObserver instrument writes an OTLP Non-Monotonic Cumulative Sum data point; there is a well-defined conversion to Gauge in systems such as Prometheus that do not recognize Non-Monotonic Cumulatives. An UpDownCounter instrument writes an OTLP Non-Monotonic Delta Sum data point for the Stateless export configuration, but it is converted to a Cumulative in the default configuration. As long as the state that we maintain in an SDK for Delta-to-Cumulative conversion is never reset, there is no difference to the consumer of an OTLP Non-Monotonic Cumulative Sum (OTLP-NMCS) data point whether it was originally an UpDownSumObserver or an UpDownCounter.
If we move the Delta-to-Cumulative conversion out of the process (e.g., into a sidecar), then there may be a difference between an OTLP-NMCS that was reset and one that was never reset. We could use the start-time resource to detect this difference. This feels significant because ultimately, if the user is going to view a Cumulative Sum as its current, total value, then we should know whether it's the cumulative from the beginning of the process or the cumulative from an arbitrary reset point. In a user interface for an OTLP-NMCS timeseries, I would consider generating an error to say that for Non-Monotonic Sums that have been reset you should only use Rate views, not Total views.
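As a rough illustration of the check described above, a consumer could compare the start time on a cumulative point against the process start time from the resource. This is a minimal sketch under stated assumptions: the function names and the tolerance parameter are hypothetical, timestamps are Unix nanoseconds, and a point whose start time is later than the process start (beyond the tolerance) is assumed to have been reset.

```python
from typing import List


def covers_process_lifetime(point_start_unix_nano: int,
                            process_start_unix_nano: int,
                            tolerance_nanos: int = 0) -> bool:
    """True if the cumulative sum appears to run from the start of the process.

    The tolerance (an assumption of this sketch) allows for instruments that
    are registered shortly after the process itself starts.
    """
    return point_start_unix_nano <= process_start_unix_nano + tolerance_nanos


def allowed_views(point_start_unix_nano: int,
                  process_start_unix_nano: int,
                  tolerance_nanos: int = 0) -> List[str]:
    """Views a UI might offer for a Non-Monotonic Cumulative Sum timeseries."""
    if covers_process_lifetime(point_start_unix_nano,
                               process_start_unix_nano,
                               tolerance_nanos):
        return ["rate", "total"]
    # The sum was reset at some arbitrary point, so its total is not meaningful.
    return ["rate"]
```

A UI following the suggestion above would raise an error when a Total view is requested and only "rate" is returned.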
Concretely speaking, the proposed semantic convention would be named process.start_time and would be documented here.