Semantic conventions for Uptime Monitoring

open-telemetry / oteps

OpenTelemetry Enhancement Proposals

https://opentelemetry.io

Apache License 2.0

Semantic conventions for Uptime Monitoring #185

Open jsuereth opened 2 years ago

jsuereth commented 2 years ago

A proposal and guide.

jmacd commented 2 years ago

I'd like to consider an alternative not mentioned in this document, and I'm not sure where to propose it.

Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime.
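To make the uptime derivation above concrete, here is a minimal sketch in Python. It does not use the OpenTelemetry SDK. `AlivePoint` and `uptime_seconds` are hypothetical names introduced only for illustration. The two timestamp fields are named after the start and measurement timestamps carried by an OTLP data point.

```python
from dataclasses import dataclass

NANOS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class AlivePoint:
    """Hypothetical stand-in for one data point of the proposed "alive" Sum."""

    start_time_unix_nano: int  # when the reporting process started
    time_unix_nano: int        # when this measurement was taken
    value: int = 1             # the proposal always reports 1


def uptime_seconds(point: AlivePoint) -> float:
    """Uptime is the measurement time minus the start time."""
    elapsed_nanos = point.time_unix_nano - point.start_time_unix_nano
    return elapsed_nanos / NANOS_PER_SECOND


# Illustrative timestamps: a process measured 90 seconds after it started.
point = AlivePoint(
    start_time_unix_nano=1_700_000_000 * NANOS_PER_SECOND,
    time_unix_nano=1_700_000_090 * NANOS_PER_SECOND,
)
print(uptime_seconds(point))  # 90.0
```

The point of the sketch is that no separate "uptime" metric is needed as long as the start time travels with every measurement.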

I have made this proposal already in connection with https://github.com/open-telemetry/opentelemetry-specification/issues/1078, where I pointed out that we can implement service discovery in a push-based metrics system by joining this "alive" metric with information retrieved by service discovery.
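As a rough sketch of the join described above, assume service discovery yields a set of expected instance IDs and the metrics backend can tell us when each instance last reported "alive". The function name, the instance IDs, and the 60-second staleness window are all assumptions made for illustration; none of them come from the linked issue.

```python
def find_missing_instances(
    expected: set[str],
    alive_last_seen: dict[str, float],
    now: float,
    staleness_seconds: float = 60.0,
) -> set[str]:
    """Return expected instances with no recent "alive" report.

    expected: instance IDs returned by service discovery (assumed input).
    alive_last_seen: instance ID -> timestamp (seconds) of its last "alive" point.
    """
    reporting = {
        instance
        for instance, last_seen in alive_last_seen.items()
        if now - last_seen <= staleness_seconds
    }
    # Anything discovery knows about that is not reporting is considered down.
    return expected - reporting


expected = {"svc-a", "svc-b", "svc-c"}
alive_last_seen = {"svc-a": 100.0, "svc-b": 10.0}
print(sorted(find_missing_instances(expected, alive_last_seen, now=110.0)))
# ['svc-b', 'svc-c']
```

Here `svc-b` is flagged because its last report is stale, and `svc-c` because it never reported at all. This is the case a push-based system cannot detect from the metric alone.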

jsuereth commented 2 years ago

@jmacd Commented offline, but recording here for posterity.

> Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime.

From a pure collection standpoint, I like a lot of what this brings; however, I think we need to take an end-to-end focus. Specifically: "Can I write a query / dashboard / alert to solve the stated use cases?"

AFAICT, with known backends/query languages (Prometheus, Graphite, etc.) it's hard to pull the data back out, specifically the "Seconds since start" value in PromQL. We should make sure we have an answer to that.

tedsuo commented 1 year ago

@jsuereth how important/relevant is this OTEP? Please assign an appropriate priority, or close if it's old and we no longer need it.

tomasmota commented 1 year ago

What is the state of this? It is still not clear to me how to implement this in otel. I suppose uptime is ok, but the health metric as 1|0 makes it not so useful. Should I then just do uptime for both, and only update health if the checks succeed?

Is it not a common use case that most services would need this in some way? Or are people just relying directly on Kubernetes checks instead? I understand that metrics such as ops/sec are much better, but not all services are doing work all the time, so this is much needed for those.

I opened an issue on this but closed it, expecting that this proposal might progress: https://github.com/open-telemetry/opentelemetry-specification/issues/2923

erasmas commented 1 year ago

I'm also curious about the state of this proposal, since I have the same use case as described in https://github.com/open-telemetry/opentelemetry-specification/issues/2923

tedsuo commented 1 year ago

@jsuereth is this stale, or is semconv currently working on this?

Manuelraa commented 1 year ago

I would also be interested in this. A generic "up" metric for building generic uptime alerting would be awesome, especially if it were part of the standard itself and, for example, integrated into the OpenTelemetry Collector.