open-telemetry / opamp-spec

OpAMP Specification

Basic health state in OpAMP spec #62

Closed tigrannajaryan closed 2 years ago

tigrannajaryan commented 2 years ago

We previously had "health" as a concept in OpAMP. It was removed in favour of reporting it as a metric that was planned to be added to the Otel spec. However, that metric is still undefined in the Otel spec and it is unclear when it will be.

On the other hand, it may be beneficial to have a basic health state reported as part of the OpAMP spec. This can be useful if the OpAMP Server wants to make decisions based on the health state and does not want to depend on fetching this health state from an independent and potentially unavailable metric store to which the Agent sends its metrics.

For example, if the Server performs a mass update of Agents to a newer version, it may want to monitor the health of the fleet and decide to roll back the update if a certain percentage of Agents is unhealthy some time after the update is pushed to the Agents. For the Server to be able to do so it probably needs some basic health information, such as Healthy (bool), Uptime (when the Agent started), etc.

tigrannajaryan commented 2 years ago

@pmm-sumo I am curious what you think about this. You had some ideas about own telemetry.

pmm-sumo commented 2 years ago

I shared my thoughts a few months ago and I think the concepts have evolved quite a bit since.

There are several ideas floating around; let me try to summarize some of them (I'm sure there are more). Broadly speaking, we can assume there is a supervisor which can report one of the following:

Out of these, I find the last one especially compelling. It perhaps revolves around the OpenTelemetry Collector, but I think it's generic enough to be considered as part of OpAMP.

The way it could work is that we could extend the collector component interface with a function that emits the current health status (either periodically, on each state change, or similar). Then we would have more fine-grained details, which can make diagnostics much more efficient. It could address the following cases:

Many of the above could be solved by a mix of own metrics or logs, BTW. The benefit is that, as health events, they would carry certain associated semantics (component ID, severity, timestamp started, timestamp ended) which could then be used to quickly pinpoint collector outages, causes of problems, etc.

This can follow an immutable-records approach and emit changes only. The model could be something like the following:

Event status: WARNING
Event time: Feb 18th, 17:28:01.000
Component: exporters/otlphttp
Name: Elevated number of errors when attempting to export traces
Attributes: { "endpoint": "http://example.com:4318/v1/traces", "count": 3 }

Event status: FAILURE
Event time: Feb 18th, 17:30:01.000
Component: exporters/otlphttp
Name: High number of errors when attempting to export traces
Attributes: { "endpoint": "http://example.com:4318/v1/traces", "count": 45 }

Event status: OK
Event time: Feb 18th, 17:35:01.000
Component: exporters/otlphttp
Name: No errors observed
Attributes: { "endpoint": "http://example.com:4318/v1/traces" }

This is much more fine-grained than just up/down, but it is also much more actionable.
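
To make the shape of such records concrete, here is a minimal Go sketch of an immutable health-event record along the lines of the example above. The type and field names are hypothetical, invented only for illustration; they are not part of OpAMP or the Collector.

package healthevents

import "time"

// EventStatus is the coarse status carried by a health event.
type EventStatus int

const (
    StatusOK EventStatus = iota
    StatusWarning
    StatusFailure
)

// Event is an immutable record emitted only when a component's health changes.
type Event struct {
    Status     EventStatus       // OK, WARNING or FAILURE
    Time       time.Time         // when the state change was observed
    Component  string            // e.g. "exporters/otlphttp"
    Name       string            // human-readable summary of the condition
    Attributes map[string]string // extra context, e.g. endpoint, error count
}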

tigrannajaryan commented 2 years ago

each agent's component status (component status)

I think this is definitely worth exploring. One possible approach is to define a set of semantic conventions for component metrics. Then these metrics can be included in "own metrics" and don't require any special facility in OpAMP.

The benefit is that, as health events, they would carry certain associated semantics (component ID, severity, timestamp started, timestamp ended)

Component ID and severity can be attributes defined by semantic conventions. Timestamps are a regular part of metrics.

It is probably better to discuss component health/metrics in a separate issue.


Going back to the original topic of this issue, which is somewhat different: my question is whether we want basic health information in the OpAMP protocol itself.

pmm-sumo commented 2 years ago

Going back to the original topic of this issue, which is somewhat different: my question is whether we want basic health information in the OpAMP protocol itself.

I believe it will add value, especially for cases where the Agent was not able to load the current config or had other issues starting. The Supervisor could detect such problems and report back the health status, which might be used as a trigger to roll back a deployment and such.

tigrannajaryan commented 2 years ago

An example of logic that the Server may want to implement which requires basic health knowledge:

Note that logic like this cannot be implemented in the Supervisor since it requires aggregate knowledge of the fleet (90% in the example).
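
As a rough illustration only (none of this is part of OpAMP), the aggregate decision could look like the following Go sketch. The types, the grace period, and the 90% threshold are hypothetical, matching the example numbers above.

package fleet

import "time"

// AgentStatus is a hypothetical per-Agent record kept by the Server.
type AgentStatus struct {
    UpdatedAt time.Time // when the new version was pushed to this Agent
    Healthy   bool      // basic health state reported via OpAMP
}

// ShouldRollBack reports whether, gracePeriod after the update was pushed,
// fewer than 90% of the updated Agents are healthy.
func ShouldRollBack(agents []AgentStatus, gracePeriod time.Duration, now time.Time) bool {
    var eligible, healthy int
    for _, a := range agents {
        if now.Sub(a.UpdatedAt) < gracePeriod {
            continue // too early to judge this Agent
        }
        eligible++
        if a.Healthy {
            healthy++
        }
    }
    if eligible == 0 {
        return false
    }
    return float64(healthy)/float64(eligible) < 0.9
}

This is exactly the kind of fleet-wide computation that only the Server can perform, since no single Supervisor sees the other Agents.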

tigrannajaryan commented 2 years ago

Discussed this briefly in the workgroup and @andykellr also thought that this is not a bad idea.

I suggest that we post here the list of possible basic health metrics that we think are useful to have in OpAMP itself, and then we can decide which of those we want to add to the spec.

tigrannajaryan commented 2 years ago

I think we can start by having up (boolean), uptime (seconds) and healthy (boolean) flags.

Note that when using the WebSocket transport the Agent doesn't have to report uptime periodically; reporting it once should be sufficient. The Server can infer the up-to-date value of uptime from the last reported value plus the time passed since then, assuming there was no report of up=false. The Agent is responsible for reporting up=false as necessary, and disconnecting the WebSocket can be treated as up=false after a certain timeout (unless the Agent reconnects).
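
A small Go sketch of that server-side inference, with hypothetical types (this is not an OpAMP API): the Agent reports uptime once, and while the connection is alive and no up=false has arrived, the current uptime is the reported value plus the time elapsed since it was received.

package serverstate

import "time"

// AgentRecord is a hypothetical record the Server keeps per connected Agent.
type AgentRecord struct {
    Up               bool
    ReportedUptime   time.Duration // uptime as last reported by the Agent
    ReportReceivedAt time.Time     // server time when that report arrived
}

// CurrentUptime infers the Agent's present uptime without periodic reports.
func CurrentUptime(r AgentRecord, now time.Time) (time.Duration, bool) {
    if !r.Up {
        return 0, false
    }
    return r.ReportedUptime + now.Sub(r.ReportReceivedAt), true
}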

andykellr commented 2 years ago

I think I would have a slight preference for using something like started_at as a timestamp. If given uptime instead, I would probably compute and store a timestamp as now - uptime so that uptime can be accurately reflected in the UI as now - started_at.

Computing now - uptime suffers from latency (uptime is reported by the agent, now is computed by the server), and started_at suffers from clock skew between the agent and server systems. I'm not sure which is worse.

Edit: Thinking about it more, clock skew is probably worse and latency would be low enough to be insignificant for reporting purposes. Just thought it might be worth mentioning.
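
For clarity, the two conversions being compared are simply the following (hypothetical helpers, shown only to spell out the arithmetic):

package timestamps

import "time"

// On receipt: derive a start timestamp from a reported uptime
// (affected by reporting latency).
func startedAtFromUptime(now time.Time, uptime time.Duration) time.Time {
    return now.Add(-uptime)
}

// In the UI: derive the displayed uptime from a reported start timestamp
// (affected by agent/server clock skew).
func uptimeFromStartedAt(now, startedAt time.Time) time.Duration {
    return now.Sub(startedAt)
}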

tigrannajaryan commented 2 years ago

Thinking about it more, clock skew is probably worse and latency would be low enough to be insignificant for reporting purposes. Just thought it might be worth mentioning.

This is my feeling as well. The latency is going to be limited: typically under a second, and likely never more than tens of seconds even if the request handling is very slow. Clock skew can in theory be arbitrarily large (although that would also be very atypical in a properly managed environment).

tigrannajaryan commented 2 years ago

I created a draft PR with basic health added. I wanted to see what it looks like to use it in the Supervisor. I ended up adding the following health message:

message AgentHealth {
    // The hash of the content of all other fields (even if the other fields are omitted
    // for compression).
    bytes hash = 1;

    // Set to true if the Agent is up and running.
    bool up = 2;

    // Time elapsed since the Agent started, in nanoseconds.
    int64 uptime_nanoseconds = 3;

    // Human readable error message if the Agent is in erroneous state. Typically set
    // when up==false.
    string last_error = 4;
}

The last_error field is new; we did not discuss it in this thread. However, I found that almost everywhere where I want to report "up=false" I also have an appropriate error message to include. I think it is useful to include it for basic troubleshooting purposes, when the full logs are not available (as they are not in our simple example).

I did not include the healthy bool flag mentioned above since I did not have a good way to populate it in our Supervisor. This can be added later if we are sure it is necessary.

Here is the draft PR: https://github.com/open-telemetry/opamp-go/pull/92

What do you think?

andykellr commented 2 years ago

I think that implementation looks good. I have a few small pieces of feedback for the PR if you think it's ready for review, but in general I like the implementation.

tigrannajaryan commented 2 years ago

Thanks for looking @andykellr

@pmm-sumo I am also interested in your opinion. :-)

andykellr commented 2 years ago

As mentioned in the sig meeting, I think we should change uptime_nanoseconds to a started_at value to allow for compression using the hash.

pmm-sumo commented 2 years ago

Apologies for the late response. I think it's a nice idea to have. We've actually been discussing internally whether OpAMP could be a replacement for the "heartbeat" capability we currently have for our collector, and the above would be a nice fit there.

tigrannajaryan commented 2 years ago

As mentioned in the sig meeting, I think we should change uptime_nanoseconds to a started_at value to allow for compression using the hash.

Agreed. Another benefit of started_at is that it makes it easy to detect restarts.
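
The rationale, spelled out: uptime changes on every report, so a hash over the message contents would change every time and the compression scheme could never omit the message, whereas started_at stays constant for as long as the Agent is up. A change in the reported start time then signals a restart; a hypothetical server-side check could be as simple as:

package serverstate

// Restarted reports whether the Agent restarted between two health reports,
// based on a change in the reported start time (hypothetical helper, not an
// OpAMP API).
func Restarted(prevStartUnixNano, newStartUnixNano uint64) bool {
    return prevStartUnixNano != 0 && newStartUnixNano != prevStartUnixNano
}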

tigrannajaryan commented 2 years ago

Draft PR updated to use StartTime (that's the name we use in OTLP, so I went with it instead of StartedAt): https://github.com/open-telemetry/opamp-go/pull/92

tigrannajaryan commented 2 years ago

I updated the draft Go implementation to reflect the discussion above and recent changes to hashes/compression. It now uses the following health message:

message AgentHealth {
    // Set to true if the Agent is up and running.
    bool up = 1;

    // Timestamp since the Agent is up, i.e. when the agent was started.
    // Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
    // If "up" is false this field is unused.
    fixed64 start_time_unix_nano = 2;

    // Human readable error message if the Agent is in erroneous state. Typically set
    // when up==false.
    string last_error = 3;
}
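
A minimal sketch of how an Agent (or its Supervisor) might populate this message, assuming the Go bindings generated from the draft PR expose it as protobufs.AgentHealth with the usual protoc-gen-go field names; treat the import path and field names as assumptions rather than a confirmed API.

package agenthealth

import (
    "time"

    "github.com/open-telemetry/opamp-go/protobufs"
)

// reportHealth builds the health message from the process start time and the
// last startup error, if any.
func reportHealth(startTime time.Time, lastErr error) *protobufs.AgentHealth {
    health := &protobufs.AgentHealth{
        Up:                lastErr == nil,
        StartTimeUnixNano: uint64(startTime.UnixNano()),
    }
    if lastErr != nil {
        health.LastError = lastErr.Error()
    }
    return health
}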