open-telemetry / opamp-spec

OpAMP Specification

Extend AgentHealth message to accommodate component health #165

Closed: mwear closed this issue 10 months ago

mwear commented 1 year ago

The addition of component status reporting to the collector will allow us to have a more nuanced view of collector health by giving us visibility into the health of individual components. At a recent SIG meeting we discussed the possibility of extending the AgentHealth message to include a map that can be used to express the health of multiple components. There are open questions as to how we want to restructure the message and how we can ensure that it remains agent agnostic. Do we want to keep the currently defined AgentHealth message for overall health and add a map for component health, or do we want to express everything as part of the map?

I'll suggest the following as a starting point, but I expect it is probably too coupled to our ideas around collector health. We can refine this as needed. This also retains the current AgentHealth fields as overall health.

message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    map<string, ComponentHealth> component_health_map = 4;
}

message ComponentHealth {
    fixed64 start_time_unix_nano = 1;
    ComponentStatusEnum status = 2;
    string last_error = 3;
}

enum ComponentStatusEnum {
    ComponentStatus_Starting = 0;
    ComponentStatus_OK = 1;
    ComponentStatus_RecoverableError = 2;
    ComponentStatus_PermanentError = 3;
    ComponentStatus_FatalError = 4;
    ComponentStatus_Stopping = 5;
    ComponentStatus_Stopped = 6;
}

I suspect dropping the status enum and replacing it with a string would be a good middle ground for making the ComponentHealth message more generalizable for non-collector agents. If anyone has thoughts or opinions, please comment.

tigrannajaryan commented 1 year ago

Discussed in WG meeting today. I believe @mwear plans to take a look at the following:

mwear commented 11 months ago

Based on the discussion at the last WG meeting, I have experimented with two additional ways to model the AgentHealth message. One is a simplified version of what was previously proposed, replacing the enum with a string; the other is a recursive message.

Option 1: Enum to String

message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    map<string, ComponentHealth> component_health_map = 4;
}

message ComponentHealth {
    bool healthy = 1;
    fixed64 timestamp = 2;
    string status = 3;
    string last_error = 4;
}

Option 2: Recursive Message

message AgentHealth {
    bool healthy = 1;
    fixed64 timestamp = 2;
    string status = 3;
    string last_error = 4;
    map<string, AgentHealth> agent_health_map = 5;
}

I prefer the recursive representation by far. It allows us to cleanly model collector status in terms of component status. The idea is that we'd like to determine the health of the collector from the health of its pipelines, which in turn can be determined from the health of the components in each pipeline. Below is an example of how this could work using the recursive AgentHealth message.

Example Component Health

health := &protobufs.AgentHealth{
    Healthy:   false,
    Timestamp: 1234,
    Status:    "StatusRecoverableError",
    LastError: "Resource Exhausted",
    AgentHealthMap: map[string]*protobufs.AgentHealth{
        "pipeline:traces/0": {
            Healthy:   false,
            Timestamp: 1234,
            Status:    "StatusRecoverableError",
            LastError: "Resource Exhausted",
            AgentHealthMap: map[string]*protobufs.AgentHealth{
                "receiver:otlp": {
                    Healthy:   true,
                    Timestamp: 100,
                    Status:    "StatusOK",
                    LastError: "",
                },
                "processor:batch": {
                    Healthy:   true,
                    Timestamp: 100,
                    Status:    "StatusOK",
                    LastError: "",
                },
                "exporter:jaeger": {
                    Healthy:   false,
                    Timestamp: 1234,
                    Status:    "StatusRecoverableError",
                    LastError: "Resource Exhausted",
                },
            },
        },
        "pipeline:metrics/0": {
            Healthy:   true,
            Timestamp: 200,
            Status:    "StatusOK",
            LastError: "",
            AgentHealthMap: map[string]*protobufs.AgentHealth{
                "receiver:otlp": {
                    Healthy:   true,
                    Timestamp: 100,
                    Status:    "StatusOK",
                    LastError: "",
                },
                "processor:transform": {
                    Healthy:   true,
                    Timestamp: 100,
                    Status:    "StatusOK",
                    LastError: "",
                },
                "processor:batch": {
                    Healthy:   true,
                    Timestamp: 100,
                    Status:    "StatusOK",
                    LastError: "",
                },
                "exporter:otlp": {
                    Healthy:   true,
                    Timestamp: 200,
                    Status:    "StatusOK",
                    LastError: "",
                },
            },
        },
    },
}

The discussion that follows is specific to how we'd like to use this for OTel. The process described would be a semantic convention for us and will not be compulsory for other uses of the protocol.

As part of the component status reporting work (https://github.com/open-telemetry/opentelemetry-collector/pull/8169) we have introduced rules for aggregating multiple statuses into a single effective status (see the component.AggregateStatus function). We can group component status events by the pipelines a component belongs to and compute an effective status for each pipeline. We can then do this one more time, grouping by pipeline status, to compute the effective status of the collector overall.

At each level of aggregation we can use the timestamp and/or error information from the most recent event that is representative of the current aggregated status. For non-error statuses we use the most recent status event, and for error statuses we use the most recent event with the matching aggregated error status. To streamline this process, I introduced a function called component.EffectiveStatus.
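
To make the aggregation concrete, here is a rough sketch using the recursive AgentHealth message. The helpers aggregateHealth and severity are illustrative stand-ins, not the actual component.AggregateStatus / component.EffectiveStatus implementations, and the status ordering is only an assumption:

func aggregateHealth(children map[string]*protobufs.AgentHealth) *protobufs.AgentHealth {
    agg := &protobufs.AgentHealth{Healthy: true, Status: "StatusOK", AgentHealthMap: children}
    for _, child := range children {
        switch {
        case severity(child.Status) > severity(agg.Status):
            // A more severe status wins and carries its error message and timestamp.
            agg.Healthy, agg.Status = child.Healthy, child.Status
            agg.LastError, agg.Timestamp = child.LastError, child.Timestamp
        case child.Status == agg.Status && child.Timestamp > agg.Timestamp:
            // Among events with the matching status, keep the most recent one's details.
            agg.LastError, agg.Timestamp = child.LastError, child.Timestamp
        }
    }
    return agg
}

// severity orders statuses so that error statuses dominate non-error ones.
func severity(status string) int {
    switch status {
    case "StatusFatalError":
        return 3
    case "StatusPermanentError":
        return 2
    case "StatusRecoverableError":
        return 1
    default:
        return 0
    }
}

A pipeline's entry would be aggregateHealth over its components, and the top-level AgentHealth would be aggregateHealth over the pipeline entries.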

In this example, the aggregate status of the traces/0 pipeline is StatusRecoverableError. It corresponds to the exporter:jaeger status event and uses its error message and timestamp. The aggregate status of the metrics/0 pipeline is StatusOK. It corresponds to the exporter:otlp status event (due to the later timestamp) and has no error message. The later timestamp could indicate that there was a hiccup earlier, but the exporter recovered. The aggregate status of the two pipelines is StatusRecoverableError, which is the inherited status for the traces/0 pipeline (via the jaeger exporter). We use the same event details for the overall health of the collector in the top level AgentHealth message.

The timestamp for the events represents the time of the state change into the current status. This is a slightly different use of the timestamp from the current AgentHealth message, which calls this field start_time_unix_nano. We might have to discuss whether or not we can make this change. If we must preserve the timestamp as the start time of the agent, we could make that a semantic convention for the top-level AgentHealth message. In any case, I'd be interested to hear other people's thoughts on this proposal.

jaronoff97 commented 11 months ago

A recursive approach would also be great for the operator group. We have been discussing distributed collector configuration in Kubernetes, and in that world we would probably want to report status for each component that makes up a collector group. Allowing us to represent the DAG of config we will be applying in Kubernetes via this recursive message would enable just that. For instance, if an OTLP exporter CRD shared by multiple collector groups started failing, we would be able to represent it as a top-level key in addition to nesting it under the collector CRD. In the example below, the bridge is reporting and healthy; however, a collector CRD named gateway and the OTLP exporter CRD named vendorexporter that it uses are both unhealthy:

health := &protobufs.AgentHealth{
    Healthy:   true,
    Timestamp: 1234,
    Status:    "Okay",
    LastError: "",
    AgentHealthMap: map[string]*protobufs.AgentHealth{
        "default:otelcol/gateway": {
            Healthy:   false,
            Timestamp: 200,
            Status:    "StatusPermanentError",
            LastError: "Unknown Destination",
            AgentHealthMap: map[string]*protobufs.AgentHealth{
                "exporter/otlp/vendor": {
                    Healthy:   false,
                    Timestamp: 200,
                    Status:    "StatusPermanentError",
                    LastError: "Unknown Destination",
                },
            },
        },
        "default:otlpexporter/vendorexporter": {
            Healthy:   false,
            Timestamp: 200,
            Status:    "StatusPermanentError",
            LastError: "Unknown Destination",
        },
    },
}

Beyond the operator's needs, in the future an agent would be able to report minimal change sets with this approach by specifying only the keys for the components whose health changed. From Matt's example, if the metrics OTLP exporter suddenly had a problem, the following could be reported:

health := &protobufs.AgentHealth{
    AgentHealthMap: map[string]*protobufs.AgentHealth{
        "pipeline:metrics/0": {
            AgentHealthMap: map[string]*protobufs.AgentHealth{
                "exporter:otlp": {
                    Healthy:   false,
                    Timestamp: 200,
                    Status:    "StatusRecoverableError",
                    LastError: "Resource Exhausted",
                },
            },
        },
    },
}

Doing it this way would allow us to minimize messages sent, which would be a net benefit for users.
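
For illustration, a server could fold such a change set into the last full health tree it received with something like the recursive merge below. mergeHealth is purely hypothetical, as is the convention that a node with an empty status string carries no update of its own:

func mergeHealth(current, delta *protobufs.AgentHealth) {
    if delta.Status != "" {
        // This node carries an update; overwrite its own fields.
        current.Healthy = delta.Healthy
        current.Timestamp = delta.Timestamp
        current.Status = delta.Status
        current.LastError = delta.LastError
    }
    // Recurse into only the keys present in the delta; untouched entries keep
    // their previously reported health.
    for key, child := range delta.AgentHealthMap {
        if existing, ok := current.AgentHealthMap[key]; ok {
            mergeHealth(existing, child)
        } else {
            if current.AgentHealthMap == nil {
                current.AgentHealthMap = map[string]*protobufs.AgentHealth{}
            }
            current.AgentHealthMap[key] = child
        }
    }
}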

tigrannajaryan commented 11 months ago

Question: will we have OTel-wide semantic conventions that define the possible values for the status field, or do we just say that the values are agent-specific?

A few random comments:

  1. We should not change the field number for the last_error field unnecessarily.
  2. Maybe rename AgentHealth message to ComponentHealth or something else that works for both top-level and components.
  3. Maybe rename agent_health_map field to component_health_map in the recursive version.
  4. We lost start_time in the recursive version. This was supposed to be useful for calculating "uptime", which is a common thing to show in a UI about the status of something that runs.

mwear commented 11 months ago

Question: will we have OTel-wide semantic conventions that define the possible values for the status field, or do we just say that the values are agent-specific?

This is a good question. I don't have a definitive answer and I think it could go either way. We have a set of statuses we expect for the collector, but they are collector-specific. We could say that the values are agent-specific, but we could formalize the ones we know for the agents we care about. I think we should discuss this further and decide what to do.

If I understand your other comments correctly, you are suggesting that we keep a version of the AgentHealth message and introduce a recursive ComponentHealth message that will be part of it. If that is correct, I have two possible representations (below). If I'm misunderstanding, let me know.

Recursive Option 2:

The AgentHealth message contains embedded fields for overall agent health, and a component_health_map specifically for components.

message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    string status = 4;
    fixed64 timestamp = 5;
    map<string, ComponentHealth> component_health_map = 6;
}

message ComponentHealth {
    bool healthy = 1;
    string status = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    map<string, ComponentHealth> component_health_map = 5;
}

Recursive Option 3:

The AgentHealth message is just a start_time and component_health_map. Overall agent health goes in the top level of the component_health_map.

message AgentHealth {
    bool healthy = 1 [deprecated=true];
    fixed64 start_time_unix_nano = 2;
    string last_error = 3 [deprecated=true];
    map<string, ComponentHealth> component_health_map = 4;
}

message ComponentHealth {
    bool healthy = 1;
    fixed64 timestamp = 2;
    string status = 3;
    string last_error = 4;
    map<string, ComponentHealth> component_health_map = 5;
}

Is either one of these on the right track?

evan-bradley commented 11 months ago

I would put in a vote for another variation, which is how I interpreted the suggestions @tigrannajaryan made:

message ComponentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    string status = 5;
    map<string, ComponentHealth> component_health_map = 6;
}

While we do lose the use of "agent" anywhere in the message, I think this provides the greatest flexibility. I think it is okay for agents and components of agents to both be called components, and I think there is validity in setting a start time for every component.

As for semantic conventions for the statuses, while the Collector component statuses are intended for the Collector specifically, they should apply well to any networked system. At a minimum, we could likely establish a similar set of statuses that could be generalized to most agents. I think making the field a string should be sufficient for agents with non-conformant statuses to report them and have the server handle them as agent-specific.

mwear commented 11 months ago

This makes sense to me. I would imagine for the collector (for the time being) that the start time of its components will be the start time of the collector, but I could see this being useful in a system where components can be restarted.

tigrannajaryan commented 11 months ago

I would put in a vote for another variation, which is how I interpreted the suggestions @tigrannajaryan made:

message ComponentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    string status = 5;
    map<string, ComponentHealth> component_health_map = 6;
}

Yes, this looks better to me. We can probably also rename timestamp to last_timestamp or status_timestamp to make it clear how it differs from start_time.

I would also like to understand how the last_error and status fields relate (if they do). Do we, for example, define a convention that if status == "error" then last_error contains the error message?

mwear commented 11 months ago

I can get a PR together to add this message, but I have a couple of questions about the AgentToServer message. Do we intend to keep AgentHealth and add ComponentHealth, or do we intend to replace AgentHealth with ComponentHealth? The replacement would likely entail deprecating the AgentHealth message and using ComponentHealth for component and agent health going forward.

Also in response to:

I would also like to understand how the last_error and status fields relate (if they do). Do we, for example, define a convention that if status == "error" then last_error contains the error message?

The last_error and status fields are related. We would like to define the convention as you've suggested.
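
As a concrete illustration of the proposed convention (field values are made up, and Go bindings for the new ComponentHealth message are assumed), status carries the state name and last_error is only populated when that state is an error:

ok := &protobufs.ComponentHealth{
    Healthy:           true,
    StartTimeUnixNano: 100,
    Timestamp:         100,
    Status:            "StatusOK",
    LastError:         "", // non-error status: no error message
}

failing := &protobufs.ComponentHealth{
    Healthy:           false,
    StartTimeUnixNano: 100,
    Timestamp:         1234,
    Status:            "StatusRecoverableError",
    LastError:         "Resource Exhausted", // error status: last_error carries the message
}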

evan-bradley commented 11 months ago

Do we intend to keep AgentHealth and add ComponentHealth, or do we intend to replace AgentHealth with ComponentHealth? The replacement would likely entail deprecating the AgentHealth message and using ComponentHealth for component and agent health going forward.

I would advocate for taking advantage of the protocol's beta status to replace AgentHealth with ComponentHealth in a way that produces the cleanest possible resulting protobuf message. I think having the same message at every level is more straightforward than having an identical but differently-named message for the top level.

The beta definition linked to in the OpAMP spec seems to indicate it would be permissible to do this; do we have any obligations that would prevent it?

tigrannajaryan commented 11 months ago

I believe we need to replace AgentHealth with ComponentHealth. It is backward compatible from a wire perspective, since ComponentHealth is a superset of AgentHealth. It is not a breaking change even for old Servers: they should be able to receive the new ComponentHealth message, ignore the new fields, and interpret the message correctly.

Compare:

message AgentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
}

with:

message ComponentHealth {
    bool healthy = 1;
    fixed64 start_time_unix_nano = 2;
    string last_error = 3;
    fixed64 timestamp = 4;
    string status = 5;
    map<string, ComponentHealth> component_health_map = 6;
}

It is a breaking change from the symbols perspective (the name of the message changes), but we are allowed to do it in Beta.
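
To illustrate the wire-compatibility point, a purely hypothetical sketch (it assumes Go bindings generated for both messages and the google.golang.org/protobuf proto package): bytes produced by the new message still unmarshal cleanly into the old one, because fields 1-3 keep their numbers and types and the unknown fields 4-6 are simply skipped.

newMsg := &protobufs.ComponentHealth{
    Healthy:           true,
    StartTimeUnixNano: 1234,
    LastError:         "",
    Timestamp:         5678,
    Status:            "StatusOK",
}

data, err := proto.Marshal(newMsg)
if err != nil {
    panic(err)
}

// An old Server that only knows AgentHealth still decodes fields 1-3 and
// ignores the rest.
oldMsg := &protobufs.AgentHealth{}
if err := proto.Unmarshal(data, oldMsg); err != nil {
    panic(err)
}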

tigrannajaryan commented 10 months ago

@mwear is there anything else to do here, or can we close the issue since the PR is merged?

mwear commented 10 months ago

The work has been addressed by #168. I'll go ahead and close this issue.