open-telemetry / opamp-spec

OpAMP Specification
Apache License 2.0

Clarify agent health reporting #136

Closed by tigrannajaryan 1 year ago

tigrannajaryan commented 2 years ago

The AgentHealth message currently has an up field and a last_error field.

It is not clear how to set these fields if the agent process is started and running but unhealthy (e.g. when we can verify its health by polling a health check endpoint). Should we set up to true or false in this case?

The up field definition is

Set to true if the Agent is up and running.

So, it seems like we should set it to true. However, there is no other explicitly defined way to indicate unhealthiness, unless we assume the presence of last_error is that indicator.

We need to either clarify the spec to say that last_error is the indicator, add another field to indicate unhealthiness (e.g. bool healthy), or perhaps rename up to healthy.
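To make the ambiguity concrete, here is a minimal Python sketch. It assumes a local stand-in class with the two fields named above (it is not the generated protobuf class), and the helper is_healthy_by_last_error is a hypothetical name for the "last_error is the indicator" reading. The error string is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentHealth:
    # Local stand-in mirroring the two fields discussed in this issue.
    # This is NOT the real generated protobuf message.
    up: bool = False
    last_error: str = ""


def is_healthy_by_last_error(health: AgentHealth) -> bool:
    """Hypothetical reading: the agent is healthy only if it is up
    and last_error is empty."""
    return health.up and health.last_error == ""


# Process is running, but polling the health check endpoint failed.
running_but_unhealthy = AgentHealth(up=True, last_error="health check failed")
# Process is running and the health check passed.
running_and_healthy = AgentHealth(up=True)
# Process is not running at all.
down = AgentHealth(up=False)

print(is_healthy_by_last_error(running_but_unhealthy))  # False
print(is_healthy_by_last_error(running_and_healthy))    # True
print(is_healthy_by_last_error(down))                   # False
```

Under this reading a receiver has to combine two fields to learn the health state, and nothing in the current field definitions tells it to do so. That is the gap an explicit healthy field or a spec clarification would close.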

tigrannajaryan commented 2 years ago

@andykellr @PeterF778 what do you think?

andykellr commented 2 years ago

I agree that this is unclear in the spec. I think healthy is a better name. I think an agent that is down is also unhealthy so I do not think we currently need another field to represent running/not-running.

tigrannajaryan commented 2 years ago

> I agree that this is unclear in the spec. I think healthy is a better name. I think an agent that is down is also unhealthy so I do not think we currently need another field to represent running/not-running.

What do we do with start_time_unix_nano in that case? It is specified to be set when up is true. Should we untie these two fields?
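One possible way to untie them, sketched in Python purely for discussion: start_time_unix_nano follows whether the process is running, while healthy follows the health check result. The class is a local stand-in using the proposed healthy name, and build_health with its parameters is a hypothetical helper. None of this is what the spec currently says.

```python
from dataclasses import dataclass


@dataclass
class AgentHealth:
    # Local stand-in using the proposed rename (up -> healthy).
    # This is NOT the real generated protobuf message.
    healthy: bool = False
    start_time_unix_nano: int = 0
    last_error: str = ""


def build_health(process_running: bool, check_passed: bool,
                 started_at_ns: int, error: str = "") -> AgentHealth:
    """Hypothetical helper: the start time depends only on whether the
    process is running, not on whether it is healthy."""
    return AgentHealth(
        healthy=process_running and check_passed,
        start_time_unix_nano=started_at_ns if process_running else 0,
        last_error=error,
    )


# Running but failing its health check: unhealthy, yet the start time is kept.
degraded = build_health(True, False, 1_700_000_000_000_000_000, "health check failed")
print(degraded.healthy)               # False
print(degraded.start_time_unix_nano)  # 1700000000000000000

# Not running: unhealthy and no start time.
stopped = build_health(False, False, 1_700_000_000_000_000_000)
print(stopped.start_time_unix_nano)   # 0
```

With this split, a non-zero start time would implicitly signal "running", so a single healthy field might still be enough to cover the running-but-unhealthy case.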