Proposal: Intelligent Health Monitoring for the Collector

PettitWesley commented 3 years ago

Our goal is to make it easier to “observe the observers”

Currently the collector has a health extension: https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/healthcheckextension

However it is very rudimentary. We want to build a more sophisticated health extension that will emit a health status based on whether or not the collector is functioning properly and achieving the user’s end goal.

Principles/Goals

Intelligent: The health extension should be “smart”, it should emit healthy or healthy based on whether or not the collector is “working as expected”. The definition of “working as expected” is fuzzy; our goal is the health extension to be a best effort tool. It may not always show unhealthy when the user would want it to, but it should cover many useful cases. For example, a user might want to the collector to mark itself as unhealthy if the following happens:
- An exporter can not send data due to permissions, networking, or some configuration issue.
- A receiver has not collected any data
- A processor or any component has emitted X failures or has not returned success for Y minutes.
Configurable: The user should have control over which sorts of failures lead to an unhealthy status.

Minimum Viable Solution: Health Extension that integrates with the plugin framework

Our hope is that we could build an extension which simply integrates with the core of the collector. A user can configure criteria and then the collector will be marked as healthy or unhealthy. For example, if any exporter fails X retries then that could be an unhealthy condition.

Presumably this should be easy. The core code of the collector must track errors returned by exporters.

The health monitor must be sophisticated and store state on which components have failed. For example consider the following:

Exporter #1 fails retries => collector marked unhealthy
Exporter #2 fails retries as well => collector still unhealthy
Exporter #1 recovers => collector still unhealthy because exporter #2 is still failing
Exporter #2 recovers => collector now back to healthy

Nice-to-have Proposal 1: Allow plugins to interact with the health monitor

We are trying to figure out if the solution outlined above is sufficient. What if each plugin could directly interact with the health monitor and set the health status as needed?

For example, say could add the following line of code in an error case that you know would be unrecoverable/critical for your users:

monitor.SetUnhealthy()

At the moment, we can not think of any situations which require this... if the health system can monitor errors from components that should be sufficient criteria to mark it as healthy/unhealthy.

Long term work? Health Monitoring for Collector and SDK

This is just an idea that I want to throw out in case its good- could we monitor the health of the SDK + Collector as a system?

For example, may be the SDK and collector can establish a heart beat to verify their connection. And if that breaks, then the collector could be configured to mark itself unhealthy. This covers you in cases where the SDK isn’t sending any data because there’s nothing to send; the heart beat lets you know that this is safe/expected, rather than indicating a problem.

PettitWesley commented 3 years ago

CC @JohnWu20 who will be working on this with me

bogdandrutu commented 3 years ago

Couple of questions:

Who is the consumer of this? Human or the machine?
What would be the protocol to expose this? Would a health per component be enough?

PettitWesley commented 3 years ago

@bogdandrutu I'm thinking the consumer would be the same as the existing health check extension. As in, it can just be an http endpoint on the collector that returns 200 or not to show the status. Just that now the criteria for healthy is more sophisticated. You can use it for your container health check. And then the health status can be consumed by humans or machines.

I think the current health extension is almost entirely useless/meaningless... the idea is that instead of some simple and relatively opaque "healthy means the collector is 'alive'", users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.

Would a health per component be enough?

I think the code probably needs to track the health of each component. The user should be exposed with a single collector health status I think. It might also be useful to let the health system log a message every time a components health changes just in case users are interested in diving deeper.

bogdandrutu commented 3 years ago

users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.

What are the available input data for this conditions?

FYI: Trying to better understand what we need to build in order to figure out what code changes we need :)

PettitWesley commented 3 years ago

@bogdandrutu I don't know the core functionality of the collector... but the core of the code must track errors from each plugin?

I came to OT from Fluent Bit, it has an "engine" which manages the plugins and sends them data and tracks their success/error/retry status for processing that data. At least for processors/exporters that's how it works. For receivers I'm a bit fuzzy on how that works, but you could definitely track if a receiver has sent you any data.

With those data inputs, you can build a health criteria. You can say for example, that for me healthy means OTLP receiver is uptaking data and the AWS X-Ray exporter is not returning errors for sends. (The receiver part needs some sort of timeout for how long not receiving data is "bad". Don't focus too much on the receiver use case actually... my thinking is that this is mainly useful for exporters).

bogdandrutu commented 3 years ago

but the core of the code must track errors from each plugin?

Yes I am thinking about something in these lines, expand Component interface to expose a getHealth() functionality which every component can implement.

We do have a "service" which manages the lifetime of all the "Components" so if every component exposes the health we can then make "Service" expose this smart health status.

But I am also thinking how a helper could be implemented to allow components to not have to implement all these "customizations".

PettitWesley commented 3 years ago

@bogdandrutu I have added this on the agenda for the next collector SIG meeting.

expand Component interface to expose a getHealth() functionality which every component can implement.

That might be ideal to allow each component to implement its own logic for setting its health... but I think there is also a way to implement it that does not require any component to implement anything new.

Again I don't know the collector internals deeply enough, but I saw here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#data-ownership

Each exporter implements ConsumeTraces/ConsumeMetrics/ConsumeLogs- AFAICT each of those functions returns an error- does that error indicate whether they succeeded or not? If it does, then I presume there's some central code that manages calling those functions for each component as needed- could the health system integrate there and just track errors returned by components?

PettitWesley commented 3 years ago

Actually we'll probably have to discuss this at next week's Collector SIG, just realized I have a conflict tomorrow morning.

bogdandrutu commented 3 years ago

@PettitWesley for a rough idea, look into this packet https://github.com/open-telemetry/opentelemetry-collector/tree/main/obsreport we called this from every component.

PettitWesley commented 3 years ago

todo: Need to check out the zpages extension, it has much of the data we need, may be this health check is an enhancement to it?

https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension

PettitWesley commented 3 years ago

@bogdandrutu Please send us the details when you get a chance on how to use obsreport as discussed in the SIG:

Get the number of successes/failure calls for a exporter, not the number of metrics/traces/logs that failed/succeeded to be sent.
Configure an exporter in opencensus to consume these metrics in the code

bogdandrutu commented 3 years ago

See this about adding views https://github.com/open-telemetry/opentelemetry-collector/issues/3039. Also see https://github.com/census-instrumentation/opencensus-go#views

What you need to do is to register a view for the "send_failed_spans" measure with the aggregation Count instead of Sum.

bogdandrutu commented 3 years ago

For exporter, see the example https://github.com/census-instrumentation/opencensus-go/blob/master/examples/gauges/gauge.go#L150 you can implement that interface and keep values in memory instead of in the logs.

rakyll commented 3 years ago

A naive comment about health check without much context about the past discussions.

When we think about health, we usually think about the case where the collector process is available and ready to accept incoming requests. We should tackle https://github.com/open-telemetry/opentelemetry-collector/issues/3002 in order to determine whether collector is ready to "accept" requests at the OTLP endpoints.

@PettitWesley I can help with the OC views and export if you need a hand.

skyduo commented 3 years ago

Hey @bogdandrutu , we have a design doc for the health check for OT and a few questions include in it, may need you to take a look and approve the design. https://docs.google.com/document/d/1SpUMsWA2DeaoVazeQ8uEc1Wvu5LphmQU_TjzLmuJ4QM/edit#

skyduo commented 3 years ago

Hi @bogdandrutu ! As we want to add the awsecshealthcheckextension to the opentelemetry-collector-contrib repo due to this is AWS ECS specific extension, there is a issue that the obsreportconfig is located inside the internal package in the opentelemetry-collector, so i cannot get the obsreportconfig/obsmetrics which is need to build our own view, we have some options like:

could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement
if we cannot directly use it, then we have to rewrite a lot of duplicate code in the obsreportconfig, which i don't think this pr is good to merge if write a lot of code same as the existing one.

so could you help us on that?

bogdandrutu commented 3 years ago

could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement

Let's craft a small PR and add the views that you need in core, see how large that change is. Is that enough? Would that resolve your issue?

skyduo commented 3 years ago

Hey @bogdandrutu ! do you mean add our new view into the obsreportconfig in core? i'm not sure if my understand is correct but seems like the new view will still located in the internal folder, right? we still cannot access to it

open-telemetry / opentelemetry-collector