Closed PettitWesley closed 1 year ago
CC @JohnWu20 who will be working on this with me
Couple of questions:
@bogdandrutu I'm thinking the consumer would be the same as the existing health check extension. As in, it can just be an http endpoint on the collector that returns 200 or not to show the status. Just that now the criteria for healthy is more sophisticated. You can use it for your container health check. And then the health status can be consumed by humans or machines.
I think the current health extension is almost entirely useless/meaningless... the idea is that instead of some simple and relatively opaque "healthy means the collector is 'alive'", users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.
Would a health per component be enough?
I think the code probably needs to track the health of each component. The user should be exposed with a single collector health status I think. It might also be useful to let the health system log a message every time a components health changes just in case users are interested in diving deeper.
users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.
What are the available input data for this conditions?
FYI: Trying to better understand what we need to build in order to figure out what code changes we need :)
@bogdandrutu I don't know the core functionality of the collector... but the core of the code must track errors from each plugin?
I came to OT from Fluent Bit, it has an "engine" which manages the plugins and sends them data and tracks their success/error/retry status for processing that data. At least for processors/exporters that's how it works. For receivers I'm a bit fuzzy on how that works, but you could definitely track if a receiver has sent you any data.
With those data inputs, you can build a health criteria. You can say for example, that for me healthy means OTLP receiver is uptaking data and the AWS X-Ray exporter is not returning errors for sends. (The receiver part needs some sort of timeout for how long not receiving data is "bad". Don't focus too much on the receiver use case actually... my thinking is that this is mainly useful for exporters).
but the core of the code must track errors from each plugin?
Yes I am thinking about something in these lines, expand Component
interface to expose a getHealth()
functionality which every component can implement.
We do have a "service" which manages the lifetime of all the "Components" so if every component exposes the health we can then make "Service" expose this smart health status.
But I am also thinking how a helper could be implemented to allow components to not have to implement all these "customizations".
@bogdandrutu I have added this on the agenda for the next collector SIG meeting.
expand Component interface to expose a getHealth() functionality which every component can implement.
That might be ideal to allow each component to implement its own logic for setting its health... but I think there is also a way to implement it that does not require any component to implement anything new.
Again I don't know the collector internals deeply enough, but I saw here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#data-ownership
Each exporter implements ConsumeTraces/ConsumeMetrics/ConsumeLogs- AFAICT each of those functions returns an error- does that error indicate whether they succeeded or not? If it does, then I presume there's some central code that manages calling those functions for each component as needed- could the health system integrate there and just track errors returned by components?
Actually we'll probably have to discuss this at next week's Collector SIG, just realized I have a conflict tomorrow morning.
@PettitWesley for a rough idea, look into this packet https://github.com/open-telemetry/opentelemetry-collector/tree/main/obsreport we called this from every component.
todo: Need to check out the zpages extension, it has much of the data we need, may be this health check is an enhancement to it?
https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension
@bogdandrutu Please send us the details when you get a chance on how to use obsreport as discussed in the SIG:
See this about adding views https://github.com/open-telemetry/opentelemetry-collector/issues/3039. Also see https://github.com/census-instrumentation/opencensus-go#views
What you need to do is to register a view for the "send_failed_spans" measure with the aggregation Count instead of Sum.
For exporter, see the example https://github.com/census-instrumentation/opencensus-go/blob/master/examples/gauges/gauge.go#L150 you can implement that interface and keep values in memory instead of in the logs.
A naive comment about health check without much context about the past discussions.
When we think about health, we usually think about the case where the collector process is available and ready to accept incoming requests. We should tackle https://github.com/open-telemetry/opentelemetry-collector/issues/3002 in order to determine whether collector is ready to "accept" requests at the OTLP endpoints.
@PettitWesley I can help with the OC views and export if you need a hand.
Hey @bogdandrutu , we have a design doc for the health check for OT and a few questions include in it, may need you to take a look and approve the design. https://docs.google.com/document/d/1SpUMsWA2DeaoVazeQ8uEc1Wvu5LphmQU_TjzLmuJ4QM/edit#
Hi @bogdandrutu ! As we want to add the awsecshealthcheckextension to the opentelemetry-collector-contrib repo due to this is AWS ECS specific extension, there is a issue that the obsreportconfig is located inside the internal package in the opentelemetry-collector, so i cannot get the obsreportconfig/obsmetrics which is need to build our own view, we have some options like:
so could you help us on that?
could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement
Let's craft a small PR and add the views that you need in core, see how large that change is. Is that enough? Would that resolve your issue?
Hey @bogdandrutu ! do you mean add our new view into the obsreportconfig in core? i'm not sure if my understand is correct but seems like the new view will still located in the internal folder, right? we still cannot access to it
Our goal is to make it easier to “observe the observers”
Currently the collector has a health extension: https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/healthcheckextension
However it is very rudimentary. We want to build a more sophisticated health extension that will emit a health status based on whether or not the collector is functioning properly and achieving the user’s end goal.
Principles/Goals
Minimum Viable Solution: Health Extension that integrates with the plugin framework
Our hope is that we could build an extension which simply integrates with the core of the collector. A user can configure criteria and then the collector will be marked as healthy or unhealthy. For example, if any exporter fails X retries then that could be an unhealthy condition.
Presumably this should be easy. The core code of the collector must track errors returned by exporters.
The health monitor must be sophisticated and store state on which components have failed. For example consider the following:
Nice-to-have Proposal 1: Allow plugins to interact with the health monitor
We are trying to figure out if the solution outlined above is sufficient. What if each plugin could directly interact with the health monitor and set the health status as needed?
For example, say could add the following line of code in an error case that you know would be unrecoverable/critical for your users:
At the moment, we can not think of any situations which require this... if the health system can monitor errors from components that should be sufficient criteria to mark it as healthy/unhealthy.
Long term work? Health Monitoring for Collector and SDK
This is just an idea that I want to throw out in case its good- could we monitor the health of the SDK + Collector as a system?
For example, may be the SDK and collector can establish a heart beat to verify their connection. And if that breaks, then the collector could be configured to mark itself unhealthy. This covers you in cases where the SDK isn’t sending any data because there’s nothing to send; the heart beat lets you know that this is safe/expected, rather than indicating a problem.