grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0
1.45k stars 216 forks source link

Expose Alloy's overall component health via an HTTP endpoint #2061

Status: Open · thampiotr opened this issue 2 weeks ago

thampiotr commented 2 weeks ago

Request

Expose the overall health of all running components via an HTTP endpoint: if at least one component is unhealthy, return an error.

For example, if I request

GET /-/ready?strict=true

and one component is failing, we should get back:

503: Component 'foo.bar.baz' is not healthy: 'error message'

Use case

Users may want to set up a Kubernetes liveness or readiness probe on Alloy that checks not only that the Alloy runtime has started, but also that all components are healthy.

Currently, Alloy's /-/ready endpoint returns 200 even if some components fail to start. This is useful when running multiple pipelines, because the healthy ones keep running. However, some users may prefer "fail hard" behaviour.

DWebb0 commented 6 days ago

I'm registering my interest here. Form3 are exploring the VictoriaMetrics agent as a workaround.