grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

[Feature Request] Alloy container health check #1891

Open gregbrowndev opened 1 month ago

gregbrowndev commented 1 month ago

Request

Hi,

The standard OTel collector has a health check extension that can be used in deployments to restart the container if it fails:

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckv2extension/README.md

I understand the v2 extension is still experimental and that v1 is being deprecated due to its limitations. However, it would still be useful to be able to detect that the container has failed. I haven't yet set up the Alloy integration with Grafana Cloud, but I will be doing so soon.

Is there a plan for a native health check and readiness check we can use when deploying Alloy containers?

Thanks!
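For reference, Alloy's built-in HTTP server already exposes a readiness endpoint at /-/ready on its listen address (127.0.0.1:12345 by default, overridable with --server.http.listen-addr), so a coarse container check is possible today. The sketch below, in Go, shows a probe binary that could back a Docker HEALTHCHECK or a Kubernetes exec probe; the ALLOY_ADDR variable and the file name are illustrative, and note that this only confirms the server is up, not that individual components are healthy.

// healthprobe.go: minimal container health probe for Alloy's built-in
// HTTP server. Exits 0 if /-/ready returns 200, 1 otherwise, so it can
// back a Docker HEALTHCHECK or a Kubernetes exec probe.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Default Alloy listen address; set ALLOY_ADDR if the container
	// starts Alloy with a custom --server.http.listen-addr.
	addr := os.Getenv("ALLOY_ADDR")
	if addr == "" {
		addr = "127.0.0.1:12345"
	}

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s/-/ready", addr))
	if err != nil {
		fmt.Fprintf(os.Stderr, "probe failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "probe failed: status %d\n", resp.StatusCode)
		os.Exit(1)
	}
}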

Use case

As a new Grafana Cloud user, I've been setting up an observability solution with metrics, traces, and logs.

I requested that native histogram support be enabled in Grafana, but this seemed to break my Alloy configuration without any error explaining why:

ts=2024-10-15T13:11:36.774541357Z level=info msg="node exited without error" node=prometheus.remote_write.metrics

This seemed to break the entire Alloy server: I wasn't able to get logs or traces either, because of the broken prometheus component.

Reverting send_native_histograms to false fixed the issue:

prometheus.remote_write "metrics" {
    // Exports metrics to Prometheus backend
    // https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.remote_write/
    endpoint {
        url                     = env("PROMETHEUS_SERVER_URL")
        send_native_histograms = false

        tls_config {
            insecure_skip_verify = env("TLS_ENABLED") == "false"
        }

        basic_auth {
            username = env("PROMETHEUS_USERNAME")
            password = env("PROMETHEUS_PASSWORD")
        }
    }
}

However, the deployment succeeded anyway, because the container was never detected as unhealthy.
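This is exactly the gap a server-level readiness endpoint can't cover: the HTTP server keeps answering while a component inside it has died. As a stopgap, a stricter probe could inspect Alloy's own /metrics output and fail when any component reports itself unhealthy or exited. A sketch in the same vein as the probe above, assuming the controller metric alloy_component_controller_running_components with a health_type label is exposed by your build (verify against your version's /metrics output before relying on it):

// componentprobe.go: stricter probe that scrapes Alloy's own /metrics
// and fails when any component counts as unhealthy or exited. The
// metric name and labels are assumptions; verify them on your build.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

func main() {
	addr := os.Getenv("ALLOY_ADDR")
	if addr == "" {
		addr = "127.0.0.1:12345"
	}

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s/metrics", addr))
	if err != nil {
		fmt.Fprintf(os.Stderr, "probe failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	bad := 0.0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Expected exposition format, e.g.:
		// alloy_component_controller_running_components{health_type="unhealthy"} 0
		if !strings.HasPrefix(line, "alloy_component_controller_running_components") {
			continue
		}
		if !strings.Contains(line, `health_type="unhealthy"`) &&
			!strings.Contains(line, `health_type="exited"`) {
			continue
		}
		fields := strings.Fields(line)
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			bad += v
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintf(os.Stderr, "probe failed: %v\n", err)
		os.Exit(1)
	}

	if bad > 0 {
		fmt.Fprintf(os.Stderr, "probe failed: %v components unhealthy or exited\n", bad)
		os.Exit(1)
	}
}

Paired with a restart policy, a check like this would have flagged the broken prometheus.remote_write component instead of letting the rollout go green.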

github-actions[bot] commented 2 days ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!