grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Improve visibility of failed config reloads #304

Open thampiotr opened 10 months ago

thampiotr commented 10 months ago

Request

Currently, when running the Agent via the Helm chart with config-reloader and providing an invalid config, the following logs can be observed:

config-reloader 2023/11/17 13:27:25 config map updated
config-reloader 2023/11/17 13:27:25 performing webhook request (1/1)
grafana-agent ts=2023-11-17T13:27:25.272090463Z level=info msg="reload requested via /-/reload endpoint" service=http
grafana-agent ts=2023-11-17T13:27:25.273812047Z level=info msg="config reloaded" service=http
config-reloader 2023/11/17 13:27:25 error: Received response code 400 , expected 200
grafana-agent ts=2023-11-17T13:27:26.701242214Z level=info msg="rejoining peers" peers=10-244-1-9.agent-grafana-agent-cluster.agent.svc.cluster.local.:80,10-244-2-9.agent-grafana-agent-cluster.agent.svc.cluster.local.:80,10-244-3-8.agent-grafana-agent-cluster.agent.svc.cluster.local.:80
config-reloader 2023/11/17 13:27:35 error: Webhook reload retries exhausted

(the logs from config-reloader/grafana-agent are prefixed as such)

There are two problems I see with this:

Version used: docker.io/grafana/agent:v0.37.4

Proposed change

Use case

As a user, when I make changes to the configuration, I'd like to receive feedback when something is incorrect. While it's possible to test the config locally, some changes may still fail when deployed to a k8s cluster and applied via /-/reload. A failed reload doesn't break the running agents, but it fails rather silently from the point of view of the agent's logs. Any subsequent changes in the cluster will result in pods crashlooping, as they don't have the previous valid version of the config.
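
As a stopgap while the agent's own logs stay quiet, the failure can be surfaced from the caller's side. The sketch below is only an illustration: it calls the /-/reload endpoint seen in the logs above and returns the status code together with the response body. The function name `reload_config`, the use of POST, and the expectation that the body of a 400 response carries the reason are assumptions, not confirmed behaviour of the agent.

```python
import urllib.error
import urllib.request
from typing import Tuple


def reload_config(base_url: str, timeout: float = 5.0) -> Tuple[bool, int, str]:
    """Trigger a config reload and return (ok, status_code, response_body).

    Hypothetical helper: only the /-/reload path comes from the logs in this
    issue; using POST is an assumption.
    """
    url = base_url.rstrip("/") + "/-/reload"
    req = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return True, resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as the 400 in the logs) end up here.
        # The body is returned as-is so the caller can log whatever it contains.
        return False, err.code, err.read().decode("utf-8", "replace")
    # Connection-level failures (urllib.error.URLError) are left to propagate.
```

Calling `reload_config("http://<agent-host>:<port>")` after a config change and logging the returned body on failure gives at least one place where a rejected config is visible.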

hainenber commented 10 months ago

I'd love to tackle this. Can you assign me? Thanks

ptodev commented 7 months ago

Hi @hainenber! Thank you for your interest and apologies that your message was missed. Are you still interested in this? @kurczynski added a log line in grafana/agent#6283, but if you would like to, you could add a metric (unless we already have a similar metric)?
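
To make the metric idea concrete, here is a minimal sketch of one possible shape: a gauge for whether the last reload succeeded, plus a timestamp of the last successful one (Prometheus exposes a similar pair for its own config reloads). It is written in plain Python for illustration only; the class `ReloadStatus` and both metric names are hypothetical, and a real change would live in the agent's Go code and use its existing metrics library.

```python
import time
from typing import Optional


class ReloadStatus:
    """Tracks the outcome of the most recent config reload (illustrative only)."""

    def __init__(self) -> None:
        self.last_successful = 0            # 1 if the last reload succeeded, else 0
        self.last_success_timestamp = 0.0   # unix time of the last successful reload

    def record(self, error: Optional[Exception], now: Optional[float] = None) -> None:
        """Record a reload attempt; pass error=None for success."""
        if error is None:
            self.last_successful = 1
            self.last_success_timestamp = time.time() if now is None else now
        else:
            # Keep the old timestamp so the age of the running config stays visible.
            self.last_successful = 0

    def exposition(self) -> str:
        """Render both gauges in Prometheus text format (metric names are made up)."""
        return (
            "# TYPE agent_config_last_reload_successful gauge\n"
            f"agent_config_last_reload_successful {self.last_successful}\n"
            "# TYPE agent_config_last_reload_success_timestamp_seconds gauge\n"
            f"agent_config_last_reload_success_timestamp_seconds {self.last_success_timestamp}\n"
        )
```

With gauges like these, an alert on the first one being 0 would flag a rejected config even when nobody is reading the logs.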