alex-berger opened 1 week ago
Not 100% sure yet, but it looks like I can provoke this by updating the configuration (ConfigMap); the resulting config reload seems to trigger the behavior described above. In my case I am simply changing `logging.level` back and forth between `info` and `warn` to make sure the config actually changes. It's still racy, as not all Pods from the DaemonSet become nonfunctional, but some do.
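To make this concrete, the toggle is just the `logging` block in the Alloy configuration shipped via the ConfigMap; a minimal sketch (everything else stays unchanged between reloads):

```alloy
// Minimal sketch: flipping level between "info" and "warn"
// forces Alloy to reload its configuration.
logging {
  level  = "info"
  format = "logfmt"
}
```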
As a work-around we added a `livenessProbe` to the Alloy DaemonSet, which makes sure the `alloy` container is restarted if its metrics endpoint becomes unhealthy (starts responding with status code `5XX`):
```yaml
containers:
  - name: alloy
    livenessProbe:
      httpGet:
        path: /metrics    # the endpoint that starts returning 5XX when the bug occurs
        port: 12345       # Alloy's default HTTP server port
        scheme: HTTP
      initialDelaySeconds: 30
      timeoutSeconds: 2
      periodSeconds: 30
      successThreshold: 1
      failureThreshold: 3 # restart after ~90s of consecutive failures
```
What's wrong?
A few days ago, some of my Alloy DaemonSet Pods suddenly stopped working properly, responding with server errors (500) on their `/metrics` endpoint. The DaemonSet was still running, with no container crashes or restarts (the `readinessProbe` was doing fine), but the affected Pods were evidently no longer working correctly.
Looks like this was caused by errors of the kind `was collected before with the same name and label values`. I have not observed this before, and after restarting the Pods the problem disappeared. Therefore, I suspect that there must be some kind of race-condition bug in Alloy. Here is an example of the output from the `/metrics` endpoint. What immediately caught my attention is that the affected metrics are all from `stage.metrics` blocks (inside a `loki.process` component). See the configuration below for more details.
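For illustration, the affected metrics come from blocks of roughly this shape (a simplified sketch rather than my actual configuration; the component labels, metric name, and the `loki.write.default` target are placeholders):

```alloy
loki.process "example" {
  forward_to = [loki.write.default.receiver]

  // Metrics defined here are exposed on Alloy's /metrics endpoint;
  // these are the ones that fail with the duplicate-collection error.
  stage.metrics {
    metric.counter {
      name        = "log_lines_total"
      description = "total number of log lines processed"
      action      = "inc"
      match_all   = true
    }
  }
}
```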
Steps to reproduce
This happens sporadically once in a while and I have not yet figured out how to reliably reproduce it. That's also why I suspect this is some kind of "race condition" bug.
System information
Linux 6.1.92, amd64 and arm64, Bottlerocket OS 1.20.3 (aws-k8s-1.29)
Software version
Grafana Alloy 1.2.0
Configuration
Logs