ngc104 opened 3 years ago
I don't know how Alertmanager internals work, but I guess that silences may exist on one Alertmanager and not the other, and that both communicate to know whether an alert is silenced.
Silences are gossiped between the two Alertmanagers when clustered, so both will have the same silences.
OK. So we have another mystery... (I won't open an issue for that: I don't even know how to reproduce it, and we have had no problem with it...) Let's focus on the green line...
What did you do?
I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.
You can reproduce this without Kthxbye:
1/ Generate an alert (or use any alert sent by Prometheus), for example `PrometheusNotIngestingSamples`.
2/ With Alertmanager, generate silences like this:
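For example, a small shell loop with amtool that creates a fresh 1-minute silence every minute; this is a minimal sketch, and the alertmanager.url, author, and comment values are placeholders for whatever matches your setup:

```bash
# Add a new 1-minute silence for the alert once per minute
# (URL, author and comment are illustrative values; adjust to your deployment).
while true; do
  amtool silence add alertname=PrometheusNotIngestingSamples \
    --alertmanager.url=http://localhost:9093 \
    --duration=1m \
    --author=test \
    --comment="silence churn test"
  sleep 60
done
```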
Note: Kthxbye behaves similarly, but its default silence duration is 15 minutes instead of 1 minute. Reproducing with amtool shows that Kthxbye itself has nothing to do with this bug.
What did you expect to see?
Nothing interesting (no abnormal memory increase)
What did you see instead? Under which circumstances?
Follow the metric `container_memory_working_set_bytes` for Alertmanager. After a few hours you can see it grow slowly. Here is a screenshot of the above test, covering a little more than 12 hours: the test started at 12:20 and finished at 9:00 the next day.
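To watch the same metric from the command line rather than a dashboard, you can query Prometheus directly; a minimal sketch, assuming a Prometheus reachable at prometheus.monitoring:9090 that scrapes cAdvisor, and a pod name containing "alertmanager" (all of these are assumptions about your cluster):

```bash
# Return the current working-set memory of the Alertmanager container
# (Prometheus URL and label selector are assumptions; adjust as needed).
curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{pod=~".*alertmanager.*", container!=""}'
```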
My Alertmanager is running with the default `--data.retention=120h`. I guessed that after 5 days the memory would stop increasing. Wrong guess: it stops increasing only at OOM, when the container is killed automatically. The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a `kubectl delete pod` (right side).
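Since expired silences are kept until the retention period elapses, it can also help to check how many silences Alertmanager is still holding; a minimal sketch with amtool, assuming the API is reachable on localhost:9093 (for example via a port-forward):

```bash
# Count active and expired silences still stored by Alertmanager
# (localhost:9093 is an assumption; the first line of each listing
# is a table header, hence `tail -n +2`).
amtool silence query --alertmanager.url=http://localhost:9093 | tail -n +2 | wc -l
amtool silence query --expired --alertmanager.url=http://localhost:9093 | tail -n +2 | wc -l
```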
Environment

System information:
Kubernetes (deployed with https://github.com/prometheus-community/helm-charts/tree/main/charts/alertmanager)
Alertmanager version:
➜ k -n monitoring logs caascad-alertmanager-0
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13