Open ngc104 opened 3 years ago
Thanks for your report.
Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and upload it here ?
Hello, my holidays start today... Too bad...
However, I'm not alone on this issue. You should get the profile from someone else soon (give us a few hours to reproduce the bug again)
Regards
Hi @roidelapluie
I'm in @ngc104 's team. During his holidays I will follow this issue. I tried to reproduce the bug yesterday but unfortunately the pod was restarted before I was able to extract the profile and the heap. I'll try to provide you this information as soon as I can :)
Thx for your help
Regards,
Hi guys
Any update on this case ? Tell me if you need more information
Regards,
Hi @roidelapluie
Can we have an update on this issue ? Do you need more information ?
Regards
Hi everyone
Would it be possible to have an update on this case ? It could become critical on our side
Regards,
Hi everyone, @roidelapluie
Thanks for your report.
Would you be able to take and upload a profile with http://alertmanager/debug/pprof/heap when the memory is high and provide it to us ?
I sent you the requested element at the beginning of August. Could you help us on this case please ?
Regards Winael
Hello,
Somehow the heap dump does not seem to help. I see only a few megabytes of memory used here. I will try to reproduce this on my side asap.
Hi @roidelapluie
Have you succeeded in reproducing this issue on your side ? Do you have any information that could help us please ?
Regards, Winael
I did not have the chance yet. Maybe you can come up with a pprof that shows the issue ? The one that was sent previously was not really conclusive.
Edited on 13/06/2023 (typo and switch to 0.25.0) Edited on 29/08/2023 (switch to 0.26.0 + run alertmanager --version to show the version + added trace.sh)
(yes, sorry, this is rather complex just for curling a tar.gz and uncompressing it)
ALERTMANAGER_VERSION="0.26.0"
ALERTMANAGER_ARCHIVE="alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
ALERTMANAGER_URL="https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/${ALERTMANAGER_ARCHIVE}"
# Download
[ -e "${ALERTMANAGER_ARCHIVE}" ] || curl -sL "${ALERTMANAGER_URL}" -o "${ALERTMANAGER_ARCHIVE}"
[ -d "alertmanager" ] || (tar xzf "${ALERTMANAGER_ARCHIVE}" ; mv "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64" "alertmanager")
➜ cat <<EOF > alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'mytest'
receivers:
  - name: 'mytest'
EOF
./alertmanager/alertmanager --version > version.txt
./alertmanager/alertmanager
Alertmanager is working : http://localhost:9093
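Optionally, a quick sanity check that the instance is up (Alertmanager exposes /-/healthy and /-/ready endpoints):
# Both should answer with HTTP 200 once Alertmanager has started.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9093/-/healthy
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9093/-/ready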
while true; do
date;
./alertmanager/amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar
./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence add -a "awesome user" -d 1m -c "awesome comment" alertname=foo
sleep 60
done
The first amtool call sends an alert to Alertmanager just like Prometheus would do it.
The second amtool call sets a silence on that alert just like kthxbye would do it.
The difference with Prometheus/Kthxbye : we harass Alertmanager every minute.
DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"
go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs.pdf"
We can also play with go tool pprof -http localhost:12345 http://localhost:9093/debug/pprof/allocs
➜ cat <<'EOF' > trace.sh
DATE=$(date +'%Y%m%d_%H%M%S')
REPORTDIR="profiles"
mkdir -p "${REPORTDIR}"
FILEPREFIX="${REPORTDIR}/profile-${DATE}"
export PPROF_TMPDIR="$(mktemp -d /tmp/pprof-XXXXXX)"
echo "Traces in '${PPROF_TMPDIR}'"
echo "=======================================" >> "${REPORTDIR}/report.txt"
echo "${DATE}" >> "${REPORTDIR}/report.txt"
./alertmanager/alertmanager --version >> "${REPORTDIR}/report.txt"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > /dev/null
mv "${PPROF_TMPDIR}/pprof.alertmanager.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz" "${FILEPREFIX}.pb.gz"
go tool pprof -sample_index=alloc_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=alloc_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
rmdir "${PPROF_TMPDIR}"
EOF
Then run "./trace.sh" from time to time.
I'm back on this issue. I harassed Alertmanager for a workday.
Here are some PDFs (I don't know if it is the preferred format, sorry)
09h19 profile-20220113_091951-allocs.pdf 10h59 profile-20220113_105909-allocs.pdf 16h55 profile-20220113_165542-allocs.pdf
09h19 42Mb (nodes accounting for 33245.19kB, 78.29% of 42466.72kB total)
10h59 75Mb (nodes accounting for 56.61MB, 75.30% of 75.18MB total)
16h55 173Mb (nodes accounting for 134.34MB, 77.30% of 173.80MB total)
Because I can now reproduce the bug easily, please tell me what kind of pprof traces you need to investigate.
Hello @roidelapluie I'm back on this issue (replacing @Winael ). If you can come back on it too (or anybody else), please tell me how to take pprof traces if you need more infos.
Note : I took heap traces too, all day long. On the PDF I can read "Showing nodes accounting for 3609.71kB, 100% of 3609.71kB total" at the beginning and at the end of the day. The heap seems to be stable.
I was going through the profiles and noticed that you are using alloc_space, which is about how much memory space was allocated in total. Can you use inuse_space and check if the memory consumption is going up?
@prashbnair not sure I understand... You want me to run go tool pprof -sample_index=inuse_space instead of just go tool pprof ?
I will try that ASAP.
Are PDF files the best output for you ?
Starting taking the traces. They will be posted later today.
Here is my script to take traces :
#! /bin/bash
DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"
go tool pprof -sample_index=inuse_space -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_space.pdf"
go tool pprof -sample_index=inuse_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_objects.pdf"
go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_space.pdf"
go tool pprof -sample_index=alloc_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_objects.pdf"
I restarted my test (as described in a previous comment above).
I took some traces right after I started Alertmanager and my loop for the alert, at 09:19. I took other traces after lunch at 14:02. And I ended my test at 17:58.
I tried the 4 sample indexes I found for allocations. I hope that it is what you wanted @prashbnair .
cc @roidelapluie
@ngc104 By looking at the inuse_space I notice that it does not go up over time. Are you still able to reproduce the error?
@prashbnair Yes.
In fact, the main problem is Kubernetes killing the Alertmanager container when it goes OOM, because its "memory" goes above the resource limit.
What is the "memory" measured by Kubernetes to kill the container ? If it is not alloc_space, I don't know how to measure it with go pprof.
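For what it's worth, the Go runtime's own accounting can be compared with an OS-level view by scraping Alertmanager's /metrics endpoint (a sketch; the container_memory_working_set_bytes value itself comes from cAdvisor/cgroups, not from Alertmanager):
# Standard Go client metrics exposed on /metrics; process_resident_memory_bytes
# is the closest in-process approximation of the RSS the kernel accounts for.
curl -s http://localhost:9093/metrics | grep -E '^(go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|process_resident_memory_bytes) '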
Restarted my test at 9h22 UTC. Here is some interesting result at 10h03 UTC...
➜ ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence query --expired
ID Matchers Ends At Created By Comment
2fca38e1-9dc3-4fbf-8f35-291dcac02a34 alertname="foo" 2022-05-02 09:22:11 UTC awesome user awesome comment
8ab24f44-c779-4fb7-8ba5-da9061b635d0 alertname="foo" 2022-05-02 09:23:11 UTC awesome user awesome comment
7347ceff-527c-4106-8dc9-861d8205b374 alertname="foo" 2022-05-02 09:24:11 UTC awesome user awesome comment
26299122-2d9c-44dc-bc46-fa945379f4d7 alertname="foo" 2022-05-02 09:25:11 UTC awesome user awesome comment
7d27178d-59ad-443f-81dd-e9351b6a9701 alertname="foo" 2022-05-02 09:26:11 UTC awesome user awesome comment
0219f67a-507a-4b4a-99e4-bf8020fe6615 alertname="foo" 2022-05-02 09:27:11 UTC awesome user awesome comment
73173814-25d6-4181-8de7-0677b8011fca alertname="foo" 2022-05-02 09:28:11 UTC awesome user awesome comment
82cf631a-07c3-4f90-8131-e7d4ade49d35 alertname="foo" 2022-05-02 09:29:11 UTC awesome user awesome comment
37978ad4-9c1f-4ad7-a6af-d75e33baf75b alertname="foo" 2022-05-02 09:30:11 UTC awesome user awesome comment
dc17d617-ac31-460a-949f-4021fd945374 alertname="foo" 2022-05-02 09:31:11 UTC awesome user awesome comment
41f5a1b8-ce3a-4a85-a118-672b0dce2267 alertname="foo" 2022-05-02 09:32:11 UTC awesome user awesome comment
d0e8b085-fe67-49dc-9ee2-96e6161a7d0d alertname="foo" 2022-05-02 09:33:12 UTC awesome user awesome comment
222bae9c-7b07-4a80-997b-06867411d6c8 alertname="foo" 2022-05-02 09:34:12 UTC awesome user awesome comment
025cd910-b437-4455-b925-2744f7a5e027 alertname="foo" 2022-05-02 09:35:12 UTC awesome user awesome comment
0f6dcfc7-8320-4bfb-aab5-fc79fe661546 alertname="foo" 2022-05-02 09:36:12 UTC awesome user awesome comment
3ca66616-f985-4e7c-b60f-7526fa0e7ae6 alertname="foo" 2022-05-02 09:37:12 UTC awesome user awesome comment
0c2ef5b5-5e82-4823-8453-e223f18289a7 alertname="foo" 2022-05-02 09:38:12 UTC awesome user awesome comment
7abe6854-ab5e-4fc8-bd6e-a7495ef8b366 alertname="foo" 2022-05-02 09:39:12 UTC awesome user awesome comment
41a55f9a-b330-494f-a842-8e97e740fb27 alertname="foo" 2022-05-02 09:40:12 UTC awesome user awesome comment
8c361742-e1c0-44c3-8d7b-530771335458 alertname="foo" 2022-05-02 09:41:12 UTC awesome user awesome comment
e95a41c6-91f8-45e2-9bcb-aa21d32fb60e alertname="foo" 2022-05-02 09:42:12 UTC awesome user awesome comment
6016bdfa-2638-47eb-81b6-502f2317a29f alertname="foo" 2022-05-02 09:43:12 UTC awesome user awesome comment
11bb8940-1865-4512-8dd2-248535f02997 alertname="foo" 2022-05-02 09:44:12 UTC awesome user awesome comment
51853d4f-a9cc-450c-bc4c-84d40975be83 alertname="foo" 2022-05-02 09:45:12 UTC awesome user awesome comment
200dcc2e-5b5d-4ed8-80cd-3b87c93d35f6 alertname="foo" 2022-05-02 09:46:12 UTC awesome user awesome comment
fec736cf-ff1d-4a03-95e9-9615b834eba6 alertname="foo" 2022-05-02 09:47:12 UTC awesome user awesome comment
1aac8602-7c30-4eda-a37b-a8e54bcf6252 alertname="foo" 2022-05-02 09:48:12 UTC awesome user awesome comment
4b3abc68-5621-400f-9c56-65b05980c568 alertname="foo" 2022-05-02 09:49:12 UTC awesome user awesome comment
1a3276b4-1f61-4df4-8baf-a98906b972f5 alertname="foo" 2022-05-02 09:50:12 UTC awesome user awesome comment
b06f383d-a4e5-45ae-bf09-0b7cb32a2665 alertname="foo" 2022-05-02 09:51:12 UTC awesome user awesome comment
9e6a8329-e395-4578-909e-111f22c9cb8f alertname="foo" 2022-05-02 09:52:13 UTC awesome user awesome comment
ebcd1791-cb78-48d6-bd46-db768cb42476 alertname="foo" 2022-05-02 09:53:13 UTC awesome user awesome comment
c0f29585-db7a-4560-90d2-825fc99b819a alertname="foo" 2022-05-02 09:54:13 UTC awesome user awesome comment
0efb0aca-4471-41d7-8daf-a9be279bac23 alertname="foo" 2022-05-02 09:55:13 UTC awesome user awesome comment
a6d2db55-c5c8-4e6e-9eed-e1ec87da51e5 alertname="foo" 2022-05-02 09:56:13 UTC awesome user awesome comment
90bc593e-0518-479a-8c02-4072bc0c4421 alertname="foo" 2022-05-02 09:57:13 UTC awesome user awesome comment
49c4bc3b-57a3-4d6e-8bd8-8166addb18fa alertname="foo" 2022-05-02 09:58:13 UTC awesome user awesome comment
971734d2-c8ad-40bd-a27c-95c592d65efc alertname="foo" 2022-05-02 09:59:13 UTC awesome user awesome comment
c773ba25-af00-4045-9a70-b63e75ecd5a8 alertname="foo" 2022-05-02 10:00:13 UTC awesome user awesome comment
fb8f23e9-32ad-4ab4-b449-cf86e34c5917 alertname="foo" 2022-05-02 10:01:13 UTC awesome user awesome comment
4e95d124-d052-403d-b5e6-86d0acc22c17 alertname="foo" 2022-05-02 10:02:13 UTC awesome user awesome comment
Could the "memory leak" be something other than a memory leak ? Could it be the accumulation of expired silences not being flushed ?
I was wrong on this one... I was exploring data.retention in the Production environment. We have data.retention unset (i.e. using the default value, 120h / 5 days).
With amtool, I notice that I have a lot of silences, but none older than 5 days. There are no alerts except the 2 or 3 active and suppressed ones (3 alerts won't create a memory leak, right ?)
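For the record, a quick way to count how many expired silences are still retained, using the same amtool as above (adjust the path and URL to your environment; tail skips the header line):
# Number of expired silences still kept by Alertmanager (bounded by --data.retention).
amtool --alertmanager.url=http://localhost:9093/ silence query --expired | tail -n +2 | wc -l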
When I restart Alertmanager, I would expect memory consumption to grow continuously for 5 days and then become flat. But I observe that it continues to grow after 5 days. So my guess was wrong : this is not the accumulation of expired silences waiting to be flushed.
I'm thinking again about a memory leak. But no idea on how to debug it.
Hello and happy new year !
Alertmanager-0.25.0 is released. Let's give it a try... Same procedure as above (in a previous comment)
12h37 profile-20230106_155103-allocs.pdf 14h06 profile-20230106_140612-allocs.pdf 16h55 profile-20230106_123718-allocs.pdf
12h37 14Mb (nodes accounting for 14467.52kB, 100% of 14467.52kB total)
14h06 43Mb (nodes accounting for 27635.94kB, 63.32% of 43644.80kB total)
16h55 66Mb (nodes accounting for 43.69MB, 65.99% of 66.21MB total)
Nothing seems to have changed in 0.25.0 related to this issue.
Hello,
I had not noticed that I was running a standalone alertmanager in cluster mode. In order to remove any noise in the investigation, I retried with this command line :
./alertmanager/alertmanager --cluster.listen-address=""
Then I took some traces as usual :
10h56 : profile-20230113_105632-allocs.pdf 11h34 : profile-20230113_113413-allocs.pdf 14h01 : profile-20230113_140152-allocs.pdf 14h44 : profile-20230113_144421-allocs.pdf 18h58 : profile-20230113_185807-allocs.pdf
10h56 : 15Mb (nodes accounting for 15216.73kB, 100% of 15216.73kB total)
11h34 : 23Mb (nodes accounting for 17373.59kB, 73.87% of 23519.01kB total)
14h01 : 47Mb (nodes accounting for 34402.62kB, 71.93% of 47825.89kB total)
14h44 : 56Mb (nodes accounting for 40.78MB, 72.40% of 56.33MB total)
18h58 : 99Mb (Showing nodes accounting for 72MB, 72.58% of 99.20MB total)
I hope these traces are easier to use without the noise of the data structures needed for the absent cluster.
:wave: it would help if you could share the raw pprof files.
Hello,
Sorry for the delay... Too much work besides this incident. And I'm not sure that the "raw" pprof files are obtained just with the -raw option instead of -pdf. Please give me more info if the following does not fit.
And I forgot to start Alertmanager with --cluster.listen-address="". So this one was running as a single-node cluster...
Here are the logs : profiles.zip
Here is my script to generate them :
#! /bin/bash
DATE=$(date +'%Y%m%d_%H%M%S')
mkdir -p profiles
FILEPREFIX="profiles/profile-${DATE}"
go tool pprof -sample_index=inuse_space -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_space.raw"
go tool pprof -sample_index=inuse_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_objects.raw"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_space.raw"
go tool pprof -sample_index=alloc_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_objects.raw"
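For reference, the binary profile can also be captured directly from the endpoint with curl, without going through go tool pprof (a sketch reusing FILEPREFIX from the script above):
# /debug/pprof/allocs serves the profile as a gzipped protobuf, which
# go tool pprof can open later for offline analysis.
curl -s -o "${FILEPREFIX}-allocs.pb.gz" http://localhost:9093/debug/pprof/allocs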
Edit : I also uploaded binary files (I have no idea how they work, but I guess you like them too) : pprof.zip
cc @simonpasquier
I missed the fact that you're using version 0.21.0. Can you try with v0.25.0?
All tests were done with the latest version.
In 2021, when I created the issue, it was tested with 0.21.0. The latest test in 2023 (June 13th) was with 0.25.0.
Every time a new version was released, I tested again hoping it was solved. But with 0.25 it is not solved yet.
Hello,
Alertmanager 0.26 is released :partying_face:
The bug is still there :bomb:
I modified my comment above with the procedure to reproduce the bug (I changed the way to run go tool pprof).
And here are new profiles.zip files.
Below is an extract of the report.txt file that I added in the zip file :
alertmanager, version 0.26.0 (branch: HEAD, revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)
build user: root@df8d7debeef4
build date: 20230824-11:11:58
go version: go1.20.7
platform: linux/amd64
tags: netgo
...
=======================================
20230829_114408
Showing nodes accounting for 155789, 99.48% of 156600 total
Showing nodes accounting for 34175.86kB, 100% of 34175.86kB total
=======================================
20230829_124408
Showing nodes accounting for 293671, 98.51% of 298100 total
Showing nodes accounting for 52.35MB, 100% of 52.35MB total
=======================================
20230829_134408
Showing nodes accounting for 376318, 96.80% of 388746 total
Showing nodes accounting for 63.89MB, 100% of 63.89MB total
=======================================
20230829_144408
Showing nodes accounting for 432682, 95.95% of 450947 total
Showing nodes accounting for 81.63MB, 100% of 81.63MB total
=======================================
20230829_154408
Showing nodes accounting for 492675, 94.88% of 519250 total
Showing nodes accounting for 94.59MB, 100% of 94.59MB total
=======================================
20230829_164408
Showing nodes accounting for 666111, 94.82% of 702485 total
Showing nodes accounting for 98.54MB, 86.75% of 113.58MB total
=======================================
20230829_174409
Showing nodes accounting for 734087, 95.01% of 772671 total
Showing nodes accounting for 119.25MB, 89.80% of 132.79MB total
=======================================
20230829_184409
Showing nodes accounting for 819591, 94.48% of 867496 total
Showing nodes accounting for 136.83MB, 91.30% of 149.87MB total
As you can see something is growing... (dates are in format YYYYMMDD_HHMMSS)
cc @simonpasquier
Hi! 👋 I attempted to reproduce this on Alertmanager 0.26.0 using the same script but so far I have not been able to. I ran the script for 1h:
#!/bin/sh
while true; do
./amtool --alertmanager.url=http://127.0.0.1:9093 alert add foo=bar
./amtool --alertmanager.url=http://127.0.0.1:9093 silence add foo=bar -c="" -d 1m
sleep 5
done
Here is go_memstats_heap_objects and go_memstats_heap_inuse_bytes :
To see if it was related to data retention not deleting expired silences, I ran Alertmanager with --data.retention=5m.
I was fed up with rebuilding my test env each time I wanted to check whether the bug was still there. I also thought it would be good to make it easier for everyone to reproduce. So I wrote some scripts : https://github.com/ngc104/alertmanager-issue-2659
TL;DR :
./buid_env.sh
./run.sh
But the README.md file is quite short. You can read it.
I can see the bug in the Kubernetes environment with the container_memory_working_set_bytes metric. But what is that metric ?
In short, this metric is close to the RSS of a process and comes from Linux cgroups.
This is the metric used by Kubernetes to detect OOMs and kill containers. For that specific reason, it is important to use exactly that metric and not other similar ones.
After a lot of reading, I decided to write a small script of my own to expose that metric.
My script is 25 lines long and should work on most Linux systems (cgroup v1 and v2)... Not so hard to understand and fix if it does not work for you.
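For illustration, a minimal sketch of such a working-set probe (the actual 25-line script lives in the linked repo); it assumes the container's cgroup is mounted at /sys/fs/cgroup:
#! /bin/bash
# Working set ~= memory usage minus inactive file cache; this is what cAdvisor
# reports as container_memory_working_set_bytes.
if [ -f /sys/fs/cgroup/memory.current ]; then
    # cgroup v2
    usage=$(cat /sys/fs/cgroup/memory.current)
    inactive=$(awk '$1 == "inactive_file" {print $2}' /sys/fs/cgroup/memory.stat)
else
    # cgroup v1
    usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
    inactive=$(awk '$1 == "total_inactive_file" {print $2}' /sys/fs/cgroup/memory/memory.stat)
fi
echo "workingset_bytes $((usage - inactive))"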
Thanks to your graphs @grobinson-grafana I thought I could provide graphs too, and let anyone have their own graphs easily.
So my test env is made of :
Now it's easy to run this stack and provide graphs.
The stack was launched for hours with different data retention.
Alertmanager version : 0.26.0
Start : 10h35 End : 21h10 Duration : 10h35
EDIT 02/02/2024 : we had more metrics than we thought. Here are new and larger graphs (retention 2h)
Start : 10h35 End : 10h27 Duration : 23h08
Start : 10h50 End : 10h20 Duration : 23h30
Start : 11h40 End : 21h00 Duration : 9h20
Note : this is the retention that @grobinson-grafana used for the test two and a half months ago.
Note : the metrics have been aggregated with avg_over_time(...[15m]) to remove spikes and see the trends better.
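As a hypothetical example of that kind of query (assuming the Prometheus from the lab stack listens on localhost:9090 and the series carries a job="alertmanager" label; both are illustrative, not the exact names from the repo):
# Smoothed working-set series, averaged over a 15-minute window.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(container_memory_working_set_bytes{job="alertmanager"}[15m])'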
The number of expired silences is mostly the same (except at the beginning, of course).
The exception is dataretention=5m : why does it increase so much 2 hours after the beginning and then stabilize 1h30 after the increase ? Well, that is not the subject of this issue. Let's just notice it and forget it for a while.
This metric comes from Alertmanager itself.
The number of heap objects looks stable (after the growing phase at the beginning).
Notice that with dataretention=5m the growing phase looks strange, in that it does not grow so much.
This growing phase lasts 2h with dataretention=2h and also 2h with dataretention=3h. With dataretention=5m, the strange start also lasts something like 3h.
I see no link between this metric and the memory leak.
This is the main metric. This is why the issue was opened.
With dataretention=5m, the issue does not appear. It did not appear in @grobinson-grafana 's test. It does not appear here either. I guess that using dataretention=5m hides the bug.
With dataretention=2h we can see the Workingset size growing slowly until the "cliff". After the cliff, we do not have enough data to say whether it is a flat infinite step or whether it will keep growing. Maybe we should run this test again.
EDIT 02/02/2024 : with newer graphs on a larger time window, we see that the Workingset size continues growing. Why are there steps ? No idea.
With dataretention=3h we can clearly see the Workingset size growing slowly but continuously.
The bug is there, but we need a data retention >= 3h and a lot of time to see it.
But remember that we run Alertmanager on Kubernetes for months, and at the end the container is killed because of an OOM.
In this test with dataretention=3h, if the memory limit had been set to 2.25M the container would have been killed after 20h of running. Not before.
cc @simonpasquier (I don't know if you are still working on it or if you can do anything).
If the 3 comments were too long to read, please check my little lab to reproduce the bug : https://github.com/ngc104/alertmanager-issue-2659/tree/main and pay attention to the metric workingset size : this is the metric that causes Kubernetes (in reality, Linux) to OOM-kill Alertmanager.
The memory leak is still there
I'm running https://github.com/ngc104/alertmanager-issue-2659/tree/main over a day and a half.
I set the parameter --data.retention=3h. All other parameters are the default ones (except --cluster.listen-address="", but that does not matter).
During the whole test, including after 9h13, the same alert is renewed every minute with amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar.
The memory goes on growing and growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly.
The strange things:
Note : the alert is still set every minute with amtool alert add, even after I stopped adding silences.
The history of this issue is long but the issue has not changed since the beginning. It's still there in 2024 with Alertmanager-0.27.0. And the use-case with Kthxbye is still there, with OOM after one or two months running with this small memory leak.
I'm sorry but I still don't see where the memory leak is in your screenshots. This looks perfect to me? In the screenshot I can see that garbage is being created and then reclaimed by the garbage collector, and the graph itself is a flat line somewhere between 23MB and 24MB.
The memory goes on growing and growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly.
I don't see that in the screenshot above? This also wouldn't happen if there was a memory leak because if there was a memory leak then the memory wouldn't go down?
Can you help me understand?
Let's zoom in on the significant period (when Alertmanager is being harassed, without the first few minutes).
I need to use avg_over_time to see the trend. The 2nd graph clearly shows that it grows.
I cannot run my lab for months. But I have Alertmanagers in Production at work that run for months. They sometimes explode with an OOM.
I think the problem is the second graph. It's taking the average over 4 hours, which is skewed by the peaks before a garbage collection happens. The first graph, which has a much smaller window for avg_over_time, is what I expect to see. I would recommend showing a screenshot of the metric without avg_over_time, and then we can see for sure if there is a leak or not.
That looks good. There is no memory leak in this graph. If you want you can show it for the last 48 hours or 7 days and it should look the same.
These also look good. There is no memory leak in these graphs either!
So hard to reproduce and so hard to interpret this little window of time when the issue needs about one month in our production environment to generate an OOM...
I guess that next time I give it a try, I will generate 100 alerts and 100 silences instead of one. We will see if that changes anything.
Thanks for pointing out that graphs only reflect what we want to see. In fact, when I add xxx_over_time I see what I want to show, and without it, I feel the spikes hide the global trend.
It definitely needs a new try with more alerts and silences. Next time I have time... Stay tuned :)
I think it would be good to see the graphs for go_memstats_alloc_bytes and go_memstats_heap_objects. If there is a memory leak then both go_memstats_alloc_bytes and go_memstats_heap_objects should keep increasing over time.
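A simple way to sample those two series outside Prometheus is to scrape Alertmanager's /metrics endpoint directly (a sketch; adjust host/port to your setup):
# Print the two gauges suggested above, once per minute.
while true; do
  date
  curl -s http://localhost:9093/metrics | grep -E '^(go_memstats_alloc_bytes|go_memstats_heap_objects) '
  sleep 60
done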
This is our Production environment, with Alertmanager-0.25.0 and being harassed by kthxbye. Both are in cluster.
The spikes of go_memstats_heap_objects get higher and higher...
This is our Production environment, with Alertmanager-0.25.0 and being harassed by kthxbye. Both are in cluster.
In this graph, I presume the two lines are separate Alertmanagers. It's interesting that the yellow line is flat while the green line keeps growing. Is the green line the Alertmanager being tested?
Both are running in the same namespace : they are in cluster. And yes, I was surprised that 2 Alertmanagers running exactly the same way and clustered together do not have the same memory behaviour. I don't know how Alertmanager internals work, but I guess that silences may be on one Alertmanager and not the other. Both communicate to know if an alert is silenced. If I'm right, it would explain why one Alertmanager runs "flat" and the other one, with many silences, keeps "growing".
What did you do?
I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.
You can reproduce this without Kthxbye :
1/ Generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.
2/ With Alertmanager, generate silences like this :
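For example, a loop along these lines, mirroring the amtool commands from the reproduction procedure above, with the example alert name from step 1 (adjust the amtool path and Alertmanager URL to your setup):
# Re-silence the alert every minute, like kthxbye would, but more aggressively.
while true; do
  amtool --alertmanager.url=http://localhost:9093/ silence add -a "awesome user" -d 1m -c "awesome comment" alertname=PrometheusNotIngestingSamples
  sleep 60
done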
Note : the behaviour of Kthxbye is similar, but its default interval is 15 min instead of 1 min. However, with amtool you can see that Kthxbye has nothing to do with this bug.
What did you expect to see?
Nothing interesting (no abnormal memory increase)
What did you see instead? Under which circumstances?
Follow the metric container_memory_working_set_bytes for Alertmanager. After some hours you can see it slowly grow.
Here is a screenshot of the above test, for a little more than 12 hours : the test started at 12h20 and finished at 9h the day after.
My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days it would stop increasing. Wrong guess : it stops increasing only at OOM and automatic kill.
The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).
Environment
System information:
Kubernetes (deployed with https://github.com/prometheus-community/helm-charts/tree/main/charts/alertmanager)
Alertmanager version:
➜ k -n monitoring logs caascad-alertmanager-0
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13