prometheus / alertmanager
https://prometheus.io
Apache License 2.0

Memory leak when repeating silences #2659

Open ngc104 opened 3 years ago

ngc104 commented 3 years ago

What did you do?

I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.

You can reproduce this without Kthxbye:

1/ Generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.

2/ With Alertmanager, generate silences like this:

while true; do
  date # useless; just for tracing...
  amtool --alertmanager.url http://localhost:9093 silence add alertname=PrometheusNotIngestingSamples -a "MemoryLeakLover" -c "Test memory leak in Alertmanager" -d "1m"
  sleep 50
done

Note: the behaviour with Kthxbye is similar, but its default interval is 15 min instead of 1 min. Either way, the amtool loop above shows that Kthxbye itself has nothing to do with this bug.
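
(For reference, here is roughly what each iteration does at the HTTP level: amtool and Kthxbye both POST a brand-new silence to the v2 API. This is only a sketch; the timestamps are illustrative and the path assumes the default route prefix.)

curl -s -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
        "matchers": [{"name": "alertname", "value": "PrometheusNotIngestingSamples", "isRegex": false}],
        "startsAt": "2021-07-30T12:00:00Z",
        "endsAt":   "2021-07-30T12:01:00Z",
        "createdBy": "MemoryLeakLover",
        "comment": "Test memory leak in Alertmanager"
      }'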

What did you expect to see?

Nothing interesting (no abnormal memory increase)

What did you see instead? Under which circumstances?

Follow the metric container_memory_working_set_bytes for Alertmanager. After a few hours you can see it slowly grow.

Here is a screenshot of the above test, covering a little more than 12 hours: the test started at 12h20 and finished at 9h the next day.

[screenshot: container_memory_working_set_bytes for Alertmanager]

My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days it would stop increasing. Wrong guess: it only stops increasing at OOM, when the container is automatically killed.

[screenshot: memory usage across pod restarts]

The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).

Environment

/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4

* Logs:

➜ k -n monitoring logs caascad-alertmanager-0
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13

roidelapluie commented 3 years ago

Thanks for your report.

Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and upload it for us?
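
(For anyone following along, a minimal way to capture that profile, assuming the default listen address, is:)

curl -s http://localhost:9093/debug/pprof/heap > alertmanager-heap.pprof
# or interactively:
go tool pprof http://localhost:9093/debug/pprof/heap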

ngc104 commented 3 years ago

Hello, I'm on holiday starting today... Too bad...

However, I'm not alone on this issue. You should get a profile from someone else soon (give us a few hours to reproduce the bug again).

Regards

Winael commented 3 years ago

Hi @roidelapluie

I'm on @ngc104's team. During his holidays I will follow this issue. I tried to reproduce the bug yesterday, but unfortunately the pod was restarted before I was able to extract the profile and the heap. I'll try to provide you that information as soon as I can :)

Thx for your help

Regards,

Winael commented 3 years ago

Hi again @roidelapluie

I was able to reproduce the issue this morning.

[Screenshot: 2021-08-05 11-23-32]

Please find the profile here and heap here.

Let me know if you need more information.

Regards,

Winael commented 3 years ago

Hi guys

Any update on this case? Tell me if you need more information.

Regards,

Winael commented 3 years ago

Hi @roidelapluie

Can we have an update on this issue? Do you need more information?

Regards

Winael commented 3 years ago

Hi everyone

Would it be possible to have an update on this case? It could become critical on our side.

Regards,

Winael commented 3 years ago

Hi everyone, @roidelapluie

> Thanks for your report.
>
> Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and upload it for us?

I sent you the elements requested at the beginning of August. Could you help us on this case, please?

Regards Winael

roidelapluie commented 3 years ago

Hello,

Somehow the heap dump does not seem to help. I see only a few megabytes of memory used here. I will try to reproduce this on my side asap.

Winael commented 3 years ago

Hi @roidelapluie

Have you succeeded in reproducing this issue on your side? Do you have any information that could help us, please?

Regards, Winael

roidelapluie commented 3 years ago

I did not have the chance yet. Maybe you can come up with a pprof profile that shows the issue? The one that was sent previously was not really conclusive.

ngc104 commented 2 years ago

Edited on 13/06/2023 (typo and switch to 0.25.0). Edited on 29/08/2023 (switch to 0.26.0 + run alertmanager --version to show the version + added trace.sh).

Procedure to reproduce the bug

Get alertmanager

(yes, sorry, this is complex for just curling a tar.gz and uncompressing it)

ALERTMANAGER_VERSION="0.26.0"
ALERTMANAGER_ARCHIVE="alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
ALERTMANAGER_URL="https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/${ALERTMANAGER_ARCHIVE}"

# Download
[ -e "${ALERTMANAGER_ARCHIVE}" ] || curl -sL "${ALERTMANAGER_URL}" -o "${ALERTMANAGER_ARCHIVE}"
[ -d "alertmanager" ] || (tar xzf "${ALERTMANAGER_ARCHIVE}" ; mv "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64" "alertmanager")

Launch alertmanager

➜ cat <<EOF > alertmanager.yml 
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'mytest'
receivers:
  - name: 'mytest'
EOF
./alertmanager/alertmanager --version > version.txt
./alertmanager/alertmanager

Alertmanager is working: http://localhost:9093

Harass alertmanager with alerts and silences

while true; do 
  date; 
  ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar
  ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence add -a "awesome user" -d 1m -c "awesome comment" alertname=foo
  sleep 60
done

The first amtool call sends an alert to Alertmanager, just like Prometheus would.

The second amtool call sets a silence on that alert, just like Kthxbye would.

Difference with Prometheus/Kthxbye: we harass Alertmanager every minute.
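
(At the HTTP level, the alert call is roughly equivalent to the sketch below; the silence call is analogous to the earlier curl example. Field values are illustrative.)

curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "foo", "node": "bar"}}]'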

go tool pprof

DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"

go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs.pdf"

We can also play with go tool pprof -http localhost:12345 http://localhost:9093/debug/pprof/allocs for an interactive view.
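
(To highlight what grew between two snapshots rather than the absolute totals, two saved profiles can also be diffed; the file names below are hypothetical examples.)

go tool pprof -http localhost:12345 -base profile-old.pb.gz profile-new.pb.gz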

go tool pprof (since 29/08/2023)

➜ cat <<'EOF' > trace.sh

DATE=$(date +'%Y%m%d_%H%M%S')
REPORTDIR="profiles"
mkdir -p "${REPORTDIR}"
FILEPREFIX="${REPORTDIR}/profile-${DATE}"
export PPROF_TMPDIR="$(mktemp -d /tmp/pprof-XXXXXX)"
echo "Traces in '${PPROF_TMPDIR}'"

echo "=======================================" >> "${REPORTDIR}/report.txt"
echo "${DATE}" >> "${REPORTDIR}/report.txt"
./alertmanager/alertmanager --version >> "${REPORTDIR}/report.txt"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > /dev/null
mv "${PPROF_TMPDIR}/pprof.alertmanager.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz" "${FILEPREFIX}.pb.gz"
go tool pprof -sample_index=alloc_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=alloc_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"

rmdir "${PPROF_TMPDIR}"
EOF

Then run ./trace.sh from time to time.
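
(One way to automate the "from time to time" part, assuming an hourly snapshot is enough:)

while true; do ./trace.sh; sleep 3600; done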

ngc104 commented 2 years ago

pprof profiles.pdf

I'm back on this issue. I harassed Alertmanager for a workday.

Here are some PDFs (I don't know if it is the preferred format, sorry):

09h19: profile-20220113_091951-allocs.pdf
10h59: profile-20220113_105909-allocs.pdf
16h55: profile-20220113_165542-allocs.pdf

09h19: 4Mb (nodes accounting for 33245.19kB, 78.29% of 42466.72kB total)
10h59: 75Mb (nodes accounting for 56.61MB, 75.30% of 75.18MB total)
16h55: 173Mb (nodes accounting for 134.34MB, 77.30% of 173.80MB total)

Because I can now reproduce the bug easily, please tell me what kind of pprof traces you need to investigate.

ngc104 commented 2 years ago

Hello @roidelapluie, I'm back on this issue (replacing @Winael). If you can come back to it too (or anybody else can), please tell me how to take pprof traces if you need more info.

Note: I took heap traces too, all day long. On the PDFs, I can read Showing nodes accounting for 3609.71kB, 100% of 3609.71kB total at the beginning and at the end of the day. The heap seems to be stable.

prashbnair commented 2 years ago

I was going through the profiles and noticed that you are using alloc_space, which measures how much memory space has been allocated cumulatively. Can you use inuse_space and check whether the memory consumption is going up?
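
(A command along these lines selects the in-use view; this is a sketch assuming the default pprof endpoint.)

go tool pprof -sample_index=inuse_space -top http://localhost:9093/debug/pprof/heap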

ngc104 commented 2 years ago

@prashbnair I'm not sure I understand... You want me to run go tool pprof -sample_index=inuse_space instead of just go tool pprof?

I will try that ASAP.

Are PDF files the best output for you ?

ngc104 commented 2 years ago

Started taking the traces. They will be posted later today.

Here is my script to take traces:

#! /bin/bash

DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"

go tool pprof -sample_index=inuse_space -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_space.pdf"

go tool pprof -sample_index=inuse_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_objects.pdf"

go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_space.pdf"

go tool pprof -sample_index=alloc_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_objects.pdf"

ngc104 commented 2 years ago

pprof more profiles.pdf

I restarted my test (as described in a previous comment above).

I took some traces right after I started Alertmanager and my loop for the alert, at 09:19. I took other traces after lunch at 14:02. And I ended my test at 17:58.

09:19

14:02

17:58

I tried the 4 sample indexes I found for allocations. I hope that is what you wanted, @prashbnair.

cc @roidelapluie

prashbnair commented 2 years ago

@ngc104 Looking at inuse_space, I notice that it does not go up over time. Are you still able to reproduce the error?

ngc104 commented 2 years ago

@prashbnair Yes.

In fact, the main problem is Kubernetes killing the Alertmanager container when it goes OOM, because its "memory" goes higher than the resource limit.

What is the "memory" measured by Kubernetes when it kills the container? If this is not alloc_space, I don't know how to measure it with go pprof.

ngc104 commented 2 years ago

Restarted my test at 9h22 UTC. Here are some interesting results at 10h03 UTC...

➜ ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence query --expired
ID                                    Matchers         Ends At                  Created By    Comment          
2fca38e1-9dc3-4fbf-8f35-291dcac02a34  alertname="foo"  2022-05-02 09:22:11 UTC  awesome user  awesome comment  
8ab24f44-c779-4fb7-8ba5-da9061b635d0  alertname="foo"  2022-05-02 09:23:11 UTC  awesome user  awesome comment  
7347ceff-527c-4106-8dc9-861d8205b374  alertname="foo"  2022-05-02 09:24:11 UTC  awesome user  awesome comment  
26299122-2d9c-44dc-bc46-fa945379f4d7  alertname="foo"  2022-05-02 09:25:11 UTC  awesome user  awesome comment  
7d27178d-59ad-443f-81dd-e9351b6a9701  alertname="foo"  2022-05-02 09:26:11 UTC  awesome user  awesome comment  
0219f67a-507a-4b4a-99e4-bf8020fe6615  alertname="foo"  2022-05-02 09:27:11 UTC  awesome user  awesome comment  
73173814-25d6-4181-8de7-0677b8011fca  alertname="foo"  2022-05-02 09:28:11 UTC  awesome user  awesome comment  
82cf631a-07c3-4f90-8131-e7d4ade49d35  alertname="foo"  2022-05-02 09:29:11 UTC  awesome user  awesome comment  
37978ad4-9c1f-4ad7-a6af-d75e33baf75b  alertname="foo"  2022-05-02 09:30:11 UTC  awesome user  awesome comment  
dc17d617-ac31-460a-949f-4021fd945374  alertname="foo"  2022-05-02 09:31:11 UTC  awesome user  awesome comment  
41f5a1b8-ce3a-4a85-a118-672b0dce2267  alertname="foo"  2022-05-02 09:32:11 UTC  awesome user  awesome comment  
d0e8b085-fe67-49dc-9ee2-96e6161a7d0d  alertname="foo"  2022-05-02 09:33:12 UTC  awesome user  awesome comment  
222bae9c-7b07-4a80-997b-06867411d6c8  alertname="foo"  2022-05-02 09:34:12 UTC  awesome user  awesome comment  
025cd910-b437-4455-b925-2744f7a5e027  alertname="foo"  2022-05-02 09:35:12 UTC  awesome user  awesome comment  
0f6dcfc7-8320-4bfb-aab5-fc79fe661546  alertname="foo"  2022-05-02 09:36:12 UTC  awesome user  awesome comment  
3ca66616-f985-4e7c-b60f-7526fa0e7ae6  alertname="foo"  2022-05-02 09:37:12 UTC  awesome user  awesome comment  
0c2ef5b5-5e82-4823-8453-e223f18289a7  alertname="foo"  2022-05-02 09:38:12 UTC  awesome user  awesome comment  
7abe6854-ab5e-4fc8-bd6e-a7495ef8b366  alertname="foo"  2022-05-02 09:39:12 UTC  awesome user  awesome comment  
41a55f9a-b330-494f-a842-8e97e740fb27  alertname="foo"  2022-05-02 09:40:12 UTC  awesome user  awesome comment  
8c361742-e1c0-44c3-8d7b-530771335458  alertname="foo"  2022-05-02 09:41:12 UTC  awesome user  awesome comment  
e95a41c6-91f8-45e2-9bcb-aa21d32fb60e  alertname="foo"  2022-05-02 09:42:12 UTC  awesome user  awesome comment  
6016bdfa-2638-47eb-81b6-502f2317a29f  alertname="foo"  2022-05-02 09:43:12 UTC  awesome user  awesome comment  
11bb8940-1865-4512-8dd2-248535f02997  alertname="foo"  2022-05-02 09:44:12 UTC  awesome user  awesome comment  
51853d4f-a9cc-450c-bc4c-84d40975be83  alertname="foo"  2022-05-02 09:45:12 UTC  awesome user  awesome comment  
200dcc2e-5b5d-4ed8-80cd-3b87c93d35f6  alertname="foo"  2022-05-02 09:46:12 UTC  awesome user  awesome comment  
fec736cf-ff1d-4a03-95e9-9615b834eba6  alertname="foo"  2022-05-02 09:47:12 UTC  awesome user  awesome comment  
1aac8602-7c30-4eda-a37b-a8e54bcf6252  alertname="foo"  2022-05-02 09:48:12 UTC  awesome user  awesome comment  
4b3abc68-5621-400f-9c56-65b05980c568  alertname="foo"  2022-05-02 09:49:12 UTC  awesome user  awesome comment  
1a3276b4-1f61-4df4-8baf-a98906b972f5  alertname="foo"  2022-05-02 09:50:12 UTC  awesome user  awesome comment  
b06f383d-a4e5-45ae-bf09-0b7cb32a2665  alertname="foo"  2022-05-02 09:51:12 UTC  awesome user  awesome comment  
9e6a8329-e395-4578-909e-111f22c9cb8f  alertname="foo"  2022-05-02 09:52:13 UTC  awesome user  awesome comment  
ebcd1791-cb78-48d6-bd46-db768cb42476  alertname="foo"  2022-05-02 09:53:13 UTC  awesome user  awesome comment  
c0f29585-db7a-4560-90d2-825fc99b819a  alertname="foo"  2022-05-02 09:54:13 UTC  awesome user  awesome comment  
0efb0aca-4471-41d7-8daf-a9be279bac23  alertname="foo"  2022-05-02 09:55:13 UTC  awesome user  awesome comment  
a6d2db55-c5c8-4e6e-9eed-e1ec87da51e5  alertname="foo"  2022-05-02 09:56:13 UTC  awesome user  awesome comment  
90bc593e-0518-479a-8c02-4072bc0c4421  alertname="foo"  2022-05-02 09:57:13 UTC  awesome user  awesome comment  
49c4bc3b-57a3-4d6e-8bd8-8166addb18fa  alertname="foo"  2022-05-02 09:58:13 UTC  awesome user  awesome comment  
971734d2-c8ad-40bd-a27c-95c592d65efc  alertname="foo"  2022-05-02 09:59:13 UTC  awesome user  awesome comment  
c773ba25-af00-4045-9a70-b63e75ecd5a8  alertname="foo"  2022-05-02 10:00:13 UTC  awesome user  awesome comment  
fb8f23e9-32ad-4ab4-b449-cf86e34c5917  alertname="foo"  2022-05-02 10:01:13 UTC  awesome user  awesome comment  
4e95d124-d052-403d-b5e6-86d0acc22c17  alertname="foo"  2022-05-02 10:02:13 UTC  awesome user  awesome comment  

Could the "memory leak" be something else than a memory leak ? Could it be the sum of expired silences not being flushed ?

ngc104 commented 2 years ago

My guess was wrong on this one... I was exploring data.retention in the production environment. We have data.retention unset (i.e. using the default value 120h / 5 days).

With amtool, I notice that I have a lot of silences, but none older than 5 days. There are no alerts except the 2 or 3 active and suppressed ones (3 alerts won't create a memory leak, right?).

When I restart Alertmanager, I expect the memory consumption to grow continuously for 5 days and then become flat. But I observe that it continues to grow after 5 days. So my guess was wrong: this is not the accumulation of expired silences waiting to be flushed.

I'm thinking again about a memory leak. But no idea on how to debug it.

ngc104 commented 1 year ago

Hello and happy new year !

Alertmanager 0.25.0 is released. Let's give it a try... Same procedure as above (in a previous comment).

12h37: profile-20230106_155103-allocs.pdf
14h06: profile-20230106_140612-allocs.pdf
16h55: profile-20230106_123718-allocs.pdf

12h37: 14Mb (nodes accounting for 14467.52kB, 100% of 14467.52kB total)
14h06: 43Mb (nodes accounting for 27635.94kB, 63.32% of 43644.80kB total)
16h55: 66Mb (nodes accounting for 43.69MB, 65.99% of 66.21MB total)

Nothing seems to have changed in 0.25.0 related to this issue.

ngc104 commented 1 year ago

Hello,

I had not noticed that I was running a standalone Alertmanager in cluster mode. In order to remove any noise from the investigation, I retried with this command line:

./alertmanager/alertmanager --cluster.listen-address=""

Then I took some traces as usual :

10h56: profile-20230113_105632-allocs.pdf
11h34: profile-20230113_113413-allocs.pdf
14h01: profile-20230113_140152-allocs.pdf
14h44: profile-20230113_144421-allocs.pdf
18h58: profile-20230113_185807-allocs.pdf

10h56: 15Mb (nodes accounting for 15216.73kB, 100% of 15216.73kB total)
11h34: 23Mb (nodes accounting for 17373.59kB, 73.87% of 23519.01kB total)
14h01: 47Mb (nodes accounting for 34402.62kB, 71.93% of 47825.89kB total)
14h44: 56Mb (nodes accounting for 40.78MB, 72.40% of 56.33MB total)
18h58: 99Mb (Showing nodes accounting for 72MB, 72.58% of 99.20MB total)

I hope these traces are easier to use without the noise of the data structures needed for the absent cluster.

simonpasquier commented 1 year ago

:wave: it would help if you could share the raw pprof files.

ngc104 commented 1 year ago

Hello,

Sorry for the delay... Too much work besides this incident. Also, I'm not sure that the "raw" pprof files are simply obtained with the -raw option instead of -pdf. Please give me more info if the following does not fit.

And I forgot to start Alertmanager with --cluster.listen-address="". So this one was a single-node cluster...

Here are the logs : profiles.zip

Here is my script to generate them :

#! /bin/bash

DATE=$(date +'%Y%m%d_%H%M%S')
mkdir -p profiles
FILEPREFIX="profiles/profile-${DATE}"

go tool pprof -sample_index=inuse_space -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_space.raw"
go tool pprof -sample_index=inuse_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_objects.raw"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_space.raw"
go tool pprof -sample_index=alloc_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_objects.raw"

Edit: I also uploaded the binary profile files (I have no idea how they work, but I guess you like them too): pprof.zip

cc @simonpasquier

simonpasquier commented 1 year ago

I missed the fact that you're using version 0.21.0. Can you try with v0.25.0?

ngc104 commented 1 year ago

All tests were done with the latest version at the time.

In 2021, when I created the issue, it was tested with 0.21.0. The latest test in 2023 (June 13th) was with 0.25.0.

Every time a new version was released, I tested again in the hope that it was solved. But with 0.25 it is not solved yet.

ngc104 commented 1 year ago

Hello,

Alertmanager 0.26 is released :partying_face:

The bug is still there :bomb:

I modified my comment above with the procedure to reproduce the bug (I changed the way to run go tool pprof).

And here are the new profiles.zip files.

Below is an extract of the report.txt file that I added to the zip file:

alertmanager, version 0.26.0 (branch: HEAD, revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)
  build user:       root@df8d7debeef4
  build date:       20230824-11:11:58
  go version:       go1.20.7
  platform:         linux/amd64
  tags:             netgo
...
=======================================
20230829_114408
Showing nodes accounting for 155789, 99.48% of 156600 total
Showing nodes accounting for 34175.86kB, 100% of 34175.86kB total
=======================================
20230829_124408
Showing nodes accounting for 293671, 98.51% of 298100 total
Showing nodes accounting for 52.35MB, 100% of 52.35MB total
=======================================
20230829_134408
Showing nodes accounting for 376318, 96.80% of 388746 total
Showing nodes accounting for 63.89MB, 100% of 63.89MB total
=======================================
20230829_144408
Showing nodes accounting for 432682, 95.95% of 450947 total
Showing nodes accounting for 81.63MB, 100% of 81.63MB total
=======================================
20230829_154408
Showing nodes accounting for 492675, 94.88% of 519250 total
Showing nodes accounting for 94.59MB, 100% of 94.59MB total
=======================================
20230829_164408
Showing nodes accounting for 666111, 94.82% of 702485 total
Showing nodes accounting for 98.54MB, 86.75% of 113.58MB total
=======================================
20230829_174409
Showing nodes accounting for 734087, 95.01% of 772671 total
Showing nodes accounting for 119.25MB, 89.80% of 132.79MB total
=======================================
20230829_184409
Showing nodes accounting for 819591, 94.48% of 867496 total
Showing nodes accounting for 136.83MB, 91.30% of 149.87MB total

As you can see something is growing... (dates are in format YYYYMMDD_HHMMSS)

cc @simonpasquier

grobinson-grafana commented 1 year ago

Hi! 👋 I attempted to reproduce this on Alertmanager 0.26.0 using the same script but so far I have not been able to. I ran the script for 1h:

#!/bin/sh

while true; do
  ./amtool --alertmanager.url=http://127.0.0.1:9093 alert add foo=bar
  ./amtool --alertmanager.url=http://127.0.0.1:9093 silence add foo=bar -c="" -d 1m
  sleep 5
done

Here is go_memstats_heap_objects and go_memstats_heap_inuse_bytes:

[Screenshots: 2023-11-19 1:05:30 pm, 1:06:03 pm]

grobinson-grafana commented 1 year ago

To see if it was related to data retention not deleting expired silences I ran Alertmanager with --data.retention=5m.

ngc104 commented 9 months ago

How to reproduce

I was fed up with rebuilding my test env each time I wanted to check whether the bug was still there. I also thought it would be good to make it easier for everyone to build it. So I wrote some scripts: https://github.com/ngc104/alertmanager-issue-2659

TL;DR:

./buid_env.sh
./run.sh

But the README.md file is quite short. You can read it.

How to measure the leak

I can see the bug in a Kubernetes environment with the container_memory_working_set_bytes metric. But what is that metric?

In short, this metric is close to the RSS of a process and comes from Linux cgroups.

This is the metric used by Kubernetes to detect OOMs and kill containers. For that specific reason, it is important to use exactly that metric and not other similar ones.

After a lot of reading, I decided to measure it myself directly from the cgroup files.

My script is 25 lines long and should work on most Linux systems (cgroup v1 and v2)... Not so hard to understand and fix if it does not work for you.
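
(Below is a minimal sketch of the idea, not the exact 25-line script, assuming cadvisor's definition: working set = cgroup memory usage minus the inactive file cache.)

#! /bin/bash
# Approximate container_memory_working_set_bytes for a local alertmanager process.
PID=$(pgrep -f alertmanager | head -n1)
CGPATH=$(awk -F: '$1=="0"{print $3}' "/proc/${PID}/cgroup")          # cgroup v2 path
if [ -n "${CGPATH}" ] && [ -r "/sys/fs/cgroup${CGPATH}/memory.current" ]; then
  usage=$(cat "/sys/fs/cgroup${CGPATH}/memory.current")
  inactive=$(awk '$1=="inactive_file"{print $2}' "/sys/fs/cgroup${CGPATH}/memory.stat")
else
  CGPATH=$(awk -F: '$2=="memory"{print $3}' "/proc/${PID}/cgroup")   # cgroup v1 path
  usage=$(cat "/sys/fs/cgroup/memory${CGPATH}/memory.usage_in_bytes")
  inactive=$(awk '$1=="total_inactive_file"{print $2}' "/sys/fs/cgroup/memory${CGPATH}/memory.stat")
fi
echo "workingset_bytes $((usage - inactive))"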

How to show the leak

Thanks to your graphs @grobinson-grafana, I thought I could provide graphs too, and let anyone produce their own graphs easily.

So my test env is built from:

Now it's easy to run this stack and provide graphs.

ngc104 commented 9 months ago

Some graphs with the stack above

The stack was launched for several hours with different data retention settings.

Alertmanager version : 0.26.0

Data retention 2h

Start: 10h35, End: 21h10, Duration: 10h35

[Screenshots: 2024-01-29 22-36-00, 22-35-43, 22-35-23]

EDIT 02/02/2024: we had more metrics than we thought. Here are new and larger graphs (retention 2h).

Start: 10h35, End: 10h27, Duration: 23h08

[screenshots]

Data retention 3h

Start: 10h50, End: 10h20, Duration: 23h30

[Screenshots: 2024-01-31 10-17-15, 10-19-16, 10-19-52]

Data retention 5m

Start: 11h40, End: 21h00, Duration: 9h20

Note: this is the retention that @grobinson-grafana used for the test two and a half months ago.

[Screenshots: 2024-01-31 21-01-14, 21-01-00, 21-01-43]

ngc104 commented 9 months ago

Analysing the graphs above

Note: the metrics have been aggregated with avg_over_time(...[15m]) to remove spikes and see the trends better.
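
(For reference, the kind of query behind these graphs; the Prometheus URL and label selector are assumptions to adapt to your own stack.)

curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(container_memory_working_set_bytes{container="alertmanager"}[15m])'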

Number of expired silences

The number of expired silences is mostly the same across tests (except at the beginning, of course).

The exception is dataretention=5m: why does it increase so much 2 hours after the beginning and then stabilize 1h30 later? Well, that is not the subject of this issue. Let's just note it and then forget it for a while.

memstat_heap_objects

This metric comes from alertmanager itself.

The number of heap objects looks stable (after the growth phase at the beginning).

Notice that with dataretention=5m the growth phase looks strange, with less growth than expected.

This growth phase is 2h with dataretention=2h and is also 2h with dataretention=3h. With dataretention=5m, the strange start also lasts something like 3h.

I see no link between this metric and the memory leak.

Workingset size

This is the main metric. This is why the issue was opened.

With dataretention=5m, the issue does not appear. It did not appear in @grobinson-grafana's test, and it does not appear here either. I guess that using dataretention=5m hides the bug.

With dataretention=2h we can see the working set size growing slowly until the "cliff". After the cliff, we do not have enough data to say whether it is a flat plateau or whether it will keep growing. Maybe we should run this test again.

EDIT 02/02/2024: with newer graphs over a larger time window, we see that the working set size continues growing. Why are there steps? No idea.

With dataretention=3h we can clearly see the working set size growing slowly but continuously.

The bug is there. But we need a data retention >= 3h and a lot of time to see it.

But remember that we run Alertmanager on Kubernetes for months and in the end, the container is killed because of an OOM.

In this test with dataretention=3h, if the memory limit had been set to 2.25M, the container would have been killed after 20h of running, not before.

ngc104 commented 9 months ago

cc @simonpasquier (I don't know if you are still working on this or if you can do anything).

If the 3 comments above were too long to read, please check my little lab to reproduce the bug: https://github.com/ngc104/alertmanager-issue-2659/tree/main and pay attention to the working set size metric: this is the metric that causes Kubernetes (in reality, Linux) to OOM-kill Alertmanager.

ngc104 commented 7 months ago

Alertmanager-0.27

The memory leak is still there.

I ran https://github.com/ngc104/alertmanager-issue-2659/tree/main for a day and a half.

[screenshots]

I set the parameter --data.retention=3h. All other parameters are default ones (except --cluster.listen-address="" but that does not matter).

During all the test, including after 9h13, the same alert is renewed every one minute with amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar.

The memory keeps growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly.

The strange things:

  1. after 3 hours I should reach some maximum value. But I don't. This is the issue, still there.
  2. when I stop harassing Alertmanager, the memory goes down (which is normal: old silences are dropped and there are no newer silences to keep in memory). But after 3 hours the memory is still going down. I expected that after 3 hours I would reach a new (and lower) maximum. But it still goes down!

Note: the alert is still set every minute with amtool alert add, even after I stopped adding silences.

The history of this issue is long, but the issue has not changed since the beginning. It's still there in 2024 with Alertmanager 0.27.0. And the use case with Kthxbye is still there, with an OOM after one or two months of running with this small memory leak.

grobinson-grafana commented 7 months ago

I'm sorry but I still don't see where the memory leak is in your screenshots. This looks perfect to me. In the screenshot I can see that garbage is being created and then reclaimed by the garbage collector, and the graph itself is a flat line somewhere between 23MB and 24MB.

[screenshot]

> The memory keeps growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly.

I don't see that in the screenshot above. Also, this wouldn't happen if there were a memory leak, because with a leak the memory wouldn't go down, would it?

Can you help me understand?

ngc104 commented 7 months ago

Let's zoom in on the significant period (when Alertmanager is being harassed, excluding the first few minutes).

[screenshot]

[screenshot]

I need to use avg_over_time to see the trend. The second graph clearly shows that it grows.

I cannot run my lab for months. But at work I have Alertmanagers in production that run for months, and they sometimes explode with an OOM.

grobinson-grafana commented 7 months ago

I think the problem is the second graph. It's taking the average over 4 hours, which is skewed by the peaks before a garbage collection happens. The first graph, which has a much smaller window for avg_over_time, is what I expect to see. I would recommend showing a screenshot of the metric without avg_over_time, and then we can see for sure whether there is a leak or not.

ngc104 commented 7 months ago

[screenshot]

grobinson-grafana commented 7 months ago

That looks good. There is no memory leak in this graph. If you want you can show it for the last 48 hours or 7 days and it should look the same.

ngc104 commented 7 months ago

[screenshot]

[screenshot]

grobinson-grafana commented 7 months ago

These also look good. There is no memory leak in these graphs either!

ngc104 commented 7 months ago

So hard to reproduce, and so hard to interpret this little window of time, when the issue needs about one month in our production environment to generate an OOM...

I guess that next time I give it a try, I will generate 100 alerts and 100 silences instead of one. We will see if that changes anything.

Thanks for pointing out that graphs only reflect what we want to see. In fact, when I add xxx_over_time I see what I want to show, and without it, I consider that the spikes hide the global trend.

It definitely needs a new try with more alerts and silences. Next time I have time... Stay tuned :)

grobinson-grafana commented 7 months ago

I think it would be good to see the graphs for go_memstats_alloc_bytes and go_memstats_heap_objects. If there is a memory leak then both go_memstats_alloc_bytes and go_memstats_heap_objects should keep increasing over time.
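
(A quick way to watch both metrics without Grafana, straight from Alertmanager's /metrics endpoint; the URL assumes the local default.)

while true; do
  date
  curl -s http://localhost:9093/metrics | grep -E '^(go_memstats_alloc_bytes|go_memstats_heap_objects) '
  sleep 60
done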

ngc104 commented 7 months ago

[screenshot]

This is our production environment, with Alertmanager 0.25.0 being harassed by kthxbye. Both instances are in a cluster.

ngc104 commented 7 months ago

[screenshots]

ngc104 commented 7 months ago

The spikes of go_memstats_heap_objects are getting higher and higher...

grobinson-grafana commented 7 months ago

[screenshot]

> This is our production environment, with Alertmanager 0.25.0 being harassed by kthxbye. Both instances are in a cluster.

In this graph, I presume the two lines are separate Alertmanagers. It's interesting that the yellow line is flat while the green line keeps growing. Is the green line the Alertmanager being tested?

ngc104 commented 7 months ago

Both are running in the same namespace: they are in a cluster. And yes, I was surprised that 2 Alertmanagers running exactly the same way and clustered together do not have the same memory behaviour. I don't know how Alertmanager internals work, but I guess that silences may be stored on one Alertmanager and not the other. Both communicate to know whether an alert is silenced. If I'm right, that would explain why one Alertmanager runs "flat" while the other one, with many silences, keeps "growing".