I'm also seeing this, also on a 4GB server. It seems to happen periodically (no pattern that I can detect), making the server essentially unresponsive for 8-14 minutes at a time.
I mean... we can try to give it a memory limit. Or remove signatures. Limiting resources via compose may introduce new fancy problems on some systems.
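For illustration, a minimal sketch of what such a compose-level limit could look like, assuming the service keeps its default clamd-mailcow name and the compose file format in use still honours mem_limit (a hypothetical override, not the shipped configuration):

# docker-compose.override.yml (hypothetical)
version: '2.1'
services:
  clamd-mailcow:
    mem_limit: 1536m      # hard cap; OOM kills then happen inside this container instead of taking out other host processes
    memswap_limit: 2048m  # cap memory + swap as well

Whether those numbers are sensible depends entirely on how many signatures are loaded, hence the hesitation above.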
Any ideas anybody?
@mkuron ?
If it helps, this is new - it only seems to be an issue with the latest version of the clamav container.
We just updated to 0.103.0, it is possible this version has a higher need for memory.
Ah! This from the release notes for 0.103:
" clamd can now reload the signature database without blocking scanning. This multi-threaded database reload improvement was made possible thanks to a community effort. Non-blocking database reloads are now the default behavior. Some systems that are more constrained on RAM may need to disable non-blocking reloads, as it will temporarily consume double the amount of memory. We added a new clamd config option ConcurrentDatabaseReload, which may be set to no."
Good catch. :)
I wonder if there is a way to expose that ConcurrentDatabaseReload setting as an option in mailcow.conf?
I set it to no by default for now and will add it to the docs. All options can be set via data/conf/clamav/clamd.conf.
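For anyone checking their own install, the relevant line in data/conf/clamav/clamd.conf is just the option named in the 0.103 release notes, e.g.:

# data/conf/clamav/clamd.conf
ConcurrentDatabaseReload no

A restart of the container (docker-compose restart clamd-mailcow) should be enough for clamd to pick it up.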
Perfect! Thank you very much indeed. Will drop another donation in the morning :)
I thank you. :)
I set it to no by default for now and will add it to the docs. All options can be set via data/conf/clamav/clamd.conf.
Just seen this by coincidence. I would prefer managing any kind of config via mailcow.conf if possible. It's a good approach to bundle config settings in one file, instead of having to edit several files lying in "unknown" locations. It would make managing things easier.
We cannot handle every single config there.
We use git for this reason. It will not kill your changes as long as we didn't change it either. If we did, we need to overwrite it for compatibility.
All config files share the same location by the way: data/conf
My ClamAV is also running OOM when updating, even with ConcurrentDatabaseReload set to NO; this has been happening for about a week now, and it just happened 10 minutes ago.
How much RAM?
4GB. I know it's not a lot but it has been working fine for 2 years, until now =)
Silly question but I noticed you say it's set to NO. Does your clamd.conf say "ConcurrentDatabaseReload no" or "ConcurrentDatabaseReload NO" ? IIRC clamd.conf directives are case sensitive.
Yes it's in lower case. It was added by a recent commit
ConcurrentDatabaseReload no
"no" is fine.
You can try to decrease SOGo worker count to 10.
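A hedged sketch of where that is usually changed, assuming the worker count lives in data/conf/sogo/sogo.conf under the WOWorkersCount directive (the location and directive name are an assumption here; verify against your own file):

/* data/conf/sogo/sogo.conf (assumed location and directive) */
WOWorkersCount = 10;

followed by restarting the SOGo container, e.g. docker-compose restart sogo-mailcow.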
This has been fine since the ConcurrentDatabaseReload change but just bombed again this evening
clamd-mailcow_1 | receiving incremental file list
clamd-mailcow_1 | ./
clamd-mailcow_1 | blurl.ndb
clamd-mailcow_1 | jurlbl.ndb
clamd-mailcow_1 | phishtank.ndb
clamd-mailcow_1 | rogue.hdb
clamd-mailcow_1 |
clamd-mailcow_1 | sent 23,606 bytes received 314,924 bytes 225,686.67 bytes/sec
clamd-mailcow_1 | total size is 18,879,877 speedup is 55.77
clamd-mailcow_1 | RELOADING
clamd-mailcow_1 | Wed Oct 7 17:16:27 2020 -> Reading databases from /var/lib/clamav
clamd-mailcow_1 | Wed Oct 7 17:17:28 2020 -> Database correctly reloaded (9073104 signatures)
clamd-mailcow_1 | Wed Oct 7 17:17:28 2020 -> Database reload completed.
clamd-mailcow_1 | Wed Oct 7 17:17:28 2020 -> Activating the newly loaded database...
clamd-mailcow_1 | Wed Oct 7 17:17:28 2020 -> instream(local): OK
clamd-mailcow_1 | Wed Oct 7 17:17:28 2020 -> instream(172.22.1.13@44646): OK
clamd-mailcow_1 | Wed Oct 7 17:17:28 2020 -> instream(local): OK
clamd-mailcow_1 | Wed Oct 7 17:19:09 2020 -> instream(172.22.1.13@45048): OK
clamd-mailcow_1 | Wed Oct 7 17:24:06 2020 -> instream(local): OK
clamd-mailcow_1 | Wed Oct 7 17:27:21 2020 -> ClamAV update process started at Wed Oct 7 17:27:21 2020
clamd-mailcow_1 | Wed Oct 7 17:27:26 2020 -> daily database available for update (local version: 25949, remote version: 25950)
clamd-mailcow_1 | Wed Oct 7 17:27:59 2020 -> Testing database: '/var/lib/clamav/tmp.c550237f77/clamav-fa1482832880b3b414a882962cbfb28f.tmp-daily.cld' ...
clamd-mailcow_1 | /clamd.sh: line 97: 23 Killed nice -n10 clamd
clamd-mailcow_1 | /clamd.sh: line 98: kill: (23) - No such process
clamd-mailcow_1 | Worker 23 died, stopping container waiting for respawn...
clamd-mailcow_1 | Cleaning up tmp files...
clamd-mailcow_1 | Copying non-empty whitelist.ign2 to /var/lib/clamav/whitelist.ign2
clamd-mailcow_1 | File: /var/lib/clamav/whitelist.ign2
clamd-mailcow_1 | Size: 142 Blocks: 8 IO Block: 4096 regular file
clamd-mailcow_1 | Device: 5fh/95d Inode: 1048148 Links: 1
clamd-mailcow_1 | Access: (0644/-rw-r--r--) Uid: ( 700/ clamav) Gid: ( 700/ clamav)
clamd-mailcow_1 | Access: 2020-10-07 17:16:27.404000000 +0000
clamd-mailcow_1 | Modify: 2020-10-07 18:12:30.404000000 +0000
clamd-mailcow_1 | Change: 2020-10-07 18:12:30.460000000 +0000
clamd-mailcow_1 | Birth: -
clamd-mailcow_1 | dos2unix: converting file /var/lib/clamav/whitelist.ign2 to Unix format...
clamd-mailcow_1 | Running freshclam...
clamd-mailcow_1 | Wed Oct 7 18:12:30 2020 -> ClamAV update process started at Wed Oct 7 18:12:30 2020
clamd-mailcow_1 | Wed Oct 7 18:12:31 2020 -> daily database available for update (local version: 25949, remote version: 25950)
clamd-mailcow_1 | Wed Oct 7 18:12:48 2020 -> Testing database: '/var/lib/clamav/tmp.46a08b2ade/clamav-b81f532048bf594a68b1079705518bf7.tmp-daily.cld' ...
clamd-mailcow_1 | Wed Oct 7 18:13:19 2020 -> Database test passed.
clamd-mailcow_1 | Wed Oct 7 18:13:19 2020 -> daily.cld updated (version: 25950, sigs: 4328320, f-level: 63, builder: raynman)
clamd-mailcow_1 | Wed Oct 7 18:13:19 2020 -> main.cvd database is up to date (version: 59, sigs: 4564902, f-level: 60, builder: sigmgr)
clamd-mailcow_1 | Wed Oct 7 18:13:19 2020 -> bytecode.cvd database is up to date (version: 331, sigs: 94, f-level: 63, builder: anvilleg)
clamd-mailcow_1 | Wed Oct 7 18:13:19 2020 -> ^Clamd was NOT notified: Can't connect to clamd through /run/clamav/clamd.sock: Connection refused
I can decrease SOGo workers as you've recommended, but I don't actually have any users using it, which I guess would make a difference?
Could you run docker stats every minute and check whether the memory size of the clamd container (or any other container) grows significantly over time? clamd is probably the process with the largest memory usage on your server, so the OOM killer kills it even if it's not the culprit.
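A minimal shell sketch for that kind of sampling, e.g. run in a screen/tmux session, with a log path of your choice:

# log per-container memory usage once a minute
while true; do
  date >> /root/docker-stats.log
  docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}' >> /root/docker-stats.log
  sleep 60
done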
They can still eat some RAM when you switch between these workers with each new request. Please try docker stats as mkuron suggested and also reduce the worker count. :)
Same issue here, even with ConcurrentDatabaseReload set to no. Every few hours it'll freeze for a minute or two.
Up until this point, mailcow has been running great for over a year on this server.
Clamd log: https://pastebin.com/TBjBUPn9
So? I cannot change that. If you want to keep using ClamAV, you need more RAM. 👍
Or reduce the SOGo workers. I cannot change that I'm afraid. :/
We will update the requirements.
:(
I get it of course.. Just hoped you'd have a solution for me :)
Thanks anyway, I'll upgrade the RAM of my server.
Or at least try with less workers in SOGo first. :)
Could you run docker stats every minute and check whether the memory size of the clamd container (or any other container) grows significantly over time? clamd is probably the process with the largest memory usage on your server, so the OOM killer kills it even if it's not the culprit.
I disabled the clamav container and have been watching the others since your post. It looks like on my single system over a working week solr grows slowly, but only by about 100MiB from where it starts (350-450). It looks like rspamd spikes quite severely at times from a resting ~250; I think about 650 is the highest I've seen it. Redis also seems to be capable of varying by a few hundred MiB, presumably depending on what's going on at the time, but nothing has an obvious memory leak.
I'll have to look at something to do this monitoring more scientifically over a longer period and produce some graphs, but it seems like the issue may indeed just be clamd plus other things happening to use more memory at the same time, down to random usage.
For now I'll reduce the SOGo workers and re-enable clamd and keep an eye on it.
Just hoped you'd have a solution for me
Unfortunately, ClamAV is quite a memory hog because it loads all its virus definitions into memory, and those obviously get larger with every update. You'll need to reduce the set of virus definitions to reduce memory usage. Or reconsider whether you actually need a virus scanner: we block .exe attachments and MS Office documents with macros, which should already take care of most virus distribution vectors.
It looks like on my single system over a working week solr grows slowly, but only by about 100MiB from where it starts (350-450).
SOLR needs quite a lot of memory, depending on how many messages you have. It is recommended to be kept disabled unless you have a lot of memory and very few users.
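For reference, disabling it is normally a one-line change in mailcow.conf followed by recreating the containers; a sketch assuming the SKIP_SOLR switch present in current mailcow versions (check your own file):

# mailcow.conf
SKIP_SOLR=y

# apply the change (a plain restart does not pick up new environment values)
docker-compose up -d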
It looks like rspamd spikes quite severely at times from a resting ~250, I think about 650 is the highest I've seen it.
Rspamd uses Lua, which is garbage-collected. You can reduce the garbage collection timeout (https://github.com/mailcow/mailcow-dockerized/issues/3049#issuecomment-548012475) to keep its memory usage more constant.
Redis also seems to be capable of varying by a few hundred MiB, presumably depending on what's going on at the time,
Redis is an in-memory database that periodically dumps its state to a file. Its memory footprint probably grows as it accumulates transactions between dumps, but I have not seen it consume an unreasonable amount of memory.
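If you want to double-check that on a live system, Redis can report its own footprint; a sketch assuming the default redis-mailcow container name:

docker exec -it $(docker ps -qf name=redis-mailcow) redis-cli INFO memory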
I run my server in AWS with 2GB of RAM, since I'm running this for personal sites and some friends' domains and it's not very high traffic. However, I did create a 4GB swap file and have had no issues... Not saying that's suitable for everyone, but it may be an option for you if it's not a high-traffic server.
My server does need to use the swap file, and I see no reason to pay for more memory:
root@mail:/opt/mailcow-dockerized# free -m
              total        used        free      shared  buff/cache   available
Mem:           1949        1342         115           9         491         444
Swap:          4095        2087        2008
@jjkondrat And you are suffering from the same issue "clamd getting oom-killed", so the swap file does not help?
@Adorfer No and I have never had any memory problems on any of the containers using the large swap file. I've been using a large swap file since I built my server well over a year ago.
So what is your point in posting to this thread? "Adding swap may resolve the issue"?
Yes. Although more memory would be better, defining or increasing the swap file size may let the user avoid the crash.
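For completeness, a minimal sketch of creating a 4GB swap file on a typical Linux host (standard commands, nothing mailcow-specific; adjust size and path to taste):

# create and enable a 4GB swap file
fallocate -l 4G /swapfile    # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# make it persistent across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab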
So I already disabled clamd in mailcow.conf, but I'm still getting OOM messages that seem related to clamd.
I am running mailcow inside an ESXi VM with 2 CPUs & 4GB RAM.
To me it feels like clamd is still running even though it is disabled via mailcow.conf (the last restart of mailcow and the server was about 14 days ago).
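For comparison, this is roughly what the disable switch and its application are expected to look like, assuming the SKIP_CLAMD option in mailcow.conf (a hedged sketch; details may differ between versions):

# mailcow.conf
SKIP_CLAMD=y

# environment changes only take effect when the containers are recreated,
# not on a simple restart:
docker-compose up -d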
Example docker-compose log entries from the restart:
clamd-mailcow_1 | Mon Dec 7 09:38:54 2020 -> instream(172.22.1.10@43964): OK
clamd-mailcow_1 | Mon Dec 7 09:47:45 2020 -> instream(local): OK
clamd-mailcow_1 | Mon Dec 7 09:49:22 2020 -> instream(172.22.1.10@45540): OK
clamd-mailcow_1 | Mon Dec 7 09:49:38 2020 -> instream(local): OK
clamd-mailcow_1 | Mon Dec 7 09:58:34 2020 -> instream(172.22.1.10@46910): OK
clamd-mailcow_1 | Mon Dec 7 10:04:24 2020 -> instream(local): OK
clamd-mailcow_1 | Mon Dec 7 10:07:30 2020 -> instream(172.22.1.10@48194): OK
clamd-mailcow_1 | Mon Dec 7 10:12:44 2020 -> instream(local): OK
clamd-mailcow_1 | Mon Dec 7 10:14:23 2020 -> instream(172.22.1.10@49214): OK
clamd-mailcow_1 | Mon Dec 7 10:17:19 2020 -> instream(local): OK
clamd-mailcow_1 | Mon Dec 7 10:17:33 2020 -> instream(172.22.1.10@49704): OK
clamd-mailcow_1 | Worker 22 died, stopping container waiting for respawn...
clamd-mailcow_1 | /clamd.sh: line 97: 22 Killed nice -n10 clamd
clamd-mailcow_1 | /clamd.sh: line 98: kill: (22) - No such process
clamd-mailcow_1 | Cleaning up tmp files...
clamd-mailcow_1 | Copying non-empty whitelist.ign2 to /var/lib/clamav/whitelist.ign2
clamd-mailcow_1 | File: /var/lib/clamav/whitelist.ign2
clamd-mailcow_1 | Size: 142 Blocks: 8 IO Block: 4096 regular file
clamd-mailcow_1 | Device: 801h/2049d Inode: 1724287 Links: 1
clamd-mailcow_1 | Access: (0644/-rw-r--r--) Uid: ( 700/ clamav) Gid: ( 700/ clamav)
clamd-mailcow_1 | Access: 2020-12-07 08:42:07.698887776 +0100
clamd-mailcow_1 | Modify: 2020-12-07 10:44:34.718272781 +0100
clamd-mailcow_1 | Change: 2020-12-07 10:44:35.406301724 +0100
clamd-mailcow_1 | Birth: -
clamd-mailcow_1 | dos2unix: converting file /var/lib/clamav/whitelist.ign2 to Unix format...
clamd-mailcow_1 | Running freshclam...
Example /var/log/messages entries related to the OOM:
Dec 7 10:41:21 mailstation kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Dec 7 10:41:21 mailstation kernel: 51489 total pagecache pages
Dec 7 10:41:21 mailstation kernel: 49539 pages in swap cache
Dec 7 10:41:21 mailstation kernel: Swap cache stats: add 7151097, delete 7101558, find 113127588/114180952
Dec 7 10:41:21 mailstation kernel: Free swap = 0kB
Dec 7 10:41:21 mailstation kernel: Total swap = 2095100kB
Dec 7 10:41:21 mailstation kernel: 1048446 pages RAM
Dec 7 10:41:21 mailstation kernel: 0 pages HighMem/MovableOnly
Dec 7 10:41:21 mailstation kernel: 35750 pages reserved
Dec 7 10:41:21 mailstation kernel: 0 pages hwpoisoned
Dec 7 10:41:21 mailstation kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 7 10:41:21 mailstation kernel: [ 6202] 401 6202 1050 63 6 3 24 0 anvil
Dec 7 10:41:21 mailstation kernel: [ 6203] 402 6203 1115 89 6 3 47 0 log
Dec 7 10:41:21 mailstation kernel: [ 6204] 402 6204 2060 0 8 3 174 0 managesieve-log
Dec 7 10:41:21 mailstation kernel: [ 6205] 401 6205 2813 256 8 3 308 0 stats
Dec 7 10:41:21 mailstation kernel: [ 6206] 0 6206 2367 454 10 3 397 0 config
Dec 7 10:41:21 mailstation kernel: [ 6208] 401 6208 6019 151 15 3 291 0 auth
Dec 7 10:41:21 mailstation kernel: [ 6226] 101 6226 10984 103 13 3 177 0 tlsmgr
Dec 7 10:41:21 mailstation kernel: [29783] 82 29783 59769 576 34 3 1513 0 php-fpm
Dec 7 10:41:21 mailstation kernel: [12579] 0 12579 27180 159 10 5 73 1 containerd-shim
Dec 7 10:41:21 mailstation kernel: [12594] 0 12594 61127 654 94 3 23559 0 rspamd
Dec 7 10:41:21 mailstation kernel: [12735] 101 12735 61127 615 91 3 22719 0 rspamd
Dec 7 10:41:21 mailstation kernel: [12736] 101 12736 61127 817 93 3 22616 0 rspamd
Dec 7 10:41:21 mailstation kernel: [12739] 101 12739 61127 501 96 3 22767 0 rspamd
Dec 7 10:41:21 mailstation kernel: [19409] 101 19409 463594 62805 795 5 41791 0 rspamd
Dec 7 10:41:21 mailstation kernel: [ 8628] 82 8628 59768 594 34 3 1587 0 php-fpm
Dec 7 10:41:21 mailstation kernel: [ 8636] 999 8636 88806 63793 169 3 3315 0 sogod
Dec 7 10:41:21 mailstation kernel: [ 8975] 999 8975 85152 525 162 3 63746 0 sogod
Dec 7 10:41:21 mailstation kernel: [22578] 82 22578 59770 592 34 3 1579 0 php-fpm
Dec 7 10:41:21 mailstation kernel: [22579] 82 22579 59770 612 34 3 1559 0 php-fpm
Dec 7 10:41:21 mailstation kernel: [19575] 0 19575 10767 18 24 3 109 0 systemd-journal
Dec 7 10:41:21 mailstation kernel: [19783] 0 19783 27180 101 11 4 69 1 containerd-shim
Dec 7 10:41:21 mailstation kernel: [19799] 0 19799 569 5 6 3 15 0 tini
Dec 7 10:41:21 mailstation kernel: [19874] 0 19874 933 38 7 3 32 0 clamd.sh
Dec 7 10:41:21 mailstation kernel: [19888] 0 19888 933 34 7 3 31 0 clamd.sh
Dec 7 10:41:21 mailstation kernel: [19889] 0 19889 933 24 7 3 48 0 clamd.sh
Dec 7 10:41:21 mailstation kernel: [19890] 700 19890 403098 233170 646 4 75501 0 clamd
Any help on this?
Prior to placing the issue, please check the following (fill out each checkbox with an X once done).
Description of the bug:
Since the last update containing https://github.com/mailcow/mailcow-dockerized/commit/567064ed509db373e52d67f944677984030a2389, clamd has been using much more memory, to the extent that the server OOM-kills it. The server has 4GB of RAM and has been running without issue (regularly updated).
I'm unsure whether it is a particular message that triggers this or the clamd update process. After the instance logged below, it ran fine all day before then causing problems again mid evening. I've had to disable clamd for now in the config file.
Docker container logs of affected containers:
Reproduction of said bug:
Logged into the server, stopped the clamd container, rebooted to make sure the server wasn't in an inconsistent state after the OOM. Observed the server for the working day, no problem; the issue then reoccurred mid-evening.
System information:
docker-default (enforce)
/usr/sbin/tcpdump (enforce)
/usr/lib/snapd/snap-confine (enforce)
/usr/lib/snapd/snap-confine//mount-namespace-capture-helper (enforce)
man_groff (enforce)
man_filter (enforce)
/usr/bin/man (enforce)
/usr/bin/lxc-start (enforce)
/usr/lib/connman/scripts/dhclient-script (enforce)
/usr/lib/NetworkManager/nm-dhcp-helper (enforce)
/usr/lib/NetworkManager/nm-dhcp-client.action (enforce)
/sbin/dhclient (enforce)
lxc-container-default-with-nesting (enforce)
lxc-container-default-with-mounting (enforce)
lxc-container-default-cgns (enforce)
lxc-container-default (enforce)
| Virtualization technology (KVM, VMware, Xen, etc - LXC and OpenVZ are not supported) | KVM |
| Server/VM specifications (Memory, CPU Cores) | 4GB, 1 core |
| Docker Version (docker version) | 19.03.12 |
| Docker-Compose Version (docker-compose version) | docker-compose version 1.27.2, build 18f557f9; docker-py version: 4.3.1; CPython version: 3.7.7; OpenSSL version: OpenSSL 1.1.0l 10 Sep 2019 |
| Reverse proxy (custom solution) | Nope |
git diff origin/master, any other changes to the code? If so, please post them. — No, nothing.
iptables -L -vn, ip6tables -L -vn, iptables -L -vn -t nat and ip6tables -L -vn -t nat. — Nope.
docker exec -it $(docker ps -qf name=acme-mailcow) dig +short stackoverflow.com @172.22.1.254 (set the IP accordingly, if you changed the internal mailcow network) and post the output. — 151.101.1.69 151.101.193.69 151.101.129.69 151.101.65.69