mailcow / mailcow-dockerized

mailcow: dockerized - 🐮 + 🐋 = 💕
https://mailcow.email
GNU General Public License v3.0

number of threads (and zombie process) #1664

Closed Adorfer closed 5 years ago

Adorfer commented 6 years ago

On a VM (Ubuntu 18.04 LTS, 2 cores, around 5 GB RAM) I see the number of threads increasing at a nearly constant rate of about 2k per 24 h.

There are also a number of zombie processes (e.g. "There are 272 zombie processes." after 15 h of uptime). In the end the system becomes unresponsive (no web login for users, SSH login takes minutes).

Restarting the VM cures the effect, but I would like to find the trigger/cause.

(running update.sh today did not change the behaviour.)

What kind of debug information should be provided?

[image: monitoring graph]

andryyy commented 6 years ago

Monitoring graphs are nice, but we don't know anything about the actual threads and zombies. :-) Can you please post a task list and/or grep the zombie pids?

Adorfer commented 6 years ago

ps aux | grep 'Z'|wc -l
802

for example:

root     32679  0.0  0.0      0     0 ?        Zs   13:55   0:00 [timeout] <defunct>
root     32714  0.0  0.0      0     0 ?        Zs   18:17   0:00 [timeout] <defunct>
root     32745  0.0  0.0      0     0 ?        Zs   15:11   0:00 [timeout] <defunct>

root@nadmail:/home/adorfer# pstree -p -s 32679
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(32679)
root@nadmail:/home/adorfer# pstree -p -s 32714
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(32714)
root@nadmail:/home/adorfer# pstree -p -s 32745
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(32745)
root@nadmail:/home/adorfer# pstree -p -s 19877
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(19877)

full tree: https://paste.tecff.de/?a86f110ced952cb8#uTBOnWcebRRL1LNQQDYL/fJbJWejqwqZMIHUihqCt+E=
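Note that `ps aux | grep 'Z'` counts every line that contains a capital Z anywhere, so the figure can be inflated. A minimal sketch of a stricter count, grouped by parent PID, could look like this. The function name `zombies_by_parent` is made up for illustration; it only assumes the `ppid`, `stat` and `comm` output columns of procps `ps`.

```shell
# Hypothetical helper: summarise zombie processes by parent PID and command.
# Expects "ppid stat comm" on stdin, one process per line.
zombies_by_parent() {
  # Keep only rows whose state column starts with Z (zombie),
  # count them per (ppid, command) pair, and list the biggest group first.
  awk '$2 ~ /^Z/ { count[$1 " " $3]++ }
       END { for (k in count) print count[k], k }' | sort -rn
}
```

On a live system it would be fed with `ps -eo ppid=,stat=,comm= | zombies_by_parent`. In the case above, the top line should point at the unbound PID (28413) with `timeout` as the defunct command.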

andryyy commented 6 years ago

Only the zombies are important. Looks like unbound-mailcow cannot make DNS requests.

Adorfer commented 6 years ago

So what is broken? How to check?

"dig" on the host itself is working fine.

andryyy commented 6 years ago

Is it only unbound that creates those zombies?

Can you give me access again?

Adorfer commented 6 years ago

As this is Ubuntu 18.04 LTS, systemd-resolved is running, which (according to /etc/resolv.conf) listens on the hardcoded address 127.0.0.53. So I probably have to disable systemd-resolved and switch to dnsmasq, since I could make that listen on 127.0.0.1.

andryyy commented 6 years ago

unbound-mailcow is used as resolver in mailcow. It prevents people from using public resolvers for RBL lookups (mostly blocked) and gives them a DNSSEC enabled resolver.

Docker containers always query 127.0.0.11 (the Dockerd DNS proxy) no matter their config. You can remove...

      dns:
        - ${IPV4_NETWORK:-172.22.1}.254

...from docker-compose.yml for each container and run docker-compose up -d. It will then use your system's resolver but still query 127.0.0.11 inside each container.

Adorfer commented 6 years ago

If I understand correctly, the proposal is that the other containers should not use unbound for DNS, but the DNS of the VM/Docker host instead? (Just to make sure that I get the right drift. My assumption would have been that there is a way to find out why unbound is creating zombies.)

Adorfer commented 6 years ago

To ask again: what is the consequence of removing the "dns:" statements from /opt/mailcow-dockerized/docker-compose.yml?

(What is that 127.0.0.11 about, and what do I have to do then? Sorry, I am totally unfamiliar with Docker networking and I do not get this from the mailcow documentation. Perhaps it is in there, but I am probably looking in the wrong spot.)

In other words:

"It will use your systems resolver then but still query 127.0.0.11 inside each container."

What kind of error messages or spam(?) will I have to face as a result of that?

andryyy commented 6 years ago

Removing the dns parameters will still leave 127.0.0.11 in every container's resolv.conf (that's ok!).

Docker uses an internal DNS proxy, so when you use an external resolver, you are still able to resolve internal names inside containers. 8.8.8.8 wouldn't know who postfix-mailcow is. So it catches those queries to a specific zone and returns its own data. Everything else will be forwarded to 8.8.8.8.
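To check which resolver a container really sees, its /etc/resolv.conf can be inspected. The sketch below only extracts the nameserver entries from resolv.conf-style text. The helper name `nameservers` is made up for illustration, and the container name in the usage note is an assumption modelled on the unbound container name mentioned in this thread.

```shell
# Hypothetical helper: print the nameserver addresses found in
# resolv.conf-style text read from stdin, one address per line.
nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}
```

Assuming a container named mailcowdockerized_postfix-mailcow_1 (check `docker ps` for the real name), `docker exec mailcowdockerized_postfix-mailcow_1 cat /etc/resolv.conf | nameservers` should print 127.0.0.11 according to the explanation above, whether or not the dns entries are present in docker-compose.yml.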

Can you explain?

ghost commented 6 years ago

Reading this, I would like to ask how to change the internal DNS (mailcow) to another DNS. Google DNS is not very "private"...

andryyy commented 6 years ago

We don't use Google DNS. This was an example... we use unbound-mailcow.

ghost commented 6 years ago

OK, I was just wondering... So unbound-mailcow uses the DNS settings from Docker, e.g. its host's /etc/resolv.conf?

Adorfer commented 6 years ago

1) I tried replacing systemd-resolved with dnsmasq.

I have mitigated the issue for now by adding this to /etc/crontab:

0 */6 * * * root docker restart mailcowdockerized_unbound-mailcow_1

(This keeps the number of zombies below 2000.)

andryyy commented 6 years ago

Can you capture packets and trace the requests?

andryyy commented 6 years ago

You already removed the DNS settings from the docker-compose.yml? Unbound is not in use then anymore.

Adorfer commented 6 years ago

I did remove the "dns: - ${IPV4_NETWORK:-172.22.1}.254" from docker-compose.yml and restarted Docker as requested. The zombies still piled up afterwards.

"Can you capture packets and trace the requests?"

I can capture a pcap with tcpdump. But I have no clue about Docker networking, so I have no idea which interface to listen on (and from which side).
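One way to start without knowing the Docker network layout is to capture DNS traffic on all interfaces of the host and narrow it down later. The following is only a sketch: the wrapper name `capture_dns` and the `DRY_RUN` switch are made up for illustration, and the bridge name used in the usage note is an assumption about the mailcow network setup. The tcpdump options used (`-n`, `-i`, `-w`, and the `port 53` filter) are standard.

```shell
# Hypothetical wrapper around tcpdump for DNS traffic (port 53).
#   $1: interface to listen on (default "any", i.e. all interfaces on Linux)
#   $2: pcap file to write (default dns.pcap)
# With DRY_RUN set to a non-empty value the command is printed, not executed.
capture_dns() {
  iface=${1:-any}
  outfile=${2:-dns.pcap}
  # -n: do not resolve names, so the capture itself causes no DNS lookups
  ${DRY_RUN:+echo} tcpdump -n -i "$iface" -w "$outfile" port 53
}
```

Run as root, `capture_dns` records everything on port 53 until interrupted. If the mailcow bridge is called br-mailcow on your host (an assumption; `ip link` shows the actual name), `capture_dns br-mailcow` limits the capture to traffic between the containers and the host.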

christianbur commented 6 years ago

I can also confirm the problem with the zombie processes (on three servers with Mailcow and Ubuntu 18.04).

Example:
--> Number of threads: 6600
--> Execute the command "docker restart mailcowdockerized_unbound-mailcow_1"
--> Number of threads: 1040

andryyy commented 6 years ago

Still never run into this...

Is disable_monitoring = true in data/conf/rspamd/local.d/options.inc?

andryyy commented 6 years ago

Oh, and could you please check your syslog for "more than 100 concurrent queries"? :-)

christianbur commented 6 years ago

root@serverxxx:/data/docker_projects/mailcow-dockerized/data/conf/rspamd/local.d# cat options.inc
dns { enable_dnssec = true; }
map_watch_interval = 30s;
dns { timeout = 15s; retransmits = 5; }
disable_monitoring = true;


I have not found "more than 100 concurrent queries" in the syslog of the servers.

christianbur commented 6 years ago

In unbound log I noticed the following message, which was already reported in #585.

unbound[1:0] error: Could not open logfile /dev/stdout: Permission denied

andryyy commented 6 years ago

I fixed it in the upcoming image.

christianbur commented 5 years ago

@Adorfer I think the checkmk agent causes the problem; if you deactivate the checkmk monitoring, there are no [timeout] <defunct> processes.

Adorfer commented 5 years ago

Disabling the check_mk agent is not an option here. I can only ask for a modification of the agent. What do you propose?

Adorfer commented 5 years ago

At the moment I am running this from the crontab:

0 */6 * * * root docker restart $(docker ps|grep mailcowdockerized_unbound|cut -c252-|tr -d " ")

(Without it, the number of zombies still skyrockets, rendering the system unresponsive sooner or later.)

[image: monitoring graph]

If there is a smarter way to handle this, please let me know.
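As a possible refinement of the cron job, the restart could be made conditional on the actual zombie count, and the container could be looked up by name instead of by column position. This is only a sketch of a workaround, not a fix for the cause. The function names and the default threshold of 500 are made up for illustration; `docker ps -q --filter name=...` and GNU `xargs -r` are assumed to be available on the host.

```shell
# Hypothetical helper: count zombies from a list of process states on stdin
# (one state per line, as printed by `ps -eo stat=`).
zombie_count() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: restart the unbound container only when the number
# of zombies exceeds the limit given as $1 (default 500, an arbitrary value).
restart_unbound_if_needed() {
  limit=${1:-500}
  n=$(ps -eo stat= | zombie_count)
  if [ "$n" -gt "$limit" ]; then
    # -q prints only container IDs; xargs -r skips the restart if none match.
    docker ps -q --filter name=mailcowdockerized_unbound | xargs -r docker restart
  fi
}
```

Saved as a script that ends with a call to `restart_unbound_if_needed`, this could replace the fixed six-hour restart with a more frequent but usually idle check.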

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Adorfer commented 5 years ago

The issue was resolved some time around the end of 2018; after pulling/updating at the beginning of 2019-01 it disappeared. [image: monitoring graph] (The zig-zag before was due to the cron-ed restart of the DNS resolver (unbound) container.)

If anybody would like to drill down to the exact commit that fixed the issue, feel free to give me instructions on how to test. But for now I will be closing the issue.

andryyy commented 5 years ago

Strange.

But I have never seen it happening anyway. So..

christianbur commented 5 years ago

It wasn't mailcow, the problem was caused by the check_mk agent. After an update of the agent everything was ok. https://mathias-kettner.de/check_mk-werks.php?edition_id=enterprise&branch=master&version=1.6.0i1&werk_search=docker

As the German saying goes: "Wer misst, misst Mist" (roughly: "whoever measures, measures garbage") :-)