Closed Adorfer closed 5 years ago
Monitoring graphs are nice, but we don't know anything about the actual threads and zombies. :-) Can you please post a task list and/or grep the zombie pids?
ps aux | grep 'Z' | wc -l
802
for example:
root 32679 0.0 0.0 0 0 ? Zs 13:55 0:00 [timeout] <defunct>
root 32714 0.0 0.0 0 0 ? Zs 18:17 0:00 [timeout] <defunct>
root 32745 0.0 0.0 0 0 ? Zs 15:11 0:00 [timeout] <defunct>
root@nadmail:/home/adorfer# pstree -p -s 32679
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(32679)
root@nadmail:/home/adorfer# pstree -p -s 32714
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(32714)
root@nadmail:/home/adorfer# pstree -p -s 32745
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(32745)
root@nadmail:/home/adorfer# pstree -p -s 19877
systemd(1)───dockerd(712)───docker-containe(847)───docker-containe(28355)───unbound(28413)───timeout(19877)
full tree: https://paste.tecff.de/?a86f110ced952cb8#uTBOnWcebRRL1LNQQDYL/fJbJWejqwqZMIHUihqCt+E=
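A plain grep for 'Z' also matches any line that merely contains that letter, so the count above may be slightly off. As a minimal sketch (the function name is made up for illustration), zombies can be grouped by parent PID straight from ps output, which shows at a glance whether they all hang under the same process:

```shell
#!/bin/sh
# Count zombie processes per parent PID.
# Reads "STAT PPID PID COMMAND" rows on stdin (no header line), e.g. from:
#   ps -e -o stat= -o ppid= -o pid= -o comm=
# Prints "PPID COUNT" rows, busiest parent first.
count_zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' |
        sort -k2,2nr -k1,1n
}
```

Usage on the host: ps -e -o stat= -o ppid= -o pid= -o comm= | count_zombies_by_parent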
Only the zombies are important. Looks like unbound-mailcow cannot make DNS requests.
So what is broken? How to check?
"dig" on the host itself is working fine.
Is it only unbound to create those zombies?
Can you give me access again?
As this is Ubuntu 18.04 LTS, systemd-resolved is running, which (according to /etc/resolv.conf) listens on the hardcoded address 127.0.0.53. So I probably have to disable resolved and switch to dnsmasq, since I could make that listen on 127.0.0.1.
unbound-mailcow is used as resolver in mailcow. It prevents people from using public resolvers for RBL lookups (mostly blocked) and gives them a DNSSEC enabled resolver.
Docker containers always query 127.0.0.11 (the Dockerd DNS proxy) no matter their config. You can remove...
dns:
- ${IPV4_NETWORK:-172.22.1}.254
...from docker-compose.yml for each container and run docker-compose up -d. It will use your system's resolver then but still query 127.0.0.11 inside each container.
If I understand correctly, the proposal is not to have the other containers use unbound for DNS, but to use the DNS of the VM/Docker host? (Just to make sure that I get the drift. My assumption would have been that there is a way to find out why unbound is creating zombies.)
To ask again: Removing the "dns:"-statements from the /opt/mailcow-dockerized/docker-compose.yml
What is the consequence of this action?
(What is that 127.0.0.11 about, what do i have to do then? Sorry, i am totally unfamiliar with the networking of docker and i do not get this from the documentation of mailcow. perhaps it's in there, but i am probably looking at the wrong spot.)
in other words.
It will use your systems resolver then but still query 127.0.0.11 inside each container.
what kind of error messages or spam(?) will i have to face as a result of that?
Removing the dns parameters will still use 127.0.0.11 in every containers resolv.conf (that's ok!).
Docker uses an internal DNS proxy, so when you use an external resolver, you are still able to resolve internal names inside containers. 8.8.8.8 wouldn't know who postfix-mailcow is. So it catches those queries to a specific zone and returns its own data. Everything else will be forwarded to 8.8.8.8.
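To see this from inside a container, one can look at which resolver its /etc/resolv.conf points to. A minimal sketch, assuming the container ships cat; the helper name and the postfix container name are illustrative guesses based on the naming seen in this thread:

```shell
#!/bin/sh
# Print the nameserver addresses found in a resolv.conf read from stdin.
resolv_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Usage (container name is an assumption, adjust to the output of "docker ps"):
#   docker exec mailcowdockerized_postfix-mailcow_1 cat /etc/resolv.conf | resolv_nameservers
```

If the explanation above holds, this should print 127.0.0.11 whether or not the dns: entries are present in docker-compose.yml.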
Can you explain?
Reading this i would like to ask how to change internal dns (mailcow) to another dns. Google DNS is not very "private" ...
We don't use Google DNS. This was an example... we use unbound-mailcow.
OK, I was just wondering... So unbound-mailcow uses the DNS settings from Docker, i.e. its host's /etc/resolv.conf?
1) I tried replacing systemd-resolved with dnsmasq.
I have mitigated the issue for now by adding this to /etc/crontab:
0 */6 * * * root docker restart mailcowdockerized_unbound-mailcow_1
(this keeps the number of zombies below 2000)
Can you capture packets and trace the requests?
You already removed the DNS settings from the docker-compose.yml? Unbound is not in use then anymore.
I did remove the "dns: - ${IPV4_NETWORK:-172.22.1}.254" entries from docker-compose.yml and restarted Docker as requested. The zombie processes still piled up.
Can you capture packets and trace the requests?
I can pcap with tcpdump. But I have no clue about Docker networking, so I actually have no idea on which interface (and from which side) to listen.
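One possible starting point, offered as a sketch rather than a mailcow-specific recipe: on Linux, tcpdump accepts the pseudo-interface "any", so a capture on the Docker host can pick up DNS traffic on the Docker bridge without knowing the bridge name. The function name, the default file name and the DRY_RUN switch below are made up for illustration.

```shell
#!/bin/sh
# Capture DNS traffic (port 53) on the Docker host into a pcap file.
#   $1: interface to listen on (default: "any", i.e. all interfaces)
#   $2: output file (default: dns.pcap)
# With DRY_RUN set, only print the tcpdump command instead of running it.
capture_dns() {
    iface=${1:-any}
    out=${2:-dns.pcap}
    set -- tcpdump -n -i "$iface" -w "$out" port 53
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Run capture_dns as root, stop it with Ctrl-C after a few minutes, and inspect the result with tcpdump -n -r dns.pcap. To narrow the capture down, pass the name of the Docker bridge interface as shown by ip link.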
I can also confirm the problem with the zombie processes (on three servers with Mailcow and Ubuntu 18.04).
Example: --> Number of threads 6600 --> Execute the command "docker restart mailcowdockerized_unbound-mailcow_1" --> Number of threads 1040
Still never run into this...
Is disable_monitoring = true in data/conf/rspamd/local.d/options.inc?
Oh, and could you please check your syslog for "more than 100 concurrent queries"? :-)
root@serverxxx:/data/docker_projects/mailcow-dockerized/data/conf/rspamd/local.d# cat options.inc
dns { enable_dnssec = true; }
map_watch_interval = 30s;
dns { timeout = 15s; retransmits = 5; }
disable_monitoring = true;
I have not found "more than 100 concurrent queries" in the syslog of the servers.
Oh, and could you please check your syslog for "more than 100 concurrent queries"? :-)
I have not found "more than 100 concurrent queries" in the syslog of the servers.
In unbound log I noticed the following message, which was already reported in #585.
unbound[1:0] error: Could not open logfile /dev/stdout: Permission denied
I fixed it in the upcoming image.
@Adorfer
I think the checkmk agent causes the problem: if you deactivate the checkmk monitoring, there are no [timeout] <defunct> processes.
Disabling the check_mk agent is not an option here. I can only ask for a modification of the agent. What do you propose?
At the moment I am running this from the crontab:
0 */6 * * * root docker restart $(docker ps|grep mailcowdockerized_unbound|cut -c252-|tr -d " ")
(Without it, the number of zombies still skyrockets, rendering the system unresponsive sooner or later.)
If there is a smarter way to handle this, please let me know.
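As a workaround only (it does not address the cause), the restart could be made conditional on the actual number of zombies under the unbound container, and the container name could be looked up with docker ps --filter/--format instead of cutting a fixed column. A rough sketch; the function names and the limit of 500 are arbitrary assumptions, and the name filter is the one from the crontab line above:

```shell
#!/bin/sh
# Count zombies whose parent is PID $1; reads "STAT PPID" rows on stdin.
zombies_of() {
    awk -v parent="$1" '$1 ~ /^Z/ && $2 == parent { n++ } END { print n + 0 }'
}

# Succeed when count ($1) exceeds limit ($2).
should_restart() {
    [ "$1" -gt "$2" ]
}

# Restart the unbound container only when it has piled up too many zombies.
restart_unbound_if_needed() {
    limit=${1:-500}   # arbitrary example threshold
    name=$(docker ps --filter name=mailcowdockerized_unbound --format '{{.Names}}' | head -n 1)
    [ -n "$name" ] || return 0
    # Host PID of the container's main process (unbound in the pstree output above).
    pid=$(docker inspect --format '{{.State.Pid}}' "$name")
    count=$(ps -e -o stat= -o ppid= | zombies_of "$pid")
    if should_restart "$count" "$limit"; then
        docker restart "$name"
    fi
}
```

Calling restart_unbound_if_needed from a script that cron runs every few minutes would avoid restarting the resolver when nothing has accumulated.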
The issue was resolved around the end of 2018; after pulling/updating at the beginning of 2019-01 it disappeared. (The zig-zag before was due to the cron-driven restart of the DNS resolver (unbound) container.)
If anybody likes to drill down to the exact commit, fixing the issue: Feel free to feed me instruction how to test. But for now i will be closing the issue.
Strange.
But I have never seen it happening anyway. So..
If anybody likes to drill down to the exact commit, fixing the issue: Feel free to feed me instruction how to test. But for now i will be closing the issue.
It wasn't mailcow, the problem was caused by the check_mk agent. After an update of the agent everything was ok. https://mathias-kettner.de/check_mk-werks.php?edition_id=enterprise&branch=master&version=1.6.0i1&werk_search=docker
As the German saying goes: "Wer misst, misst Mist" (roughly: whoever measures, measures garbage) :-)
On a VM (Ubuntu 18.04 LTS, 2C, around 5 GB RAM) I see the number of threads increasing at a near-constant rate of about 2k per 24h,
plus a growing number of zombie processes (e.g. "There are 272 zombie processes." after 15h uptime). In the end the system becomes unresponsive (e.g. no web login for users, and SSH login takes minutes).
Restarting the VM cures the effect, but I would like to find the trigger/cause.
(running update.sh today did not change the behaviour.)
What kind of debug information should be provided?