Open jinuxstyle opened 6 years ago
Can anyone comment on this issue? It's really a critical and annoying issue in our use case, and we have no idea what might cause it.
Could anyone help?
Similar issue: https://github.com/moby/moby/issues/31560
As an alternative, we are also trying to change the network mode in our docker-compose files so that the containers in a project can reach each other via localhost/127.0.0.1 and no longer rely on the DNS resolver. For this approach we would like to disable the embedded DNS resolver entirely. Someone has already proposed adding an option for that; adding the issue here for reference: https://github.com/docker/libnetwork/issues/1085
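For reference, the kind of change we are experimenting with looks roughly like this (service and image names here are placeholders): one service joins the other's network namespace, so they reach each other on 127.0.0.1 and the embedded resolver is bypassed for inter-container traffic.
# docker-compose.yml (sketch)
services:
  app:
    image: our-app-image
  worker:
    image: our-worker-image
    network_mode: "service:app"   # share app's network namespace; talk to it via 127.0.0.1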
@jinuxstyle for the original issue, can you check if you get a consistent repro in latest versions of docker like 17.12 please and come up with some repro steps to isolate the situation?
Also do you notice any interesting errors in the docker daemon log when the above situation happened?
@ddebroy I will try the newer version as you suggested. Regarding reproduction, the issue has only appeared in our production environment after hours, sometimes days, of high workload, and we have not yet figured out steps to reproduce it.
As for logs, we have not captured anything interesting yet: we are using journald as the log driver, and the daemon's messages are drowned out by the flood of service logs. I will post here once we find anything interesting in the Docker daemon logs.
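In the meantime, to pull the daemon's own messages out of the flood, we plan to filter the journal rather than grep everything (a sketch, assuming a systemd host; container output routed through the journald driver carries the container tag as its identifier, while the daemon itself logs as dockerd):
# only dockerd's own messages under the docker.service unit
journalctl -u docker.service SYSLOG_IDENTIFIER=dockerd --since "1 hour ago"
# and grep for the resolver entries once anything interesting shows up
journalctl -u docker.service | grep -i resolver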
I'm also seeing this; it's an obnoxious problem. I see the following in my logs when I start my compose containers:
Feb 13 11:49:04 tjnii-debvm1 dockerd[564]: time="2018-02-13T11:49:04.072007695-05:00" level=warning msg="unknown container" container=18e49573fd1ef963d9137be1ee0f909226ec4d70894ccd955a58b9bb2f107c8b module=libcontainerd namespace=plugins.moby
Feb 13 11:49:04 tjnii-debvm1 dockerd[564]: time="2018-02-13T11:49:04-05:00" level=info msg="shim reaped" id=18e49573fd1ef963d9137be1ee0f909226ec4d70894ccd955a58b9bb2f107c8b module="containerd/tasks"
Feb 13 11:49:04 tjnii-debvm1 dockerd[564]: time="2018-02-13T11:49:04.102107146-05:00" level=info msg="ignoring event" module=libcontainerd namespace=plugins.moby topic=/tasks/delete type="*events.TaskDelete"
Feb 13 11:49:04 tjnii-debvm1 dockerd[564]: time="2018-02-13T11:49:04.102561846-05:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 13 11:49:04 tjnii-debvm1 dockerd[564]: time="2018-02-13T11:49:04.160180997-05:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/containers/delete type="*events.ContainerDelete"
The 127.0.0.11 nameserver appears to be refusing connections:
[root@0be2964c448f /]# dig A docker.io @127.0.0.11
; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7_4.2 <<>> A docker.io @127.0.0.11
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 53852
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;docker.io. IN A
;; Query time: 4 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Tue Feb 13 16:50:53 UTC 2018
;; MSG SIZE rcvd: 38
I do not see any port 53 NAT rules.
# iptables -L -n -v -t nat | grep 53
#
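For anyone else checking: as far as I understand, the embedded resolver's port 53 rules live inside the container's own network namespace rather than in the host's nat table, so that namespace is where to look. A sketch, with mycontainer as a placeholder name:
pid=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# the DOCKER_OUTPUT chain should hold two DNAT rules rewriting 127.0.0.11:53; an error here means the chain is missing
nsenter -n -t "$pid" iptables -t nat -S DOCKER_OUTPUT
# the resolver's UDP and TCP listeners on 127.0.0.11, owned by dockerd
nsenter -n -t "$pid" ss -tulpan | grep 127.0.0.11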
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:54 2017
OS/Arch: linux/amd64
Experimental: false
I tracked my problem down to the first resolver in my host's resolv.conf returning the REFUSED response, with the Docker DNS abstraction not letting the container fail over to the other DNS servers defined.
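A possible stopgap (not a fix for the missing failover) is to point the affected containers at a known-good upstream explicitly; the embedded resolver then forwards to that server instead of the first entry in the host's resolv.conf. A sketch, with my-bridge-net as a placeholder network name:
docker run --rm --dns 8.8.8.8 --network my-bridge-net busybox nslookup docker.io
The same can be set per service in a compose file with the dns: option.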
For the DNS-related logs, you need to enable debug logging for the daemon (https://success.docker.com/article/How_do_I_enable_'debug'_logging_of_the_Docker_daemon) and look for log entries prefixed with resolver.
Example:
Feb 13 17:08:49 vm0 dockerd[19974]: time="2018-02-13T17:08:49.835865664Z" level=debug msg="[resolver] query docker.io. (A) from 172.18.0.4:60740, forwarding to udp:168.63.129.16"
Feb 13 17:08:49 vm0 dockerd[19974]: time="2018-02-13T17:08:49.880953768Z" level=debug msg="[resolver] received A record \"54.236.81.192\" for \"docker.io.\" from udp:168.63.129.16"
Feb 13 17:08:49 vm0 dockerd[19974]: time="2018-02-13T17:08:49.881174378Z" level=debug msg="[resolver] received A record \"34.234.103.99\" for \"docker.io.\" from udp:168.63.129.16"
Feb 13 17:08:49 vm0 dockerd[19974]: time="2018-02-13T17:08:49.881384887Z" level=debug msg="[resolver] received A record \"52.3.153.154\" for \"docker.io.\" from udp:168.63.129.16"
Feb 13 17:11:52 vm0 dockerd[19974]: time="2018-02-13T17:11:52.895153955Z" level=debug msg="[resolver] query google.com. (A) from 172.18.0.4:47000, forwarding to udp:168.63.129.16"
Feb 13 17:11:52 vm0 dockerd[19974]: time="2018-02-13T17:11:52.947160554Z" level=debug msg="[resolver] received A record \"216.58.195.238\" for \"google.com.\" from udp:168.63.129.16"
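For reference, a minimal way to enable the debug logging mentioned above (a sketch, assuming the config lives at /etc/docker/daemon.json and the host uses systemd):
# /etc/docker/daemon.json
{
  "debug": true
}
Then restart the daemon (or send dockerd a SIGHUP, which should pick up the debug flag without a restart) and follow the resolver entries:
sudo systemctl restart docker
journalctl -u docker.service -f | grep resolver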
@RootTJNII The Docker resolver should try all the external DNS servers. Do you see anything in the resolver logs after enabling the debug setting as mentioned?
@jinuxstyle @RootTJNII do you guys have any update on this?
The namespaces ...
# nsenter -n -t $(docker inspect --format {{.State.Pid}} mycontainerA) ss -tulpan |grep ockerd
udp UNCONN 11520 0 127.0.0.11:37169 0.0.0.0:* users:(("dockerd",pid=3620,fd=56))
tcp LISTEN 0 128 127.0.0.11:42029 0.0.0.0:* users:(("dockerd",pid=3620,fd=58))
Notice that 37169/udp and 42029/tcp are for mycontainerA. For your container the ports will be different.
# nsenter -n -t $(docker inspect --format {{.State.Pid}} mycontainerA) dig -p 42029 +tcp +short google.com @127.0.0.11
216.58.204.110
# nsenter -n -t $(docker inspect --format {{.State.Pid}} mycontainerA) dig -p 37169 +short google.com @127.0.0.11
216.58.204.110
Make sure you have set the correct ports, in my case 37169/udp and 42029/tcp.
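If it helps, the two ports can be pulled out of the ss output and queried in one go (a sketch; mycontainerA is the placeholder name from above):
pid=$(docker inspect --format '{{.State.Pid}}' mycontainerA)
udp_port=$(nsenter -n -t "$pid" ss -tulpan | awk '$1=="udp" && $5 ~ /^127\.0\.0\.11:/ {split($5,a,":"); print a[2]; exit}')
tcp_port=$(nsenter -n -t "$pid" ss -tulpan | awk '$1=="tcp" && $5 ~ /^127\.0\.0\.11:/ {split($5,a,":"); print a[2]; exit}')
echo "resolver ports: udp=$udp_port tcp=$tcp_port"
nsenter -n -t "$pid" dig -p "$udp_port" +short google.com @127.0.0.11
nsenter -n -t "$pid" dig -p "$tcp_port" +tcp +short google.com @127.0.0.11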
echo "*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER_OUTPUT - [0:0]
:DOCKER_POSTROUTING - [0:0]
-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:42029
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:37169
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 42029 -j SNAT --to-source :53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 37169 -j SNAT --to-source :53
COMMIT" | nsenter -n -t $(docker inspect --format {{.State.Pid}} mycontainerA) iptables-restore
This will work as long as the above steps worked.
# nsenter -n -t $(docker inspect --format {{.State.Pid}} mycontainerA) dig +short google.com @127.0.0.11
216.58.204.110
# nsenter -n -t $(docker inspect --format {{.State.Pid}} mycontainerA) dig +tcp +short google.com @127.0.0.11
216.58.204.110
I installed a fresh version of docker and I am seeing this.
I just found this issue from 2018 when my containers suddenly stopped resolving each other, without me modifying anything on a previously working setup. An earlier apt upgrade may have been involved, with the containers perhaps only hitting the issue once they were eventually recreated.
The resolver is responding, however, and internet access works from within the containers; only the other container names fail to resolve.
A workaround that works for me:
# /etc/systemd/system/docker.service.d/docker.root.conf
[Unit]
After=network-online.target
[Service]
ExecStartPre=/bin/sleep 10
The key thing that has helped is the 10 second sleep before starting the docker service.
What does this accomplish?
Well, prior to adding this, DNS wouldn't resolve for one or more containers on the host after the server was booted and Docker started. Interestingly, there were cases where it would resolve for some containers but not all. It seems like it could be a race condition of some kind, perhaps some resolver service not yet running at the time the containers are spun up.
Running sudo systemctl restart docker would fix the issue until the system was rebooted, after which it would reoccur.
The workaround above seems to help with the race condition; I assume it allows the resolver service to finish starting up before containers are spun up. The end result is that I no longer have issues with DNS failing to resolve in containers after a reboot.
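For anyone wanting to try the same thing, the steps to put the drop-in in place are the standard systemd ones (a sketch; note that on some setups After=network-online.target only has an effect if something also pulls that target in, e.g. an accompanying Wants=network-online.target line):
sudo mkdir -p /etc/systemd/system/docker.service.d
sudoedit /etc/systemd/system/docker.service.d/docker.root.conf   # paste the [Unit]/[Service] snippet above
sudo systemctl daemon-reload
sudo systemctl restart docker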
I tried everything: restarting the whole server, restarting the Docker daemon, restarting the container, recreating the container, wiping and recreating the entire docker-compose stack. Nothing fixes my issue.
Have you tried updating the docker DNS settings?
# /etc/docker/daemon.json
{
"dns": ["8.8.8.8"]
}
Then restart Docker:
sudo systemctl daemon-reload
sudo systemctl restart docker
My containers can access the internet just fine; they just fail to resolve each other. So my upstream DNS setting is irrelevant here.
I set dockerd to debug logging and watched the journal while doing DNS lookups of other containers from a container in the stack. Apparently nothing gets logged while the names fail to resolve.
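For reference, the check I am running from inside one of the stack's containers (db is just a placeholder for a sibling service name):
dig +short db @127.0.0.11          # sibling service name - this is what fails to resolve
dig +short docker.io @127.0.0.11   # external name - this still resolves, matching the working internet access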
Client: Docker Engine - Community
Version: 27.0.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.15.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.28.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Server Version: 27.0.3
Storage Driver: overlay2
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: active
Is Manager: true
Managers: 1
Nodes: 3
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2bf793ef6dc9a18e00cb12efb64355c2c9d5eb41
runc version: v1.1.13-0-g58aa920
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.6.31-current-odroidxu4
Operating System: Armbian 23.8.1 bullseye
OSType: linux
Architecture: armv7l
This is really weird. Restarting my worker nodes solved my issue.
Hi
I have a service started via docker-compose. It has multiple containers and a user-defined network of type bridge. A strange issue appeared: after running for hours or days and then being restarted, the containers could not reach each other by service name, nor could they reach any hostname outside the containers. It can only be recovered by restarting the Docker daemon.
I learned a bit about the name-resolution mechanism between containers on the same user-defined network, and found that it relies on an internal DNS resolver inside each container, serving on address 127.0.0.11:53. So I looked into containers that worked well and found the corresponding iptables rules for that address. But when the issue occurred, I could not see those rules, and the resolver seemed not to have started.
So far, based on my investigation, I suspect the resolver is not started under certain conditions. But I don't know what could cause the resolver not to start when a container is launched on a user-defined network. Any ideas or insights?