MosheMoradSimgo commented 7 years ago

Hi,

We are running alpine (3.4) in a docker container over a Kubernetes cluster (GCP).

We have been seeing some anomalies where our thread is stuck for 2.5 sec. After some research using strace we saw that DNS resolution times out once in a while.

Here are some examples:

23:18:27 recvfrom(5, "\f\361\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\243\213\360\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000045>
23:18:27 recvfrom(5, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000014>
23:18:27 clock_gettime(CLOCK_REALTIME, {1487114307, 714908396}) = 0 <0.000015>
23:18:27 poll([{fd=5, events=POLLIN}], 1, 2499) = 0 (Timeout) <2.502024>

09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, "\354\211\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\30\220\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000041>
09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, 0x7ffec3d9b0b0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
09:04:27 clock_gettime(CLOCK_REALTIME, {1487149467, 555317749}) = 0 <0.000008>
09:04:27 poll([{fd=5<UDP:[0.0.0.0:36148]>, events=POLLIN}], 1, 2498) = 0 (Timeout) <2.499671>


09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, " B\201\200\0\1\0\1\0\0\0\0\2db\6devone\5*****\3net\0\0\1\0\1\300\f\0\1\0\1\0\0\0\200\0\4h\307\16N", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 53 <0.000011>
09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
09:18:47 clock_gettime(CLOCK_REALTIME, {1487150327, 679292144}) = 0 <0.000005>
09:18:47 poll([{fd=5<UDP:[0.0.0.0:47282]>, events=POLLIN}], 1, 2497) = 0 (Timeout) <2.498797>

And a good example:

08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, "\20j\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\34\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\n\200\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000014>
08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, 0x7ffec3d9aeb0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
08:22:25 clock_gettime(CLOCK_REALTIME, {1487146945, 638264715}) = 0 <0.000010>
08:22:25 poll([{fd=5<UDP:[0.0.0.0:59162]>, events=POLLIN}], 1, 2498) = 1 ([{fd=5, revents=POLLIN}]) <0.000010>
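The part of each captured buffer that says how the server answered is the 12-byte DNS header at its start. As a minimal sketch, the header can be decoded as follows. The helper name `parse_dns_header` is made up for illustration, and the byte strings are transcribed by hand from the octal escapes in the strace lines above.

```python
import struct

# Standard DNS response codes; only the common ones are listed here.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def parse_dns_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte header at the start of a DNS message."""
    if len(packet) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    # Six big-endian 16-bit fields: id, flags, and the four section counts.
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", packet[:12])
    rcode = flags & 0x000F  # the low four bits of the flags carry the RCODE
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),
        "rcode": RCODES.get(rcode, str(rcode)),
        "questions": qd,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


# First 12 bytes of the 23:18:27 reply: "\f\361\201\203\0\1\0\0\0\1\0\0"
nxdomain_reply = b"\x0c\xf1\x81\x83\x00\x01\x00\x00\x00\x01\x00\x00"
# First 12 bytes of the 09:18:47 reply: " B\201\200\0\1\0\1\0\0\0\0"
answer_reply = b"\x20\x42\x81\x80\x00\x01\x00\x01\x00\x00\x00\x00"
```

Read this way, the 23:18:27 reply is an NXDOMAIN for the `.svc.cluster.local` candidate and the 09:18:47 reply carries one answer. In each stuck trace the resolver then appears to wait in poll() for a further reply, presumably the one for the other record type, which does not arrive before the timeout.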

In the past we already had some issues with DNS resolving in an older version (3.3), which had been resolved since we moved to 3.4 (or so we thought).

Is this a known issue? Does anybody have a solution / workaround / suggestion what to do?

Thanks a lot.

Sartner commented 7 years ago

Have the same issue. Alpine: 3.5, Docker: 1.13.1-cs2

/ # time ping -c 1 dev11
PING dev11 (10.1.100.11): 56 data bytes
64 bytes from 10.1.100.11: seq=0 ttl=63 time=0.211 ms

--- dev11 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.211/0.211/0.211 ms
real    0m 2.50s
user    0m 0.00s
sys     0m 0.00s

rawat-he commented 7 years ago

Hi,

With the latest version (3.5), I am experiencing the error below.

fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/community: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/main: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.3/main/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
  bash (missing):
    required by: world[bash]
  ca-certificates (missing):
    required by: world[ca-certificates]
  curl (missing):
    required by: world[curl]

Can anyone please help me in resolving it and moving forward

Thanks

andyshinn commented 7 years ago

The latter two comments don't sound like the same issue. This seems like a Kubernetes specific thing. Do you know if it happens to only Alpine containers or does it affect others as well? I've heard of intermittent DNS resolving issues in Kubernetes. But they were not specific to Alpine.

c24w commented 7 years ago

We're seeing slow DNS resolution in alpine:3.4 (not in Kubernetes):

$ time docker run --rm alpine:3.4 nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve    

Name:      google.com        
Address 1: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 2: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 4: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net

real    0m2.996s             
user    0m0.010s             
sys     0m0.005s

Versus Busybox:

$ time docker run --rm busybox nslookup google.com
Server:    10.108.88.10      
Address 1: 10.108.88.10      

Name:      google.com        
Address 1: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net
Address 2: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 4: 216.58.204.78 lhr25s13-in-f14.1e100.net

real    0m0.545s             
user    0m0.011s             
sys     0m0.007s

Not sure what the null error suggests, but it might be related!

Docker version 17.05.0-ce, build 89658be

mpashka commented 7 years ago

I have an issue with DNS resolving in Alpine. My /etc/resolv.conf has several search suffixes (6 of them). During DNS resolution I see that my DNS server answers only the first 6 or 7 requests (this is DNS DoS protection). But according to strace output, Alpine makes 2 requests for each search suffix.

The Ubuntu Docker image doesn't have this problem: it makes only one request for each search suffix.

So is it possible to fix this behaviour and make only 1 request to the DNS server for each search suffix? This is important because Kubernetes usually adds 3 search suffixes. So if we have more than one search suffix of our own and a DNS server that limits requests from a single IP, then we will most likely get DNS resolution problems.
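To make the request count concrete, here is a rough sketch of how a search list multiplies queries. It is a simplified model, not the code of any real resolver: the function names and the `s0.example` style suffixes are hypothetical, and it assumes every candidate is queried for both A and AAAA and that none of them resolves.

```python
def candidate_names(name: str, search: list, ndots: int = 1) -> list:
    """Names tried for `name`, in order, under a simplified search-list model."""
    if name.endswith("."):
        return [name]  # already fully qualified, so no suffixes are applied
    candidates = []
    if name.count(".") < ndots:
        # Too few dots: each search suffix is appended and tried first.
        candidates += [f"{name}.{suffix}" for suffix in search]
    candidates.append(name)
    return candidates


def count_queries(name, search, ndots=1, record_types=("A", "AAAA")) -> int:
    """Worst case: every candidate is queried once per record type."""
    return len(candidate_names(name, search, ndots)) * len(record_types)


# Six hypothetical suffixes, as in the report above.
suffixes = [f"s{i}.example" for i in range(6)]
worst_case_dual = count_queries("db", suffixes)                      # A and AAAA
worst_case_single = count_queries("db", suffixes, record_types=("A",))
```

Under these assumptions a single lookup of `db` can send 14 queries when both record types are requested and 7 when only one is, which is why a server that answers only the first 6 or 7 requests can cut the lookup short.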

justlooks commented 7 years ago

Yes, the latest Alpine image has a problem with DNS resolution. All my app images built on Alpine have the same problem on Kubernetes v1.7.0.


[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup heapster.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      heapster.kube-system
Address 1: 10.100.249.248 heapster.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup http-svc.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      http-svc.kube-system
Address 1: 10.102.217.7 http-svc.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup ftpserver-service.demo
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'ftpserver-service.demo'

mpashka commented 7 years ago

During my investigation I found that the problem is with my DNS server. Some time ago Alpine didn't support the resolv.conf options 'search' and 'domain', but that is no longer the case. They also claim that they resolve in parallel and thus results can differ, but that is not the cause in my case either. I found that Alpine makes 2 requests because one is for IPv4 (A record) and the other is for IPv6 (AAAA record).

My trouble is related to the DNS server itself. If there are several search domains in resolv.conf and the DNS server reports 'Server failure' (RCODE = 2) for some of those domains, then Alpine retries that name. If the DNS server reports 'No such name' (RCODE = 3), then Alpine continues with the next search domain. Ubuntu, on the other hand, doesn't treat 'Server failure' (RCODE = 2) as a DNS server failure and just continues with the other search domains. You can check the RCODE a DNS server returns for a specific DNS name using the command

dig @ dns_name_to_check

and check the 'status:' field: it can be NXDOMAIN (which is 'No such name', RCODE = 3) or SERVFAIL. BTW, nslookup operates in the same manner: it respects the RCODE and stops if the DNS server responds with 'Server failure' (RCODE = 2).

zq-david-wang commented 6 years ago

I tried on alpine-docker 3.7, with /etc/resolv.conf as follows:

nameserver 10.254.0.100
search  localdomain  somebaddomain
options ndots:5

My DNS server "10.254.0.100" manages its own domain 'localdomain' and forwards queries for other domains to an external DNS server. When I query google.com, the Alpine DNS client would:

try google.com.localdomain, and get an "NXDomain" response
try google.com.somebaddomain, and get a "Refused" response; after receiving a "Refused/SERVFAIL" response, the Alpine client keeps retrying "google.com.somebaddomain", resulting in the final failure.

I also tried CentOS/Ubuntu Docker images. Those DNS clients give up on the "Refused/Servfail" response, move on to the next candidate, "google.com", and get the expected response.

Is it the secure/expected behaviour to retry the same name after receiving a "Refused/Servfail" response, or is it a bug in Alpine?
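The difference between the two behaviours described above can be sketched with a toy model. This is not the code of musl or glibc: the function `resolve_with_search`, its `skip_failed_suffix` flag and the dictionary standing in for the DNS server are all hypothetical, and retries are collapsed into a single failure.

```python
NOERROR, SERVFAIL, NXDOMAIN, REFUSED = 0, 2, 3, 5


def resolve_with_search(name, search, server, skip_failed_suffix):
    """Walk the search list against `server`, a dict of candidate -> RCODE.

    Returns the first candidate answered with NOERROR, or None.
    Candidates missing from `server` are treated as NXDOMAIN.
    """
    for candidate in [f"{name}.{suffix}" for suffix in search] + [name]:
        rcode = server.get(candidate, NXDOMAIN)
        if rcode == NOERROR:
            return candidate
        if rcode == NXDOMAIN:
            continue  # definitive "no such name": try the next candidate
        if not skip_failed_suffix:
            return None  # SERVFAIL/REFUSED ends the whole lookup
        # Otherwise ignore the failure and move on to the next candidate.
    return None


# Toy server modelled on the report: one suffix is refused upstream.
server = {
    "google.com.somebaddomain": REFUSED,
    "google.com": NOERROR,
}
search = ["localdomain", "somebaddomain"]

strict = resolve_with_search("google.com", search, server, skip_failed_suffix=False)
lenient = resolve_with_search("google.com", search, server, skip_failed_suffix=True)
```

In this model the strict variant never reaches the bare `google.com`, while the lenient variant does, which matches the contrast reported between the Alpine and CentOS/Ubuntu images.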

KIVagant commented 6 years ago

We probably have the same issue. Two different containers running in the same cluster in parallel:

image with 3.5.2 works normally, AWS DNS resolves in 0.01s
image with 3.7.0 has a big lag, DNS resolution can take 5 seconds or fail entirely.

zioalex commented 6 years ago

For the DNS delay, try adding the line options single-request to resolv.conf. See https://wiki.archlinux.org/index.php/Domain_name_resolution#Hostname_lookup_delayed_with_IPv6

joshbenner commented 6 years ago

I don't think musl (which is used by Alpine) has the single-request resolver option.

zq-david-wang commented 6 years ago

I tried the following change and it seems to work. (Tried on my cluster and pushed to davidzqwang/alpine-dns:3.7)

diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index 209c20f..abb7da5 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -202,7 +202,7 @@ static int name_from_dns_search(struct address buf[static MAXADDRS], char canon[
                        memcpy(canon+l+1, p, z-p);
                        canon[z-p+1+l] = 0;
                        int cnt = name_from_dns(buf, canon, canon, family, &conf);
-                       if (cnt) return cnt;
+                       if (cnt > 0 || cnt == EAI_AGAIN) return cnt;
                }
        }

runephilosof commented 6 years ago

I have tested 3.6, 3.7 and edge, and all are affected by https://bugs.busybox.net/show_bug.cgi?id=675. Alpine 3.7 and edge use BusyBox v1.27.2 (2017-12-12 10:41:50 GMT) multi-call binary, but if I pull busybox:1.27.2 and test nslookup, it doesn't have the error. So I am not sure that just upgrading busybox will fix the issue. The busybox bug report hints that the libc in use influences the problem.

krikri90 commented 6 years ago

fetch http://mirror.ps.kz/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/main: DNS lookup error
WARNING: Ignoring APKINDEX.1b054110.tar.gz: No such file or directory
fetch http://mirror.ps.kz/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/community: DNS lookup error
WARNING: Ignoring APKINDEX.ce38122e.tar.gz: No such file or directory

I'm getting the above error. How can I fix it?

sadok-f commented 6 years ago

Hi,

We're running a couple of Docker containers on AWS EC2, with images based on Alpine 3.7. DNS resolution is very slow. Here is an example:

time nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 216.58.207.174 muc11s04-in-f14.1e100.net
Address 2: 2a00:1450:4016:80a::200e muc11s12-in-x0e.1e100.net
real    0m 2.53s
user    0m 0.00s
sys     0m 0.00s

Another test by curl cmd:

time curl https://packagist.org/packages/list.json?vendor=composer  --output list.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0     58      0 --:--:--  0:00:03 --:--:--    48
real    0m 3.61s
user    0m 0.01s
sys     0m 0.00s

Interestingly, if we pass the -4 option to curl, which forces address resolution to IPv4 only, the result is much faster, as it should be:

time curl -4 https://packagist.org/packages/list.json?vendor=composer  --output list.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0    174      0 --:--:-- --:--:-- --:--:--  1359
real    0m 0.13s
user    0m 0.01s
sys     0m 0.00s
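The same comparison can be made from application code. The sketch below is only an illustration: the helper `resolve` is hypothetical, and it relies on the usual behaviour of `getaddrinfo`, where restricting the family to `AF_INET` asks only for IPv4 addresses, much like `curl -4`.

```python
import socket
import time


def resolve(host, family=socket.AF_UNSPEC):
    """Return (sorted unique addresses, elapsed seconds) for `host`."""
    start = time.monotonic()
    infos = socket.getaddrinfo(host, None, family=family, type=socket.SOCK_STREAM)
    elapsed = time.monotonic() - start
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return addresses, elapsed
```

Calling `resolve("packagist.org")` and `resolve("packagist.org", socket.AF_INET)` inside an affected container and comparing the elapsed times should show whether the delay comes from the combined A and AAAA lookup.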

There's a workaround proposed here: https://github.com/gliderlabs/docker-alpine/issues/313#issuecomment-409872142

Is there any soonish release to fix that? Thx

bboreham commented 6 years ago

FYI @brb has found some kernel race conditions which relate to this symptom. See https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts for technical details

zhouqiang-cl commented 6 years ago

I found that if I install bind-tools, everything works: RUN apk add bind-tools

sebastianfuss commented 6 years ago

@zhouqiang-cl Unfortunately RUN apk add bind-tools does not solve my name resolution problems. I am running a container with Alpine 3.8 in AWS Fargate and I am getting errors when resolving hostnames.

EDIT: I also moved to Debian stretch slim and my DNS problems seem to be solved.

jurgenweber commented 6 years ago

I have converted a few images to Debian Jessie/Stretch slim and my DNS issues went away. Kubernetes 1.9.7 using kops in AWS. This has been bothering us for a long while.

based64god commented 6 years ago

I too am seeing issues with MUSL DNS failure on a bare-metal Kubernetes cluster. The hosts in the cluster are all Ubuntu 18.04 machines using systemd-resolved for local DNS. I can reproduce the issue @sadok-f is having. This is on a Kubernetes 1.11.3 cluster (set up using Kubeadm 1.11.3, with Weave CNI), CoreDNS 1.1.3, systemd 237 on the host. Swapping images out for Debian stretch slim fixes the issues.

jstoja commented 6 years ago

@zhouqiang-cl @sebastianfuss installing bind-tools just seems to use a statically built binary, which seems to only fix the nslookup command but not the underlying issue.

chenyongze commented 6 years ago

ERROR: tzdata-2018d-r1: temporary error (try again later)

mblaschke commented 6 years ago

Can confirm the issue running multiple alpine containers in a Kubernetes cluster. Busybox images are fine, only Alpine is affected.

swift1911 commented 6 years ago

Is there any progress on this issue? In my test, a newer musl version solves this problem.

jstoja commented 6 years ago

@swift1911 could you share with us the test you used and the version of alpine+musl that you used? That would be of tremendous help to check for a fix!

Mykolaichenko commented 6 years ago

Guys, how can we push this? It's an extremely big problem!

ncopa commented 6 years ago

Is there any way to reproduce this without using kubernetes?

Alternatively, does anyone have a tcpdump trace that shows exactly what is going on?

brb commented 6 years ago

@ncopa You can use the client and the server from https://github.com/brb/conntrack-race to reproduce the issue w/o k8s.

tecnobrat commented 5 years ago

I don't know if this will help anyone else, but we found that if we ran any Alpine-based Docker image on top of Amazon's ECS AMI, we would get a 400ms timeout in DNS resolution, and we could not find out where it was coming from.

Our resolv.conf looks like:

~ $ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /sbin/dhclient-script
search ec2.internal
nameserver 172.16.0.2

If we use an ubuntu-based image we don't have this issue:

$ sudo iptables -I FORWARD -p udp --sport 53 -j DROP
$ sudo docker run -it bash
bash-4.4# ping tugboat.info
ping: bad address 'tugboat.info'
bash-4.4# ping tecnobrat.com
ping: bad address 'tecnobrat.com'
bash-4.4# exit
exit
[status stage bstolz@ip-172-17-50-25 ~]$ sudo iptables -D FORWARD -p udp --sport 53 -j DROP

You can see from the Wireshark capture that it sends a request every 400ms instead of every 2 seconds as set in our resolv.conf.

I'm not sure what's causing it, but it's causing a lot of DNS timeouts for us.

tecnobrat commented 5 years ago

I just realized that options timeout:2 attempts:5 means: 2s = 2000ms, and 2000 / 5 = 400ms.

Is alpine using an OVERALL timeout of 2 seconds, and then attempting to accomplish 5 attempts within that 2 seconds? Instead of 2 seconds per attempt?

tecnobrat commented 5 years ago

I believe this is the case, according to https://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n111

Which means it's fundamentally different from Ubuntu and other glibc-based OSes.
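The arithmetic above can be written out as a small sketch. The two function names are hypothetical. The first mirrors the division at the res_msend.c line linked above, and the second models the other reading, in which the timeout applies to each attempt.

```python
def musl_retry_interval_ms(timeout_s: int, attempts: int) -> int:
    """Interval between resends when `timeout` is one overall budget.

    Integer division, as in the res_msend.c line referenced above.
    """
    return (timeout_s * 1000) // attempts


def per_attempt_total_ms(timeout_s: int, attempts: int) -> int:
    """Worst-case wait on one nameserver when `timeout` applies per attempt."""
    return timeout_s * 1000 * attempts


ecs_interval = musl_retry_interval_ms(2, 5)      # options timeout:2 attempts:5
default_interval = musl_retry_interval_ms(5, 2)  # assumed musl defaults
ecs_per_attempt = per_attempt_total_ms(2, 5)
```

With `timeout:2 attempts:5` the first model gives the 400 ms spacing seen in the capture, against up to 10 seconds under the per-attempt model. Assuming musl's defaults are `timeout:5` and `attempts:2`, the same division gives 2500 ms, which would line up with the roughly 2.5 second poll timeouts in the strace output at the top of this thread.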

tecnobrat commented 5 years ago

We "fixed" this in the AWS ECS AMI by simply removing the options line they add there, so that containers and the host use the defaults, which are much more sane.

johansmitsnl commented 5 years ago

@tecnobrat can you share the solution?

tecnobrat commented 5 years ago

@johansmitsnl as mentioned, I simply removed the options line out of the /etc/resolv.conf

We build our own version of the AWS ECS AMI. So that "fixed" it because it just allows every OS to use the defaults, which are much saner.

minherz commented 5 years ago

We experience transient errors when trying to resolve hosts within GCP VPC using alpine:3.7 image. We run a container with

docker run --rm --name test1 -it alpine:3.7 /bin/sh

and get an auto-generated /etc/resolv.conf file that looks like:

nameserver 10.235.117.22
# Generated by NetworkManager
search c.<project_name>.internal google.internal
nameserver 169.254.169.254

Out of 10 attempts to call nslookup <hostname>, where <hostname> is another host in the same project, we get 8 or 9 errors. We tried it with 3.8 as well, with the same results.
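A failure rate like this can also be measured from inside the container without nslookup. The following is a minimal sketch: `measure_lookups` is a hypothetical helper, and its `resolver` parameter exists only so that the loop can be tried out with a stand-in function instead of real DNS.

```python
import socket
import time


def measure_lookups(host, attempts=10, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly; return (failure count, list of durations)."""
    failures = 0
    durations = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolver(host, None)
        except socket.gaierror:
            failures += 1  # the lookup failed or timed out in the libc resolver
        durations.append(time.monotonic() - start)
    return failures, durations
```

Running `measure_lookups("<hostname>")` in both an Alpine-based and a Debian-based image on the same host gives comparable numbers for the error count and for the time spent per lookup.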

swift1911 commented 5 years ago

@jstoja In my testing in production, musl at commit de7dc1318f493184b20f7661bc12b1829b957b67 works well.

I have pushed an image to test this: https://hub.docker.com/r/swift1911/alpine

swift1911 commented 5 years ago

Is there any progress on this issue?

o-Epictetus-o commented 5 years ago

How is this still open since 02.19.17!?

sfxworks commented 5 years ago

Getting slow/unresolvable for alpine too. Just now a simple wget on alpine vs busybox shows alpine failing the lookup. Been trying to figure out why I've been having resolve issues for months. Glad I found this thread and a workaround.

brb commented 5 years ago

The kernel patches to mitigate the problem have been released in Linux kernel 5.0 (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-462252499).

minherz commented 5 years ago

does it require a new kubernetes version, a new alpine image or what?

brb commented 5 years ago

> does it require a new kubernetes version, a new alpine image or what?

Kernel of a host machine running containers needs to be updated to 5.0.

minherz commented 5 years ago

> > does it require a new kubernetes version, a new alpine image or what?
>
> Kernel of a host machine running containers needs to be updated to 5.0.

This is a hard requirement. It is especially tricky for Cloud environments like those that run the managed version of the Kubernetes. Is it possible to upgrade a specific module instead?

itskingori commented 5 years ago

For all those struggling to know what direction to take, my recommendation is move off Alpine (for now) ... this, realistically, is the low hanging fruit.

I started using Alpine when the Docker images of other distros were rather large but you'd be surprised what the sizes are now e.g. ubuntu:19.04 is 30MB (instead of 300MB+). Admittedly, this doesn't beat alpine:3.8 which is 2MB. But ... 🤷‍♂️... I guess you choose your struggle. 😅

FernandoMiguel commented 5 years ago

just for sanity

ubuntu                                                                       devel                        d6e206581aca        3 weeks ago         75.9MB
ubuntu                                                                       rolling                      09798120c134        3 weeks ago         73.7MB
alpine                                                                       3.9                          caf27325b298        4 weeks ago         5.53MB
alpine                                                                       latest                       caf27325b298        4 weeks ago         5.53MB

minherz commented 5 years ago

try to run security scan on ubuntu vs alpine and see the difference :)

itskingori commented 5 years ago

> try to run security scan on ubuntu vs alpine and see the difference :)

Fair point (I'm guessing Alpine wins here). 😅

But, as I said ... this is easier than upgrading your kernel especially in (to quote you) "cloud environments like those that run the managed version of the Kubernetes".

Also depends on what you consider unacceptable i.e. dns-resolving/networking issues are considered a high-level bug from a production point of view (at least in our case). Security in my experience is not always pragmatic.

brb commented 5 years ago

> Is it possible to upgrade a specific module instead?

@minherz yes, you can upgrade nf_conntrack and nf_nat.

arnaudlacour commented 5 years ago

What is the procedure to upgrade nf_conntrack and nf_nat?
Has it been shown to fully address this crippling issue?

jfbus commented 5 years ago

Note for Go users: the pure Go 1.13 DNS resolver will search for the use-vc (golang/go#29594) and single-request (golang/go#29661) options in resolv.conf and emulate the glibc resolver.

use-vc switches DNS resolutions from UDP to TCP, and single-request forces sequential A and AAAA queries instead of parallel queries.
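To check whether a given resolv.conf actually carries these options, a rough sketch such as the one below can help. It only detects the tokens and does not emulate what Go or glibc do with them. The function name `resolv_conf_options` and the sample file contents are assumptions made for illustration.

```python
def resolv_conf_options(text: str) -> set:
    """Collect the tokens from every `options` line of a resolv.conf."""
    options = set()
    for raw in text.splitlines():
        # Drop anything after a '#' or ';' comment marker.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if fields and fields[0] == "options":
            options.update(fields[1:])
    return options


# Hypothetical file contents, not taken from any cluster in this thread.
sample = """\
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 single-request use-vc
"""

found = resolv_conf_options(sample)
```

Here `found` contains `single-request`, `use-vc` and `ndots:5`. Whether a resolver honours them still depends on the libc or runtime in use, as discussed earlier for musl.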

gliderlabs / docker-alpine

DNS Issue #255
