MosheMoradSimgo opened this issue 7 years ago
Have the same issue. Alpine: 3.5, Docker: 1.13.1-cs2
/ # time ping -c 1 dev11
PING dev11 (10.1.100.11): 56 data bytes
64 bytes from 10.1.100.11: seq=0 ttl=63 time=0.211 ms
--- dev11 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.211/0.211/0.211 ms
real 0m 2.50s
user 0m 0.00s
sys 0m 0.00s
Hi,
With the latest version (3.5), I am experiencing the error below.
fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/community: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/main: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.3/main/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
bash (missing):
required by: world[bash]
ca-certificates (missing):
required by: world[ca-certificates]
curl (missing):
required by: world[curl]
Can anyone please help me resolve this so I can move forward?
Thanks
The latter two comments don't sound like the same issue. This seems like a Kubernetes-specific thing. Do you know if it happens only to Alpine containers, or does it affect others as well? I've heard of intermittent DNS resolving issues in Kubernetes, but they were not specific to Alpine.
We're seeing slow DNS resolution in alpine:3.4 (not in Kubernetes):
$ time docker run --rm alpine:3.4 nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: google.com
Address 1: 216.58.204.78 lhr25s13-in-f78.1e100.net
Address 2: 216.58.204.78 lhr25s13-in-f78.1e100.net
Address 3: 216.58.204.78 lhr25s13-in-f78.1e100.net
Address 4: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net
real 0m2.996s
user 0m0.010s
sys 0m0.005s
Versus Busybox:
$ time docker run --rm busybox nslookup google.com
Server: 10.108.88.10
Address 1: 10.108.88.10
Name: google.com
Address 1: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net
Address 2: 216.58.204.78 lhr25s13-in-f14.1e100.net
Address 3: 216.58.204.78 lhr25s13-in-f14.1e100.net
Address 4: 216.58.204.78 lhr25s13-in-f14.1e100.net
real 0m0.545s
user 0m0.011s
sys 0m0.007s
Not sure what the '(null)' error suggests, but it might be related!
Docker version 17.05.0-ce, build 89658be
I have an issue with DNS resolution in Alpine. My /etc/resolv.conf has several search suffixes (6 of them), and during resolution I see that my DNS server answers only the first 6 or 7 requests (this is DNS DoS protection). But according to strace output, Alpine makes 2 requests for each search suffix.
The Ubuntu Docker image doesn't have this problem: it makes only one request per search suffix.
So is it possible to fix this behaviour and make only 1 request to the DNS server per domain name suffix? This is important because Kubernetes usually adds 3 search suffixes, so if we have more than one search suffix of our own and a DNS server that rate-limits requests from a single IP, we will most likely hit DNS resolution problems.
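The query fan-out described above can be sketched in a few lines. This is a hypothetical illustration (the suffix names and the rate limit are made up for the example), not musl's actual code: for each search suffix the resolver issues both an A and an AAAA query, so the request count is suffixes × 2.

```python
# Hypothetical model of musl-style query fan-out: every search suffix
# produces two queries (A for IPv4, AAAA for IPv6).

def queries_for(name, search_suffixes):
    """Return the (qname, qtype) pairs a musl-style resolver would send
    for an unqualified name with the given search list."""
    return [(f"{name}.{suffix}", qtype)
            for suffix in search_suffixes
            for qtype in ("A", "AAAA")]

# Six search suffixes, as in the report above (placeholder names):
search = [f"domain{i}.example" for i in range(1, 7)]
qs = queries_for("myservice", search)

# 6 suffixes x 2 record types = 12 queries -- well past a server that
# rate-limits a client to the first 6-7 requests.
print(len(qs))  # -> 12
```

This is why a rate limit that looks generous for one-query-per-suffix resolvers (like the Ubuntu behavior described above) still breaks under musl's A+AAAA pairs.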
Yes, the latest Alpine image has a problem with DNS resolution; all my app images built on Alpine have the same problem on Kubernetes v1.7.0.
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo -- nslookup heapster.kube-system
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: heapster.kube-system
Address 1: 10.100.249.248 heapster.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo -- nslookup http-svc.kube-system
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: http-svc.kube-system
Address 1: 10.102.217.7 http-svc.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo -- nslookup ftpserver-service.demo
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'ftpserver-service.demo'
During my investigation I found that I have a problem with my DNS server. Some time ago Alpine didn't support the resolv.conf options 'search' and 'domain', but that is not the case now. They also state that they resolve in parallel and thus results can differ, but that is not my issue either. I found that Alpine makes 2 requests because one is for IPv4 (an A record) and the other for IPv6 (an AAAA record).

My trouble is related to the DNS server itself. If there are several search domains in resolv.conf and for some of those domains the DNS server reports 'Server failure' (RCODE = 2), then Alpine retries that name. If the DNS server reports 'No such name' (RCODE = 3), Alpine continues with the next search domain. Ubuntu, on the other hand, doesn't treat 'Server failure' (RCODE = 2) as a fatal DNS server failure and just continues to query the other search domains.

You can check the DNS server's RCODE for a specific name with a tool such as dig and inspect the 'status:' field: it can be NXDOMAIN (which is 'No such name', RCODE = 3) or SERVFAIL. BTW, nslookup operates in the same manner: it respects the RCODE and stops if the DNS server responds 'Server failure' (RCODE = 2).
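To make the RCODE distinction concrete, here is a minimal sketch (not musl's or glibc's actual code) of where RCODE lives in a DNS response: it is the low 4 bits of the flags word in the 12-byte header defined by RFC 1035.

```python
# Minimal DNS-header sketch: RCODE is the low nibble of the flags word.
# SERVFAIL (2) vs NXDOMAIN (3) is exactly the distinction that makes
# musl stop while glibc moves on to the next search domain.
import struct

SERVFAIL, NXDOMAIN = 2, 3

def rcode(response: bytes) -> int:
    """Extract RCODE from a raw DNS response packet."""
    _id, flags = struct.unpack_from("!HH", response, 0)
    return flags & 0x000F

def make_reply(rc: int) -> bytes:
    """Build a bare 12-byte DNS header with QR=1 and the given RCODE
    (query ID and counts are placeholder values)."""
    flags = 0x8000 | rc          # QR bit set, RCODE in the low nibble
    return struct.pack("!HHHHHH", 0x1234, flags, 1, 0, 0, 0)

assert rcode(make_reply(SERVFAIL)) == SERVFAIL   # musl: treat as server failure
assert rcode(make_reply(NXDOMAIN)) == NXDOMAIN   # both: try next search domain
```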
I tried on alpine-docker 3.7, with /etc/resolv.conf as follow:
nameserver 10.254.0.100
search localdomain somebaddomain
options ndots:5
My DNS server "10.254.0.100" manages its own domain 'localdomain' while forwarding queries for other domains to an external DNS server. Then when I query google.com, the Alpine DNS client keeps retrying the same query after the "Refused/Servfail" response.
I also tried the centos/ubuntu Docker images; those DNS clients give up on the "Refused/Servfail" responses, keep going with the next trial of "google.com", and get the expected response.
Is retrying the same DNS server after receiving a "Refused/Servfail" response the secure/expected reaction, or is it a bug in Alpine?
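The behavioral difference observed above can be sketched as a tiny decision function. This is a hypothetical model of the two resolvers' reactions as reported in this thread, not either libc's actual code:

```python
# Hypothetical model of how the two resolvers react to a response code
# while walking the search list. RCODEs: NOERROR = 0, SERVFAIL = 2,
# NXDOMAIN = 3.

def next_action(rc: int, resolver: str) -> str:
    """Return what the resolver does after receiving RCODE `rc`
    for the current candidate name."""
    if rc == 0:
        return "done"                 # got an answer
    if rc == 2:                       # SERVFAIL / Refused-style failure
        if resolver == "musl":
            return "retry same name"  # Alpine behavior observed above
        return "try next suffix"      # glibc/Ubuntu behavior observed above
    return "try next suffix"          # NXDOMAIN etc.: move down the list

assert next_action(2, "musl") == "retry same name"
assert next_action(2, "glibc") == "try next suffix"
```

Under this model, a single search domain whose zone answers SERVFAIL is enough to stall the whole lookup on musl while glibc sails past it.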
We got probably the same issue. Two different containers running in the same cluster in parallel:
For the DNS delay, try adding the line:
options single-request
to resolv.conf.
See https://wiki.archlinux.org/index.php/Domain_name_resolution#Hostname_lookup_delayed_with_IPv6
I don't think musl (which is used by Alpine) has the single-request resolver option.
I tried the following change and it seems to work. (Tested on my cluster and pushed to davidzqwang/alpine-dns:3.7.)
diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index 209c20f..abb7da5 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -202,7 +202,7 @@ static int name_from_dns_search(struct address buf[static MAXADDRS], char canon[
memcpy(canon+l+1, p, z-p);
canon[z-p+1+l] = 0;
int cnt = name_from_dns(buf, canon, canon, family, &conf);
- if (cnt) return cnt;
+ if (cnt > 0 || cnt == EAI_AGAIN) return cnt;
}
}
I have tested 3.6, 3.7 and edge and all are affected by https://bugs.busybox.net/show_bug.cgi?id=675.
Alpine 3.7 and edge use BusyBox v1.27.2 (2017-12-12 10:41:50 GMT) multi-call binary, but if I pull busybox:1.27.2 and test nslookup, it doesn't have the error.
So I am not sure if just upgrading busybox will fix the issue.
The busybox bug report hints that the libc in use will influence the problem.
fetch http://mirror.ps.kz/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/main: DNS lookup error
WARNING: Ignoring APKINDEX.1b054110.tar.gz: No such file or directory
fetch http://mirror.ps.kz/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/community: DNS lookup error
WARNING: Ignoring APKINDEX.ce38122e.tar.gz: No such file or directory
I am getting the error above. How can I fix it?
Hi,
We're running a couple of Docker containers on AWS EC2; the images are based on Alpine 3.7. DNS resolution is very slow, here's an example:
time nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: google.com
Address 1: 216.58.207.174 muc11s04-in-f14.1e100.net
Address 2: 2a00:1450:4016:80a::200e muc11s12-in-x0e.1e100.net
real 0m 2.53s
user 0m 0.00s
sys 0m 0.00s
Another test, with the curl command:
time curl https://packagist.org/packages/list.json?vendor=composer --output list.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0     58      0 --:--:--  0:00:03 --:--:--    48
real 0m 3.61s
user 0m 0.01s
sys 0m 0.00s
Interestingly, if we pass curl the -4 option, which forces resolution to IPv4, the result is much faster, as it should be:
time curl -4 https://packagist.org/packages/list.json?vendor=composer --output list.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0    174      0 --:--:-- --:--:-- --:--:--  1359
real 0m 0.13s
user 0m 0.01s
sys 0m 0.00s
There's a workaround proposed here: https://github.com/gliderlabs/docker-alpine/issues/313#issuecomment-409872142
Is there any soonish release to fix that? Thx
FYI @brb has found some kernel race conditions which relate to this symptom. See https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts for technical details
I found that if I install bind-tools, everything works:
RUN apk add bind-tools
@zhouqiang-cl
Unfortunately RUN apk add bind-tools does not solve my name resolution problems. I am running a container with Alpine 3.8 in AWS Fargate and I am getting errors when resolving hostnames.
EDIT: I also moved to Debian stretch slim and my DNS problems seem to be solved.
I have converted a few images to Debian Jessie/Stretch slim and my DNS issues went away. Kubernetes 1.9.7 using kops in AWS. This has been bothering us for a long while.
I too am seeing issues with the musl DNS failure on a bare-metal Kubernetes cluster. The hosts in the cluster are all Ubuntu 18.04 machines using systemd-resolved for local DNS. I can reproduce the issue @sadok-f is having. This is a Kubernetes 1.11.3 cluster (set up using kubeadm 1.11.3, with the Weave CNI), CoreDNS 1.1.3, systemd 237 on the host. Swapping images out for Debian stretch slim fixes the issues.
@zhouqiang-cl @sebastianfuss installing bind-tools just pulls in a statically built binary; it seems to fix only the nslookup command, not the underlying issue.
ERROR: tzdata-2018d-r1: temporary error (try again later)
Can confirm the issue running multiple alpine containers in a Kubernetes cluster. Busybox images are fine, only Alpine is affected.
Is there any progress on this issue? In my testing, a newer musl version solves this problem.
@swift1911 could you share with us the test you used and the version of alpine+musl that you used? That would be of tremendous help in checking for a fix!
How can we push this forward? It's an extremely big problem!
Is there any way to reproduce this without using kubernetes?
Alternatively, does anyone have a tcpdump trace that shows exactly what is going on?
@ncopa You can use the client and the server from https://github.com/brb/conntrack-race to reproduce the issue w/o k8s.
I don't know if this will help anyone else, but we found that if we ran any Alpine-based Docker image on top of Amazon's ECS AMI, we would get a 400ms retry interval in DNS resolution, and we couldn't find out where it was coming from.
Our resolv.conf looks like:
~ $ cat /etc/resolv.conf
options timeout:2 attempts:5
; generated by /sbin/dhclient-script
search ec2.internal
nameserver 172.16.0.2
If we use an ubuntu-based image we don't have this issue:
$ sudo iptables -I FORWARD -p udp --sport 53 -j DROP
$ sudo docker run -it bash
bash-4.4# ping tugboat.info
ping: bad address 'tugboat.info'
bash-4.4# ping tecnobrat.com
ping: bad address 'tecnobrat.com'
bash-4.4# exit
exit
[status stage bstolz@ip-172-17-50-25 ~]$ sudo iptables -D FORWARD -p udp --sport 53 -j DROP
You can see from the Wireshark capture that it sends a request every 400ms instead of every 2 seconds as set in our resolv.conf.
I'm not sure what's causing it, but it's causing a lot of DNS timeouts for us.
I just realized that options timeout:2 attempts:5 explains it:
2s = 2000ms
2000 / 5 = 400ms
Is Alpine using an OVERALL timeout of 2 seconds and then trying to fit 5 attempts within those 2 seconds, instead of 2 seconds per attempt?
I believe this is the case, according to https://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n111
Which means it's fundamentally different from Ubuntu and other glibc-based OSes.
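The arithmetic above can be written out explicitly. This is a sketch of the two timeout models as described in this thread (musl spreading the total timeout across attempts, per the res_msend.c link above, versus glibc's per-attempt timeout), using the resolv.conf values from the capture:

```python
# Two timeout models for `options timeout:2 attempts:5`.

def musl_retry_interval_ms(timeout_s: int, attempts: int) -> int:
    # musl treats `timeout` as an overall budget and spaces its
    # retransmissions evenly across it.
    return timeout_s * 1000 // attempts

def glibc_retry_interval_ms(timeout_s: int, attempts: int) -> int:
    # glibc waits the full `timeout` before each retransmission.
    return timeout_s * 1000

print(musl_retry_interval_ms(2, 5))   # -> 400  (the 400ms seen in Wireshark)
print(glibc_retry_interval_ms(2, 5))  # -> 2000
```

So the same resolv.conf line yields a 5x faster retransmission rate on musl, which is what made the 400ms spacing so hard to trace back to its source.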
We "fixed" this in the AWS ECS AMI by simply removing the options line they add there, so that containers and the host use the defaults, which are much more sane.
@tecnobrat can you share the solution?
@johansmitsnl as mentioned, I simply removed the options line from /etc/resolv.conf.
We build our own version of the AWS ECS AMI, so that "fixed" it because it just lets every OS use the defaults, which are much saner.
We experience transient errors when trying to resolve hosts within a GCP VPC using the alpine:3.7 image. We run a container with
docker run --rm --name test1 -it alpine:3.7 /bin/sh
and the auto-generated /etc/resolv.conf file looks like:
nameserver 10.235.117.22
# Generated by NetworkManager
search c.<project_name>.internal google.internal
nameserver 169.254.169.254
Out of 10 attempts to call nslookup <hostname>
where
@jstoja In my testing in production, musl at commit de7dc1318f493184b20f7661bc12b1829b957b67 works well.
I have pushed an image to test this: https://hub.docker.com/r/swift1911/alpine
Is there any progress on this issue?
How is this still open since 02.19.17!?
Getting slow/failed lookups with Alpine too. Just now a simple wget on Alpine vs BusyBox shows Alpine failing the lookup. I've been trying to figure out why I've had resolution issues for months. Glad I found this thread and a workaround.
The kernel patches to mitigate the problem have been released in Linux kernel 5.0 (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-462252499).
does it require a new kubernetes version, a new alpine image or what?
does it require a new kubernetes version, a new alpine image or what?
Kernel of a host machine running containers needs to be updated to 5.0.
does it require a new kubernetes version, a new alpine image or what?
Kernel of a host machine running containers needs to be updated to 5.0.
This is a hard requirement. It is especially tricky for cloud environments like those that run managed versions of Kubernetes. Is it possible to upgrade a specific module instead?
For all those struggling to know what direction to take, my recommendation is to move off Alpine (for now) ... realistically, this is the low-hanging fruit.
I started using Alpine when the Docker images of other distros were rather large, but you'd be surprised what the sizes are now, e.g. ubuntu:19.04 is 30MB (instead of 300MB+). Admittedly, this doesn't beat alpine:3.8, which is 2MB. But ... 🤷♂️ ... I guess you choose your struggle. 😅
Just for sanity:
REPOSITORY   TAG       IMAGE ID       CREATED       SIZE
ubuntu       devel     d6e206581aca   3 weeks ago   75.9MB
ubuntu       rolling   09798120c134   3 weeks ago   73.7MB
alpine       3.9       caf27325b298   4 weeks ago   5.53MB
alpine       latest    caf27325b298   4 weeks ago   5.53MB
try to run security scan on ubuntu vs alpine and see the difference :)
try to run security scan on ubuntu vs alpine and see the difference :)
Fair point (I'm guessing Alpine wins here). 😅
But, as I said ... this is easier than upgrading your kernel, especially in (to quote you) "cloud environments like those that run the managed version of the Kubernetes".
It also depends on what you consider unacceptable; DNS-resolution/networking issues are considered a high-severity bug from a production point of view (at least in our case). Security, in my experience, is not always pragmatic.
Is it possible to upgrade a specific module instead?
@minherz yes, you can upgrade the nf_conntrack and nf_nat kernel modules.
Note for Go users: the pure Go 1.13 DNS resolver will search for the use-vc (golang/go#29594) and single-request (golang/go#29661) options in resolv.conf and emulate the glibc resolver.
use-vc switches DNS resolutions from UDP to TCP, and single-request forces sequential A and AAAA queries instead of parallel queries.
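To make that concrete, here is a sketch of a resolv.conf that a pure-Go 1.13+ program would honor (the nameserver and search values are placeholders, not taken from this thread; note that musl itself still ignores both options, as mentioned earlier):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options single-request use-vc
```

With this in place, the Go resolver sends the A and AAAA queries sequentially (single-request) and over TCP (use-vc), mimicking the glibc behavior those options were designed for.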
Hi,
We are running Alpine (3.4) in a Docker container on a Kubernetes cluster (GCP).
We have been seeing some anomalies where our thread is stuck for 2.5 s. After some research using strace, we saw that DNS resolution gets timed out once in a while.
Here are some examples:
And a good example:
In the past we already had some DNS resolution issues with an older version (3.3), which were resolved when we moved to 3.4 (or so we thought).
Is this a known issue? Does anybody have a solution / workaround / suggestion what to do?
Thanks a lot.