gliderlabs / docker-alpine

Alpine Linux Docker image. Win at minimalism!
http://gliderlabs.viewdocs.io/docker-alpine
BSD 2-Clause "Simplified" License

nslookup fails in Alpine 3.11.3 #539

Open jgoeres opened 4 years ago

jgoeres commented 4 years ago

We just switched to Alpine 3.11.3 and now nslookup is failing for us unless we explicitly specify the DNS server IP (which is of course not an option), e.g.

foo@/#nslookup abs
Server:         127.0.0.11
Address:        127.0.0.11:53
** server can't find abs.<OUR_INTRANET_DOMAIN>.: NXDOMAIN
** server can't find abs.<OUR_INTRANET_DOMAIN>.: NXDOMAIN
[...]

versus

foo@/#nslookup abs 127.0.0.11
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:

Non-authoritative answer:
Name:   abs
Address: 172.27.0.12

Ping etc. work flawlessly. Alas, we are using nslookup in some of our start scripts to defer starting the actual application inside the container until another container shows up in DNS (because the third-party tool we are using considers a failed DNS lookup a non-recoverable error...).
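A start script of the kind described above might look roughly like the following. This is a minimal sketch, not the script from this report: the function names wait_for and resolves are hypothetical, and it assumes getent is installed in the image. It checks the name with getent hosts, which should go through the libc resolver (the path ping uses) rather than through busybox nslookup.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds.
# usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical check: does NAME resolve through the libc resolver?
# Assumes getent is available in the image.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Example (not run here): wait up to 60 attempts, 2 seconds apart,
# for "zookeeper" to appear in DNS, then start the application.
# /path/to/app is a placeholder.
# wait_for 60 2 resolves zookeeper && exec /path/to/app
```

Keeping the retry loop separate from the check makes it easy to swap in a different check later without touching the loop.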

Could this be related to enabling the nslookup feature "FEATURE_NSLOOKUP_BIG" as mentioned here: https://github.com/gliderlabs/docker-alpine/issues/476

ncopa commented 4 years ago

This is most likely related to the FEATURE_NSLOOKUP_BIG change, yes.

Does it work if you use a trailing . (dot)? E.g. nslookup abs.

It seems that nslookup will append the search domain if there are no dots in the hostname.

jgoeres commented 4 years ago

Alas, adding a dot doesn't help (the following is running on Kubernetes, not plain Docker, therefore the DNS IP is different, but the result is the same).

foo@/#nslookup abs.
Server:         10.43.0.10
Address:        10.43.0.10:53

** server can't find abs.: NXDOMAIN
** server can't find abs.: NXDOMAIN
jgoeres commented 4 years ago

I am wondering if this is now an acknowledged problem that will eventually be fixed or not. Just to summarize: on a plain docker installation nslookup works when appending a dot to the hostname of the container:

/ # nslookup zookeeper.
Server:         127.0.0.11
Address:        127.0.0.11:53
Non-authoritative answer:
Non-authoritative answer:
Name:   zookeeper
Address: 192.168.80.20

on Kubernetes it doesn't:

/ # nslookup myns-zookeeper.
Server:         10.96.0.10
Address:        10.96.0.10:53
** server can't find myns-zookeeper.: NXDOMAIN
** server can't find myns-zookeeper.: NXDOMAIN
ncopa commented 4 years ago

I am interested in fixing this, or at least reporting it upstream to the busybox bug tracker, but I am not sure what the expected response is. Apparently the Kubernetes DNS server gives a different response? Is it the same DNS server? Are there any other configs in /etc/resolv.conf?

It would be nice if we had a simple way to reproduce it, using publicly available internet servers.

What I know for sure is that "zookeeper" is not a valid hostname on the internet. Nor is it a top-level domain, so nslookup zookeeper. is sort of expected to fail.

ncopa commented 4 years ago

It would also be helpful if you could report it upstream to https://bugs.busybox.net/

ncopa commented 4 years ago

A tcpdump of the network activity would also be helpful.

jgoeres commented 4 years ago

Hi, "zookeeper" is the internal DNS name of a Kubernetes service in our product, located in the same namespace (which is why it doesn't need to be a fully qualified name). It could be the name of any K8s service in the same namespace as the pod from which you run nslookup.

This is the content of resolv.conf in a Kubernetes environment:

/ # cat /etc/resolv.conf
nameserver 10.96.0.10
search mynamespace.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Compare this to a plain Docker environment (boot2docker/Docker Toolbox):

/ # cat /etc/resolv.conf
search <my company's internal domain name here>
nameserver 10.0.2.3

On Alpine 3.11.2, when running nslookup without dot, it works (in particular, exit code is 0):

/ # nslookup zookeeper
nslookup: can't resolve '(null)': Name does not resolve

Name:      zookeeper
Address 1: 10.99.146.94 zookeeper.mynamespace.svc.cluster.local
/ # echo $?
0

Compared to Alpine 3.11.3:

/ # nslookup zookeeper
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find zookeeper.cluster.local: NXDOMAIN

Name:   zookeeper.mynamespace.svc.cluster.local
Address: 10.99.146.94

** server can't find zookeeper.cluster.local: NXDOMAIN
** server can't find zookeeper.svc.cluster.local: NXDOMAIN
** server can't find zookeeper.svc.cluster.local: NXDOMAIN

/ # echo $?
1

Observe that while in the Alpine 3.11.3 case the command apparently finds the proper IP address at some point and writes it into its output, its exit code is 1 instead of 0, and that breaks our start script.
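While the exit code cannot be relied on, one possible stopgap for a start script is to inspect the output instead. The following is only a sketch under assumptions: the function name has_answer is made up, and it relies on the output format seen in this thread (a "Name:" line for each answer), which is not a stable interface.

```shell
#!/bin/sh
# Hypothetical helper: succeed if nslookup-style output on stdin contains
# at least one "Name:" line, i.e. at least one query produced an answer.
# The "Server:" and "Address:" header lines are deliberately ignored.
has_answer() {
    awk '/^Name:/ { found = 1 } END { exit found ? 0 : 1 }'
}

# Example (not run here):
# nslookup zookeeper 2>/dev/null | has_answer && echo "zookeeper is in DNS"
```

Matching on "Name:" rather than "Address:" avoids counting the address of the DNS server itself, which is printed in the header of every run.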

Now with an attached dot, it fails in both 3.11.2 and 3.11.3, with slightly different output:

Alpine 3.11.2

/ # nslookup zookeeper.
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'zookeeper.': Try again
/ # echo $?
1

Alpine 3.11.3

/ # nslookup zookeeper.
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find zookeeper.: NXDOMAIN
** server can't find zookeeper.: NXDOMAIN

/ # echo $?
1

However, this is to be expected, as - AFAIK - adding a dot makes this a fully qualified name, so no lookup relative to the local search domains is performed, and so it has to fail.

This is the result of tcpdump when running nslookup on 3.11.3 (without trailing dot):

13:41:17.319069 IP foo-7f8cfbddd4-g8jv2.34512 > kube-dns.kube-system.svc.cluster.local.53: 9450+ A? zookeeper.mynamespace.svc.cluster.local. (57)
13:41:17.319213 IP foo-7f8cfbddd4-g8jv2.34512 > kube-dns.kube-system.svc.cluster.local.53: 10823+ A? zookeeper.svc.cluster.local. (51)
13:41:17.319228 IP foo-7f8cfbddd4-g8jv2.34512 > kube-dns.kube-system.svc.cluster.local.53: 12073+ A? zookeeper.cluster.local. (47)
13:41:17.319289 IP foo-7f8cfbddd4-g8jv2.34512 > kube-dns.kube-system.svc.cluster.local.53: 13317+ AAAA? zookeeper.mynamespace.svc.cluster.local. (57)
13:41:17.319321 IP foo-7f8cfbddd4-g8jv2.34512 > kube-dns.kube-system.svc.cluster.local.53: 19907+ AAAA? zookeeper.svc.cluster.local. (51)
13:41:17.319329 IP foo-7f8cfbddd4-g8jv2.34512 > kube-dns.kube-system.svc.cluster.local.53: 21050+ AAAA? zookeeper.cluster.local. (47)
13:41:17.319499 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.34512: 19907 NXDomain*- 0/1/0 (144)
13:41:17.319885 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.34512: 9450*- 1/0/0 A 10.99.146.94 (112)
13:41:17.320040 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.34512: 10823 NXDomain*- 0/1/0 (144)
13:41:17.320142 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.34512: 12073 NXDomain*- 0/1/0 (140)
13:41:17.320270 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.34512: 13317*- 0/1/0 (150)
13:41:17.320412 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.34512: 21050 NXDomain*- 0/1/0 (140)
13:41:17.866646 IP kube-dns.kube-system.svc.cluster.local.53 > foo-7f8cfbddd4-g8jv2.47457: 30617 ServFail- 0/0/0 (41)

And this is the result for 3.11.2 (again, without the trailing dot):

13:44:05.480641 IP foo2-6df94fb567-g84fc.41771 > kube-dns.kube-system.svc.cluster.local.53: 48903+ A? zookeeper.mynamespace.svc.cluster.local. (57)
13:44:05.480680 IP foo2-6df94fb567-g84fc.41771 > kube-dns.kube-system.svc.cluster.local.53: 49401+ AAAA? zookeeper.mynamespace.svc.cluster.local. (57)
13:44:05.481085 IP kube-dns.kube-system.svc.cluster.local.53 > foo2-6df94fb567-g84fc.41771: 49401*- 0/1/0 (150)
13:44:05.481216 IP kube-dns.kube-system.svc.cluster.local.53 > foo2-6df94fb567-g84fc.41771: 48903*- 1/0/0 A 10.99.146.94 (112)
13:44:05.481516 IP foo2-6df94fb567-g84fc.37388 > kube-dns.kube-system.svc.cluster.local.53: 42577+ PTR? 94.146.99.10.in-addr.arpa. (43)
13:44:05.481864 IP kube-dns.kube-system.svc.cluster.local.53 > foo2-6df94fb567-g84fc.37388: 42577*- 1/0/0 PTR zookeeper.mynamespace.svc.cluster.local. (121)
jgoeres commented 4 years ago

The problem seems to be the additional search domains - if one of them fails, the command is considered failed.

If I remove the extra domains from resolv.conf and only leave

nameserver 10.96.0.10
search mynamespace.svc.cluster.local
options ndots:5

it works:

/ # nslookup zookeeper
Server:         10.96.0.10
Address:        10.96.0.10:53

Name:   zookeeper.mynamespace.svc.cluster.local
Address: 10.99.146.94

/ # echo $?
0
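To see which names are generated for a short name, the search list can be expanded by hand. The sketch below mirrors only the expansion step, not how the resolver treats the individual results. The function name search_candidates and its file argument are illustrative, and it assumes the file contains a single search line.

```shell
#!/bin/sh
# Hypothetical helper: print NAME combined with every domain on the
# "search" line of a resolv.conf-style file (default: /etc/resolv.conf).
search_candidates() {
    name=$1
    conf=${2:-/etc/resolv.conf}
    awk -v name="$name" '
        $1 == "search" {
            for (i = 2; i <= NF; i++) print name "." $i
        }
    ' "$conf"
}

# Example (not run here):
# search_candidates zookeeper
```

With the three-domain Kubernetes resolv.conf shown earlier in this thread, this prints three candidates, of which only the one in the pod's own namespace can be expected to resolve.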
ncopa commented 4 years ago

The problem seems to be the additional search domains - if one of them fails, the command is considered failed.

That is what I suspected. Thank you for confirming that. We should have enough info to be able to fix this thing.

Next step will be to report it to busybox developers. https://bugs.busybox.net/

I am sorry that I have not had time to prioritize this, but I believe we will be able to have a fix for this for 3.11.4.

Thanks!

jgoeres commented 4 years ago

Just to clarify - should we report this to busybox devs (mainly by pointing them to this issue), or will you?

ncopa commented 4 years ago

Just to clarify - should we report this to busybox devs (mainly by pointing them to this issue), or will you?

I was hoping you could help me with that, while I work on a fix ;)

Thanks!

ncopa commented 4 years ago

I have reported it upstream: https://bugs.busybox.net/show_bug.cgi?id=12541

ncopa commented 4 years ago

I have pushed a fix to alpine edge. Can you please test if it solves your issue? Use alpine:edge and do apk upgrade -U -a to get busybox-1.31.1-r10.

jgoeres commented 4 years ago

I tested it, on Kubernetes it works as expected:

$ kubectl run foo -i -t --image=alpine:edge --rm=true -- sh
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
/ # apk upgrade -U -a
fetch http://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
(1/2) Upgrading busybox (1.31.1-r9 -> 1.31.1-r10)
Executing busybox-1.31.1-r10.post-upgrade
(2/2) Upgrading ssl_client (1.31.1-r9 -> 1.31.1-r10)
Executing busybox-1.31.1-r10.trigger
OK: 6 MiB in 14 packages
/ # nslookup mynamespace-zookeeper
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find mynamespace-zookeeper.cluster.local: NXDOMAIN

Name:   mynamespace-zookeeper.mynamespace.svc.cluster.local
Address: 10.101.67.109

** server can't find mynamespace-zookeeper.svc.cluster.local: NXDOMAIN

** server can't find mynamespace-zookeeper.cluster.local: NXDOMAIN

** server can't find mynamespace-zookeeper.svc.cluster.local: NXDOMAIN

/ # echo $?
0

The IP gets resolved against one of the domains found in /etc/resolv.conf and the exit code is 0.

Alas, it still doesn't work in plain Docker environments as it did before, unless I append a dot:

[myself@docker01 ~]$ docker run -it --rm --network mydockernetwork alpine:edge sh
/ # apk upgrade -U -a
fetch http://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
(1/2) Upgrading busybox (1.31.1-r9 -> 1.31.1-r10)
Executing busybox-1.31.1-r10.post-upgrade
(2/2) Upgrading ssl_client (1.31.1-r9 -> 1.31.1-r10)
Executing busybox-1.31.1-r10.trigger
OK: 6 MiB in 14 packages
/ # ping zookeeper
PING zookeeper (192.168.192.21): 56 data bytes
64 bytes from 192.168.192.21: seq=0 ttl=64 time=0.418 ms
64 bytes from 192.168.192.21: seq=1 ttl=64 time=0.218 ms
64 bytes from 192.168.192.21: seq=2 ttl=64 time=0.232 ms
^C
--- zookeeper ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.218/0.289/0.418 ms
/ # nslookup zookeeper
Server:         127.0.0.11
Address:        127.0.0.11:53

** server can't find zookeeper.<mycompany.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.ame.<mycompany3.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.<mycompany3.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.<mycompany2.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.<mycompany3.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.<mycompany2.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.<mycompany3.internaldomain.com>.: NXDOMAIN

** server can't find zookeeper.<mycompany3.internaldomain.com>.: NXDOMAIN

/ # echo $?
1
/ # nslookup zookeeper.
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   zookeeper
Address: 192.168.192.21

/ # cat /etc/resolv.conf
search <mycompany2.internaldomain.com>. ame.<mycompany3.internaldomain.com>. <mycompany3.internaldomain.com>. <mycompany3.internaldomain.com>.
nameserver 127.0.0.11
options ndots:0
/ #

As you can see, "Ping" can resolve the name just fine, while nslookup without an appended dot fails. Also see the content of resolv.conf.

BretFisher commented 4 years ago

I can confirm it still doesn't work in alpine:edge for docker as of 3/5/2020.

alpine@sha256:13d22f83f248957d0a553f14154d5f3fd413b6c0c595ebb094b0e12cbac71797

How I reproduced:

$ docker network create mynet
ac5d340dc87a0833ba86926cbeb50cc68bb98ed35d5dc9b01ab28a27e9c5b95b

$ docker run -d --network mynet --name website nginx
de6f0284a2a071d891f499bd4485535ec391fcd7dc9fef3bc1010a3cba3d384d

$ docker run --rm --network mynet alpine:edge nslookup website
Server:         127.0.0.11
Address:        127.0.0.11:53

** server can't find website.51ur3jppi0eupdptvsj42kdvgc.bx.internal.cloudapp.net: NXDOMAIN
** server can't find website.51ur3jppi0eupdptvsj42kdvgc.bx.internal.cloudapp.net: NXDOMAIN

Works in alpine:3.11.2

$ docker run --rm --network mynet alpine:3.11.2 nslookup website
nslookup: can't resolve '(null)': Name does not resolve

Name:      website
Address 1: 172.20.0.2 website.mynet
ncopa commented 4 years ago

Works with latest alpine:edge for me:

$ docker run --rm --network mynet alpine:edge nslookup website
Server:     127.0.0.11
Address:    127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   website
Address: 172.19.0.2
ncopa commented 4 years ago

As you can see, "Ping" can resolve the name just fine, while nslookup without an appended dot fails. Also see the content of resolv.conf.

@jgoeres can you please test with latest edge and latest stable 3.11.5 and compare with nslookup from the bind-tools package (e.g. apk add bind-tools)?

weibeld commented 4 years ago

@ncopa I tested with edge, 3.11.5 and 3.11.2 on Kubernetes and compared with nslookup from the bind-tools package:

alpine:edge

/ # nslookup conncheck-service
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find conncheck-service.svc.cluster.local: NXDOMAIN

Name:   conncheck-service.conncheck.svc.cluster.local
Address: 10.111.127.249

** server can't find conncheck-service.svc.cluster.local: NXDOMAIN

** server can't find conncheck-service.cluster.local: NXDOMAIN

** server can't find conncheck-service.cluster.local: NXDOMAIN

** server can't find conncheck-service.eu-central-1.compute.internal: NXDOMAIN

** server can't find conncheck-service.eu-central-1.compute.internal: NXDOMAIN

/ # echo $?
0

Long output, exit code is 0 if at least one of the queries succeeds (desired behaviour).

alpine:edge with nslookup from bind-tools

/ # nslookup conncheck-service
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   conncheck-service.conncheck.svc.cluster.local
Address: 10.111.127.249

/ # echo $?
0

alpine:3.11.5

/ # nslookup conncheck-service
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find conncheck-service.cluster.local: NXDOMAIN

Name:   conncheck-service.conncheck.svc.cluster.local
Address: 10.111.127.249

** server can't find conncheck-service.svc.cluster.local: NXDOMAIN

** server can't find conncheck-service.svc.cluster.local: NXDOMAIN

** server can't find conncheck-service.cluster.local: NXDOMAIN

** server can't find conncheck-service.eu-central-1.compute.internal: NXDOMAIN

** server can't find conncheck-service.eu-central-1.compute.internal: NXDOMAIN

/ # echo $?
1

Long output, exit code is 1 if any of the queries fails (undesired behaviour).

alpine:3.11.5 with nslookup from bind-tools

/ # nslookup conncheck-service
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   conncheck-service.conncheck.svc.cluster.local
Address: 10.111.127.249

/ # echo $?
0

alpine:3.11.2

/ # nslookup conncheck-service
nslookup: can't resolve '(null)': Name does not resolve

Name:      conncheck-service
Address 1: 10.111.127.249 conncheck-service.conncheck.svc.cluster.local
/ # echo $?
0

Short output, exit code 0 if at least one of the queries succeeds (desired behaviour).

alpine:3.11.2 with nslookup from bind-tools

/ # nslookup conncheck-service
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   conncheck-service.conncheck.svc.cluster.local
Address: 10.111.127.249

/ # echo $?
0
shaun-earsom commented 4 years ago

We're now on Alpine 3.12.0 if you grab alpine:latest. It looks like nslookup is working fine. So this "ticket" should be closed.

SnorreSelmer commented 4 years ago

We're now on Alpine 3.12.0 if you grab alpine:latest. It looks like nslookup is working fine. So this "ticket" should be closed.

user@server:~$ docker container run --rm --net dnsrr alpine nslookup search
Server:         127.0.0.11
Address:        127.0.0.11:53

** server can't find search.u01: NXDOMAIN

** server can't find search.u01: NXDOMAIN

user@server:~$ docker container run --rm --net dnsrr alpine nslookup search.
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   search
Address: 172.18.0.3
Name:   search
Address: 172.18.0.2

This is on alpine:latest

exanup commented 3 years ago

Working for me totally fine with alpine:latest.

$ docker run --rm --name alpine -it --network net alpine nslookup web
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   web
Address: 172.20.0.2
freimer commented 3 years ago

Fails for me. Interesting that alpine 3.11.2 and 3.11.3 both say busybox is the same version (1.31.1). And, they are the same file size. However, the 3.11.2 one has a date of Dec 18, 2019, while the 3.11.3 one has a date of Jan 15, 2020. And, the sha256 hash is different. The only dynamic library is libc.musl-x86_64.so.1, and they have the same date and hash. Copying the busybox from 3.11.2 to 3.11.3 makes it work. The APK for busybox is 1.31.1-r8 on 3.11.2 and 1.31.1-r9 on 3.11.3.

What's the diff between r8 and r9?:

diff --git a/main/busybox/busyboxconfig b/main/busybox/busyboxconfig
index 63dd9c6e7f..53e00e266f 100644
--- a/main/busybox/busyboxconfig
+++ b/main/busybox/busyboxconfig
@@ -925,8 +925,8 @@ CONFIG_NETSTAT=y
 CONFIG_FEATURE_NETSTAT_WIDE=y
 CONFIG_FEATURE_NETSTAT_PRG=y
 CONFIG_NSLOOKUP=y
-# CONFIG_FEATURE_NSLOOKUP_BIG is not set
-# CONFIG_FEATURE_NSLOOKUP_LONG_OPTIONS is not set
+CONFIG_FEATURE_NSLOOKUP_BIG=y
+CONFIG_FEATURE_NSLOOKUP_LONG_OPTIONS=y
 CONFIG_NTPD=y
 CONFIG_FEATURE_NTPD_SERVER=y
 CONFIG_FEATURE_NTPD_CONF=y

I'm not an expert on busybox, but it looks like these are compile-time options, so we can't "fix" this by a configuration change. The best option may be for the maintainer of the Alpine Linux package to revert this change. Basically what the change does is turn on the internal busybox resolver, rather than using the standard library. If there is an issue with the busybox resolver code, then of course that should be fixed. However, it was a change in the Alpine Linux package that turned this feature on and "broke" it. Can we get it turned back off? An ltrace of the r8 and r9 versions clearly shows the r8 calling the standard library resolver, where r9 does not. It also shows the r9 version (with the internal busybox resolver) string comparing for domain, search, and nameserver keywords in the resolv.conf, but not options. Look at the busybox source file for nslookup.c. It has no ability to parse options, and hence ndots. Please revert this.

Oh, and it is also broken in alpine:latest, which uses busybox-1.31.1-r19, ltrace shows the same behavior.

willzgli commented 3 years ago

We're now on Alpine 3.12.0 if you grab alpine:latest. It looks like nslookup is working fine. So this "ticket" should be closed.

user@server:~$ docker container run --rm --net dnsrr alpine nslookup search
Server:         127.0.0.11
Address:        127.0.0.11:53

** server can't find search.u01: NXDOMAIN

** server can't find search.u01: NXDOMAIN

user@server:~$ docker container run --rm --net dnsrr alpine nslookup search.
Server:         127.0.0.11
Address:        127.0.0.11:53

Non-authoritative answer:
Non-authoritative answer:
Name:   search
Address: 172.18.0.3
Name:   search
Address: 172.18.0.2

This is on alpine:latest

But I found that nslookup in both alpine:3.12 and alpine:3.12.0 doesn't use the search option when the number of dots in the name is greater than or equal to 1. In the test pod, the content of /etc/resolv.conf is shown below. The platform is linux/amd64.

nameserver 172.16.253.163
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

For example:

nslookup google.com

[screenshot]

As shown above, only a request for "google.com." is sent to the DNS pod.

nslookup google

[screenshot]

I am confused: which version of Alpine or busybox actually handles the search option correctly?

verdel commented 3 years ago

But I found that nslookup in both alpine:3.12 and alpine:3.12.0 doesn't use the search option when the number of dots in the name is greater than or equal to 1. In the test pod, the content of /etc/resolv.conf is shown below. The platform is linux/amd64.

I found the reason for this behavior. Alpine version 3.11.2 uses busybox version 1.31.1-r8; Alpine version 3.11.3 uses busybox version 1.31.1-r9.

In Alpine 3.11.3, the busybox package maintainers enabled the compile flags CONFIG_FEATURE_NSLOOKUP_BIG and CONFIG_FEATURE_NSLOOKUP_LONG_OPTIONS (https://github.com/alpinelinux/aports/commit/e5c984f68aabb28de623a7e3ada5a223c2b66d77).

This changes the implementation of nslookup. With this option disabled, nslookup calls the getaddrinfo() function from the musl system library, and that function takes the ndots option from /etc/resolv.conf into account.

If the compile option is enabled, busybox uses its own internal mechanism for obtaining the IP address from the DNS name.

I will not fully describe the entire mechanism. In the process of preparing the request, the add_query_with_search() function is used.

busybox/networking/nslookup.c

static void add_query_with_search(int type, const char *dname)
{
    char *s;

    if (type == T_PTR || !G.search || strchr(dname, '.')) {
        add_query(type, dname);
        return;
    }

    s = G.search;
    for (;;) {
        char *fullname, *e;

        e = skip_non_whitespace(s);
        fullname = xasprintf("%s.%.*s", dname, (int)(e - s), s);
        add_query(type, fullname);
        s = skip_whitespace(e);
        if (!*s)
            break;
    }
}

If there is at least one dot in the domain name passed for lookup, the domains specified in the search option in /etc/resolv.conf are not added:

    if (type == T_PTR || !G.search || strchr(dname, '.')) {
        add_query(type, dname);
        return;
    }

If there are no dots, then the domains specified in the search option are added.

That's the whole secret. The busybox developers do not take the ndots option from /etc/resolv.conf into account in their internal implementation of DNS name resolution. The algorithm behaves as if ndots were always 1.
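The difference between the two rules can be modelled in a few lines of shell. This is a simplified sketch, not code from busybox or musl: the function names are hypothetical, the busybox model ignores the PTR and empty-search cases, and the ndots rule shown (use the search list only when the name has fewer than ndots dots and no trailing dot) is an assumption about conventional resolver behaviour whose details vary between libc implementations.

```shell
#!/bin/sh
# Count the dots in a name.
count_dots() {
    dots_only=$(printf '%s' "$1" | tr -cd '.')
    echo "${#dots_only}"
}

# Model of the dot check in the quoted add_query_with_search():
# the search list is used only when the name contains no dot at all.
busybox_uses_search() {
    [ "$(count_dots "$1")" -eq 0 ]
}

# Simplified model of an ndots-aware resolver (an assumption, not taken
# from any libc source): the search list is used when the name has fewer
# than NDOTS dots and does not end with a dot. NDOTS defaults to 1.
ndots_uses_search() {
    name=$1
    ndots=${2:-1}
    case $name in
        *.) return 1 ;;
    esac
    [ "$(count_dots "$name")" -lt "$ndots" ]
}
```

Under this model, with ndots:5 a name such as google.com (one dot) still goes through the search list, whereas the busybox rule queries it as-is, which is consistent with the single "google.com." request described above.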

smlx commented 3 years ago

As @verdel explains very clearly, this problem is due to the buggy busybox handling of search domains in resolv.conf.

Since the change made in https://github.com/gliderlabs/docker-alpine/issues/476 / https://github.com/alpinelinux/aports/commit/e5c984f68aabb28de623a7e3ada5a223c2b66d77 seems to be only solving a cosmetic issue, can we get that change reverted?

Yes, you get a can't resolve '(null)' message but functionally that doesn't matter - at least nslookup actually resolves the name.

Freundschaft commented 3 years ago

This problem is also present on alpine:3.13 as of today. We are unable to run curl, wget, etc against github.com since dns lookup fails.

Could this also be related to this issue (https://github.com/Requarks/wiki/discussions/3238)?

leonboot commented 3 years ago

I am experiencing this issue since alpine:3.13 as well, albeit under different circumstances than those mentioned here, but likely related nonetheless. I'm using Docker on my development machine through Dinghy. It seems the resolving mechanism doesn't play nice with its DNS server (which is used to resolve *.docker addresses to its own IP address and forward all other queries to the host's resolver). Running an nslookup actually returns the IP address of the requested hostname, but ends with an NXDOMAIN error. Here are my findings:

The biggest issue is that commands such as apk add [package] fail, because the APK repository hostname cannot be resolved. I've tried the following command with Alpine images from 3.9 to 3.13:

docker run --rm -ti alpine:3.11 sh -c 'apk --no-cache add curl && curl -I https://www.google.com/'

Up to 3.12, the output is as follows:

fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20191127-r4)
(2/4) Installing nghttp2-libs (1.41.0-r0)
(3/4) Installing libcurl (7.69.1-r3)
(4/4) Installing curl (7.69.1-r3)
Executing busybox-1.31.1-r19.trigger
Executing ca-certificates-20191127-r4.trigger
OK: 7 MiB in 18 packages
HTTP/2 200 
[...]

The 3.13 image, however, produces the following output:

fetch https://dl-cdn.alpinelinux.org/alpine/v3.13/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.13/main: DNS lookup error
fetch https://dl-cdn.alpinelinux.org/alpine/v3.13/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.13/community: DNS lookup error
ERROR: unable to select packages:
  curl (no such package):
    required by: world[curl]

The errors are very specific to my installation, I've tried these commands on an Ubuntu based Docker installation without any issues. Perhaps the cause of the issues @gaby is having is related. Could it be the resolving mechanism has issues with certain resolvers? Could it be an IPv6 issue? There's an issue over on the Dinghy repository that might be related.

sdwerwed commented 3 years ago

Today, 2 April 2021, I can also confirm this issue with nslookup (ping is able to resolve the DNS name).

Alpine images 3.14.0_alpha20210212 and 3.13.4, nslookup kubernetes.default: [screenshot]
Debian, nslookup kubernetes.default: [screenshot]

thotypous commented 3 years ago

On Kubernetes, nslookup works fine for me, but every other software fails to resolve DNS:

[screenshot]

This is both on latest and on edge.

Removing the search line from /etc/resolv.conf "solves" the issue.

project-administrator commented 3 years ago

Also, removing options ndots:5 (or setting its value to 1) helps.
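To check which ndots value a container is actually running with, something like the following can be used. It is a sketch: the function name get_ndots is hypothetical, and it falls back to 1 when no ndots option is present, which is the documented resolver default.

```shell
#!/bin/sh
# Hypothetical helper: print the ndots value from a resolv.conf-style file
# (default: /etc/resolv.conf), or 1 if no ndots option is set.
get_ndots() {
    conf=${1:-/etc/resolv.conf}
    awk '
        $1 == "options" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ndots:[0-9]+$/) { sub(/^ndots:/, "", $i); n = $i }
        }
        END { print (n == "" ? 1 : n) }
    ' "$conf"
}

# Example (not run here):
# echo "ndots is $(get_ndots)"
```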

aliask commented 3 years ago

I also ran into this issue on Alpine 3.14.2 - nslookup worked fine, but curl, apk, ping etc would fail. Some of my nameservers were returning NXDOMAIN because my VPN forces non-VPN DNS queries to respond NXDOMAIN instead of blocking the request. Removing the public nameservers from resolv.conf "fixed" the issue, but this is frustrating because I can't maintain a resolv.conf with VPN and regular DNS servers with graceful fallback.

from-nibly commented 2 years ago

Seeing this issue on 3.16.0 while on a VPN.

resolv.conf looks like this:

# Generated by resolvconf
search <my-company1>.com <my-company2>.com lan
nameserver <internal-ip-on-vpn>
nameserver <internal-ip-on-vpn>
nameserver <my-router>
options edns0

When using nslookup everything works fine.

However when curling an internal company domain I get one successful call, then the rest are failures. Unless I wait a minute or so and try again. It's very strange.

VinceCui commented 2 years ago

Any update? Is this problem so difficult to solve? Alpine's image is small and light, our team likes it, but this problem confuses us.

wilbit commented 1 year ago

I also ran into this issue on Alpine 3.14.2 - nslookup worked fine, but curl, apk, ping etc would fail. Some of my nameservers were returning NXDOMAIN because my VPN forces non-VPN DNS queries to respond NXDOMAIN instead of blocking the request. Removing the public nameservers from resolv.conf "fixed" the issue, but this is frustrating because I can't maintain a resolv.conf with VPN and regular DNS servers with graceful fallback.

I've run into a similar issue. My Docker gitlab-runner is connected to GitLab via VPN. ping, nslookup and (more important to me) ssh cannot resolve a domain name.

$ cat /etc/resolv.conf
# DNS requests are forwarded to the host. DHCP DNS options are ignored.
nameserver 192.168.65.7

alpine:3.13.0 and later do not work for me. The latest version that works for me is alpine:3.12.12. The errors look like this:

$ ping -c 4 $SSH_HOST
ping: bad address '<hidden>.local'
$ nslookup $SSH_HOST
Server:     192.168.65.7
Address:    192.168.65.7:53
Non-authoritative answer:
Name:   <hidden>.local
Address: 172.16.1.1
*** Can't find <hidden>: No answer
$ ssh -v $SSH_USER@$SSH_HOST "echo '!!!done!!!'"
OpenSSH_9.1p1, OpenSSL 3.0.8 7 Feb 2023
debug1: Reading configuration data /etc/ssh/ssh_config
ssh: Could not resolve hostname <hidden>.local: Try again
TheDevilDan commented 6 months ago

When I add bind-tools, everything works:

apk add bind-tools

nslookup works fine. Is this a workaround? Is it only a busybox bug?
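Because the same command name can be served by two implementations, it can help to check which one a container is actually running. The sketch below assumes that busybox applets are installed as symlinks to a binary named busybox, which is how Alpine images normally ship them. The function name is_busybox_applet is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if PATH_TO_COMMAND resolves to a file
# named "busybox", i.e. the command is a busybox applet symlink.
is_busybox_applet() {
    target=$(readlink -f "$1") || return 1
    [ "$(basename "$target")" = "busybox" ]
}

# Example (not run here):
# if is_busybox_applet "$(command -v nslookup)"; then
#     echo "nslookup is the busybox applet"
# else
#     echo "nslookup is a standalone binary (for example from bind-tools)"
# fi
```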

gaby commented 6 months ago

@TheDevilDan This was fixed in recent alpine releases. Forgot which version

TheDevilDan commented 6 months ago

I use the latest 8.1-fpm-alpine, and the names are not expanded to their full domains when I query (exit code 1):

MyServer# kubectl exec -n mynamespace -it AlpineBindUtils -- nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

MyServer# kubectl exec -n mynamespace -it AlpineWithoutBindUtils -- nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find kubernetes.default: NXDOMAIN

** server can't find kubernetes.default: NXDOMAIN

command terminated with exit code 1
gaby commented 6 months ago

That bug was fixed in 3.18, and your image is based on that according to Docker Hub

TheDevilDan commented 6 months ago

Very strange. I have the problem in all Alpine pods with busybox inside. I tested with Traefik V3.0 RC5 and got the same problem: I have to install bind-tools, and it works perfectly after that.

MyServer# kubectl exec -n mynamespace -it traefik-7b595bc5d6-5kcmm -- /bin/sh

/ # nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find kubernetes.default: NXDOMAIN

** server can't find kubernetes.default: NXDOMAIN

/ # apk list --installed
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.19/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.19/community: No such file or directory
alpine-baselayout-3.4.3-r2 x86_64 {alpine-baselayout} (GPL-2.0-only) [installed]
alpine-baselayout-data-3.4.3-r2 x86_64 {alpine-baselayout} (GPL-2.0-only) [installed]
alpine-keys-2.4-r1 x86_64 {alpine-keys} (MIT) [installed]
apk-tools-2.14.0-r5 x86_64 {apk-tools} (GPL-2.0-only) [installed]
busybox-1.36.1-r15 x86_64 {busybox} (GPL-2.0-only) [installed]
busybox-binsh-1.36.1-r15 x86_64 {busybox} (GPL-2.0-only) [installed]
ca-certificates-20230506-r0 x86_64 {ca-certificates} (MPL-2.0 AND MIT) [installed]
ca-certificates-bundle-20230506-r0 x86_64 {ca-certificates} (MPL-2.0 AND MIT) [installed]
libc-utils-0.7.2-r5 x86_64 {libc-dev} (BSD-2-Clause AND BSD-3-Clause) [installed]
libcrypto3-3.1.4-r5 x86_64 {openssl} (Apache-2.0) [installed]
libssl3-3.1.4-r5 x86_64 {openssl} (Apache-2.0) [installed]
musl-1.2.4_git20230717-r4 x86_64 {musl} (MIT) [installed]
musl-utils-1.2.4_git20230717-r4 x86_64 {musl} (MIT AND BSD-2-Clause AND GPL-2.0-or-later) [installed]
scanelf-1.3.7-r2 x86_64 {pax-utils} (GPL-2.0-only) [installed]
ssl_client-1.36.1-r15 x86_64 {busybox} (GPL-2.0-only) [installed]
tzdata-2024a-r0 x86_64 {tzdata} (Public-Domain) [installed]
zlib-1.3.1-r0 x86_64 {zlib} (Zlib) [installed]

/ # apk add bind-tools
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/14) Installing fstrm (0.6.1-r4)
(2/14) Installing krb5-conf (1.0-r2)
(3/14) Installing libcom_err (1.47.0-r5)
(4/14) Installing keyutils-libs (1.6.3-r3)
(5/14) Installing libverto (0.3.2-r2)
(6/14) Installing krb5-libs (1.21.2-r0)
(7/14) Installing json-c (0.17-r0)
(8/14) Installing nghttp2-libs (1.58.0-r0)
(9/14) Installing protobuf-c (1.4.1-r7)
(10/14) Installing libuv (1.47.0-r0)
(11/14) Installing xz-libs (5.4.5-r0)
(12/14) Installing libxml2 (2.11.7-r0)
(13/14) Installing bind-libs (9.18.24-r1)
(14/14) Installing bind-tools (9.18.24-r1)
Executing busybox-1.36.1-r15.trigger
OK: 18 MiB in 31 packages

/ # nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1