rancher-sandbox / rancher-desktop

Container Management and Kubernetes on the Desktop
https://rancherdesktop.io
Apache License 2.0

DNS resolution inside containers randomly not working #1557

Closed A-Shevchenko closed 2 years ago

A-Shevchenko commented 2 years ago

Rancher Desktop Version

1.0.1

Rancher Desktop K8s Version

1.22.6

Which container runtime are you using?

moby (docker cli)

What operating system are you using?

Windows

Operating System / Build Version

Windows 10 Pro 1909

What CPU architecture are you using?

x64

Linux only: what package format did you use to install Rancher Desktop?

No response

Windows User Only

No response

Actual Behavior

DNS resolution inside containers randomly stops working. The first time it happened was a week ago, but after a restart it was fine. It then stayed fine for a few days, but today I can't get it working even after a few restarts.

Steps to Reproduce

> docker run --rm tutum/dnsutils nslookup api.github.com
;; connection timed out; no servers could be reached

Result

connection timed out; no servers could be reached

Expected Behavior

Successful resolution like:

Non-authoritative answer:
Name:   api.github.com
Address: 140.82.121.5
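
Because the failure is intermittent, a single nslookup run says little about how often it occurs. The following is a minimal sketch, assuming Python is available in the environment under test, that repeats the lookup and reports the fraction of failed attempts. The names resolve_once and failure_rate are hypothetical helpers, not part of Rancher Desktop, and caching in the system resolver may hide some failures.

```python
import socket
from typing import Callable, List


def resolve_once(hostname: str) -> List[str]:
    """Resolve hostname through the system resolver; return unique IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def failure_rate(hostname: str, attempts: int = 20,
                 resolver: Callable[[str], List[str]] = resolve_once) -> float:
    """Return the fraction of attempts (0.0 to 1.0) that failed to resolve."""
    failures = 0
    for _ in range(attempts):
        try:
            resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures += 1
    return failures / attempts
```

Calling failure_rate("api.github.com") inside a container and again on the host would show whether the failures are specific to the container network.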

Additional Information

No response

evertonlperes commented 2 years ago

@A-Shevchenko Thanks for filing the issue. Just out of curiosity, are you running this inside or outside of a WSL distro?

Also, could you provide some logs from RD? You can get all the logs by navigating to the Troubleshooting tab and clicking Show Logs. Feel free to pack them into a zip file.

A-Shevchenko commented 2 years ago

@evertonlperes I'm running it outside of WSL. Logs are attached (1557.zip); let me know if you want me to enable debug mode.

Nino-K commented 2 years ago

@A-Shevchenko thanks for sharing the logs. Unfortunately, there are some known networking/DNS issues with WSL 2 which can cause unpredictable network behavior. We have implemented a process that will update the DNS configuration on the WSL. This process selects the most suitable DNS servers based on the interface metrics (the most preferred interface). Are you able to test out our new release which will be out shortly to see if that addresses this issue?
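
As a rough illustration of the selection described above, here is a minimal sketch that picks the DNS servers of the most preferred interface. It assumes that a lower interface metric means a more preferred interface, as on Windows. The Interface type and preferred_dns function are hypothetical and are not taken from the Rancher Desktop source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interface:
    """Hypothetical view of a host network interface."""
    name: str
    metric: int  # assumption: lower metric = more preferred
    dns_servers: List[str] = field(default_factory=list)


def preferred_dns(interfaces: List[Interface]) -> List[str]:
    """Return the DNS servers of the lowest-metric interface that has any."""
    candidates = [iface for iface in interfaces if iface.dns_servers]
    if not candidates:
        return []
    best = min(candidates, key=lambda iface: iface.metric)
    return list(best.dns_servers)
```

Interfaces without any configured DNS server are skipped, so a preferred but unconfigured adapter does not produce an empty result.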

Meanwhile, can you please forward us the output of the following:

Feel free to redact any confidential or corporate-specific information.

Nino-K commented 2 years ago

@A-Shevchenko can you please upgrade to 1.1.1 to see if it eliminates the issue for you?

A-Shevchenko commented 2 years ago

@Nino-K yes, sorry, due to the war in my country I temporarily lost access to the PC where I could do that. Now it's restored, and I'll try that in the next few days.

Nino-K commented 2 years ago

@A-Shevchenko sorry to hear that. We are here to support you, take care!

A-Shevchenko commented 2 years ago

@Nino-K unfortunately, still the same:

> docker run --rm tutum/dnsutils nslookup api.github.com
;; connection timed out; no servers could be reached

resolv.conf:

$ cat /etc/resolv.conf
# This file was automatically generated by WSL. To stop automatic generation of this file, add the following entry to /etc/wsl.conf:
# [network]
# generateResolvConf = false
nameserver 172.17.166.129

route table:

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.166.129  0.0.0.0         UG    0      0        0 eth0
10.42.0.0       0.0.0.0         255.255.255.0   U     0      0        0 cni0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.17.166.128  0.0.0.0         255.255.255.240 U     0      0        0 eth0

vladonemo commented 2 years ago

I believe I know why this happens. I'm using my router's DNS server (192.168.0.1). This is the output of the dig command from one of my WSL distros (Ubuntu): dig api.github.com @192.168.0.1

;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50053
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 8

;; QUESTION SECTION:
;api.github.com.                        IN      A

;; ANSWER SECTION:
api.github.com.         60      IN      A       140.82.121.6

;; AUTHORITY SECTION:
github.com.             15300   IN      NS      dns1.p08.nsone.net.
github.com.             15300   IN      NS      dns4.p08.nsone.net.
github.com.             15300   IN      NS      dns2.p08.nsone.net.
github.com.             15300   IN      NS      ns-421.awsdns-52.com.
github.com.             15300   IN      NS      ns-520.awsdns-01.net.
github.com.             15300   IN      NS      dns3.p08.nsone.net.
github.com.             15300   IN      NS      ns-1707.awsdns-21.co.uk.
github.com.             15300   IN      NS      ns-1283.awsdns-32.org.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.     35329   IN      A       198.51.44.8
dns4.p08.nsone.net.     35914   IN      A       198.51.45.72
dns2.p08.nsone.net.     35329   IN      A       198.51.45.8
ns-421.awsdns-52.com.   116055  IN      A       205.251.193.165
ns-520.awsdns-01.net.   131224  IN      A       205.251.194.8
dns3.p08.nsone.net.     35329   IN      A       198.51.44.72
ns-1707.awsdns-21.co.uk. 121520 IN      A       205.251.198.171
ns-1283.awsdns-32.org.  122657  IN      A       205.251.197.3

;; Query time: 30 msec
;; SERVER: 192.168.0.1#53(192.168.0.1)
;; WHEN: Tue Apr 26 08:42:36 CEST 2022
;; MSG SIZE  rcvd: 399

Note that just one A entry is returned, along with eight NS entries. The additional section contains an A entry for each of the name servers.

Now let's do the same from within a container (for instance, kubectl run iputils --rm -it --image arunvelsriram/utils -- /bin/sh):

; <<>> DiG 9.11.3-1ubuntu1.14-Ubuntu <<>> api.github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2087
;; flags: qr rd ra; QUERY: 1, ANSWER: 9, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 25238b63bed3747a (echoed)
;; QUESTION SECTION:
;api.github.com.                        IN      A

;; ANSWER SECTION:
ns-421.awsdns-52.com.   5       IN      A       205.251.193.165
ns-1707.awsdns-21.co.uk. 5      IN      A       205.251.198.171
api.github.com.         5       IN      A       140.82.121.5
dns4.p08.nsone.net.     5       IN      A       198.51.45.72
dns2.p08.nsone.net.     5       IN      A       198.51.45.8
ns-1283.awsdns-32.org.  5       IN      A       205.251.197.3
dns1.p08.nsone.net.     5       IN      A       198.51.44.8
ns-520.awsdns-01.net.   5       IN      A       205.251.194.8
dns3.p08.nsone.net.     5       IN      A       198.51.44.72

;; Query time: 4 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Tue Apr 26 06:41:52 UTC 2022
;; MSG SIZE  rcvd: 369

See, the output is wrong! It only contains A entries - for both the actual domain (api.github.com) and NSes. The A entries are randomly sorted. Sometimes the correct one comes to the top - then the DNS resolution works OK.
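
To make the problem concrete, the following minimal sketch parses dig-style answer lines and keeps only the A records whose owner name matches the queried name. The helper names parse_answer_lines and addresses_for are hypothetical, the sample lines are copied from the output above, and CNAME chains are deliberately ignored. A client that instead takes the first A record in the answer section regardless of its name would, if the diagnosis above is right, often end up with a name server's address.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    name: str
    ttl: int
    rtype: str
    value: str


def parse_answer_lines(lines: Iterable[str]) -> List[Record]:
    """Parse dig answer lines of the form: name ttl class type value."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) == 5 and parts[2] == "IN":
            records.append(Record(parts[0], int(parts[1]), parts[3], parts[4]))
    return records


def addresses_for(qname: str, records: Iterable[Record]) -> List[str]:
    """Return the A record values whose owner name equals the queried name."""
    wanted = qname.rstrip(".").lower()
    return [r.value for r in records
            if r.rtype == "A" and r.name.rstrip(".").lower() == wanted]


# First three answer lines from the in-container dig output above.
SAMPLE = [
    "ns-421.awsdns-52.com.   5       IN      A       205.251.193.165",
    "ns-1707.awsdns-21.co.uk. 5      IN      A       205.251.198.171",
    "api.github.com.         5       IN      A       140.82.121.5",
]

print(addresses_for("api.github.com", parse_answer_lines(SAMPLE)))
```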

I'm actually not sure why this is happening or which component is responsible for it (CoreDNS?). A simple workaround is to use a DNS server that doesn't return the additional section, for instance Google's or Cloudflare's. But with the current changes I'm not sure how to achieve this.

vladonemo commented 2 years ago

... ok, one possible workaround is to set the DNS server of your host network driver. This keeps the DNS resolution in WSL the same (relying on the dnsmasq-generate script to generate resolv.conf and the dnsmasq conf as usual). After doing so:

dig api.github.com

; <<>> DiG 9.11.3-1ubuntu1.14-Ubuntu <<>> api.github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45465
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: b0aa497cb72d8feb (echoed)
;; QUESTION SECTION:
;api.github.com.                        IN      A

;; ANSWER SECTION:
api.github.com.         5       IN      A       140.82.121.5

;; Query time: 36 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Tue Apr 26 07:25:04 UTC 2022
;; MSG SIZE  rcvd: 85

Unfortunately, this has a negative impact on DNS resolution performance. The latency of a public DNS server (e.g. 8.8.8.8) is definitely worse than that of the router's DNS. Sure, I can move this DNS setting to the router itself (to use 8.8.8.8 as the upstream, rather than the one I have at the moment), but on a cache miss it will still have a negative impact on DNS resolution for all devices on the LAN.

Nino-K commented 2 years ago

@A-Shevchenko and @vladonemo can you please give this feature a try?

vladonemo commented 2 years ago

@Nino-K doesn't work for me. The wsl.log:

2022-04-29T06:38:19.135Z: Launching background process host-resolver vsock host.
2022-04-29T06:38:29.207Z: Background process host-resolver vsock host exited with status 1 signal null
2022-04-29T06:38:29.207Z: Background process host-resolver vsock host will restart.

The host-resolver.log:

Error: Listen, could not determine VM GUID: could not find vsock-peer process on any hyper-v VM(s)
Usage:
  host-resolver vsock-host [flags]

Flags:
  -c, --built-in-hosts stringToString   List of built-in CNAMEs to IPv4, IPv6 or IPv4-mapped IPv6 in host.rancherdesktop.io=111.111.111.111 format. (default [])
  -h, --help                            help for vsock-host
  -6, --ipv6                            Enable IPv6 address family.
  -s, --upstream-servers stringArray    List of IP addresses for upstream DNS servers. 

The vsock-peer is not running in the distro:

~ # /bin/rc-status
Runlevel: default
 rancher-desktop-guestagent                                                                    [  started 00:17:35 (0) ]
 crond                                                                                                [  unsupervised  ]
 host-resolver                                                                                                [ failed ]
Dynamic Runlevel: hotplugged
Dynamic Runlevel: needed/wanted
 docker                                                                                               [  unsupervised  ]
 cri-dockerd                                                                                          [  unsupervised  ]
Dynamic Runlevel: manual
 k3s                                                                                           [  started 00:17:23 (0) ]
 host-resolver                                                                                                [ failed ]
 local                                                                                                     [  started  ]

vladonemo commented 2 years ago

@Nino-K - after changing the log file name to something else, the actual issue with vsock peer startup is:

 * supervise-daemon: failed to exec `/mnt/c/Work/Playground/rancher-desktop/resources/linux/internal/host-resolver': No such file or directory

vladonemo commented 2 years ago

@Nino-K - got it working. Somehow the host-resolver executable was downloaded incorrectly. I downloaded it manually and stored it in the resources, and now the vsock peer is able to start OK. This is the output:

Error: Listen, could not determine VM GUID: could not find vsock-peer process on any hyper-v VM(s)
Usage:
  host-resolver vsock-host [flags]

Flags:
  -c, --built-in-hosts stringToString   List of built-in CNAMEs to IPv4, IPv6 or IPv4-mapped IPv6 in host.rancherdesktop.io=111.111.111.111 format. (default [])
  -h, --help                            help for vsock-host
  -6, --ipv6                            Enable IPv6 address family.
  -s, --upstream-servers stringArray    List of IP addresses for upstream DNS servers.

time="2022-04-29T11:24:14+02:00" level=info msg="successfully estabilished a handshake with a peer: c59a5f3b-6695-43c9-a61a-629cb618e88c"
time="2022-04-29T11:24:14+02:00" level=warning msg="failed to detect system DNS, falling back to [8.8.8.8 1.1.1.1]" error="open /etc/resolv.conf: The system cannot find the path specified."
time="2022-04-29T11:24:14+02:00" level=info msg="Started vsock-host srv &{udp:<nil> tcp:0xc0000e46c0}"

The missing /etc/resolv.conf confuses me a little bit, because it must exist in the distro. It is created just shortly before the host-resolver peer service starts.

Nevertheless, the DNS resolution works OK now. Kudos ;)
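
The fallback reported in the warning above can be sketched as follows. This is an illustration only, not the host-resolver code (which is not shown in this thread): it assumes the usual resolv.conf "nameserver" line format and reuses the two fallback addresses from the log line. The names parse_nameservers and system_dns are hypothetical, and falling back when the file exists but lists no nameserver is an added assumption.

```python
from typing import List

# Fallback addresses as printed in the log line above.
FALLBACK_SERVERS = ["8.8.8.8", "1.1.1.1"]


def parse_nameservers(text: str) -> List[str]:
    """Extract the addresses from 'nameserver <ip>' lines, skipping comments."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def system_dns(path: str = "/etc/resolv.conf") -> List[str]:
    """Return the configured nameservers, or the fallback if none can be read."""
    try:
        with open(path, encoding="utf-8") as fh:
            servers = parse_nameservers(fh.read())
    except OSError:
        # The file is missing or unreadable, e.g. when looked up on Windows.
        return list(FALLBACK_SERVERS)
    # Assumption: an empty nameserver list also triggers the fallback.
    return servers or list(FALLBACK_SERVERS)
```

Run against the resolv.conf shown earlier in this thread, parse_nameservers would return the single address 172.17.166.129.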

Nino-K commented 2 years ago

> The missing /etc/resolv.conf confuses me a little bit, because it must exist in the distro. It is created just shortly before the host-resolver peer service starts.

@vladonemo it is the vsock-host process that looks for /etc/resolv.conf on the Windows side, which is why you see the error. I agree it is confusing, and I'm trying to clean up the underlying code to eliminate this kind of misleading log output. One main reason for the confusion was that vsock-host and vsock-peer were both writing to the same log file, as you already figured out. I had a PR for this issue, but somehow it missed our current release. I will go ahead and close mine and use yours instead. Thank you for your contribution. :)

Also, did you try this feature with our latest release? I got a bit confused where you mentioned

> Somehow the host-resolver executable was downloaded wrong.

Many thanks

vladonemo commented 2 years ago

@Nino-K

> Also, did you try this feature with our latest release?

I actually built and ran it from the source code. Will clean the tree and try again when I'm back at my PC

Thank you for the explanation.

vladonemo commented 2 years ago

ok, started fresh and it all works as expected. Nice one ;)

Nino-K commented 2 years ago

I will go ahead and close this issue, please feel free to reopen it if needed.