minoca / os

Minoca operating system
Other
2.71k stars 232 forks source link

DNS recursion loop in wget #86

Closed vascocosta closed 7 years ago

vascocosta commented 7 years ago

This could be a coincidence or some issue with my DNS servers, but after the latest commits I've started noticing a high frequency of "DNS recursion loop" errors in wget.

In QEMU it doesn't seem too serious and it eventually resolves, on the other hand it sends wget into an infinite loop in VirtualBox. This is seemingly triggered by websites using https redirects.

I wonder if this could in away be related to the recent TCP fix by @ccstevens, but maybe I'm totally off here.

On a related note, I wonder if there's any way to manually override the DNS servers provided by the DHCP server as this would be handy to debug this issue. From what I've seen the logic is hardcoded in the kernel and I'm not sure there's a way to expose it to the user land.

evangreen commented 7 years ago

I was monkeying around with DNS lately to add support for SOA records. I probably broke this. I'm downloading VirtualBox now. Can you give me an example of a site that goes crazy? Also if it's easy would you be able to compile a version of libc with ClDebugDns (in apps/libc/dynamic/gaddrinf.c) set to TRUE, and send me the output from the wget request?

vascocosta commented 7 years ago

I basically tried random URLs, one that always seems to make VirtualBox go crazy is wget http://linux.com/.

I'll let you know about libc compiled with ClDebugDns set to true...

evangreen commented 7 years ago

Yeah I think I need your output. This is what I see:

~# wget http://linux.com/
--2016-12-12 17:25:48--  http://linux.com/
Resolving linux.com... DNS: Lookup 'linux.com'.
No responses
Out of DNS servers to try.
DNS: Lookup 'linux.com'.
A linux.com 151.101.129.5 Expires Mon Dec 12 19:57:37 2016
A linux.com 151.101.193.5 Expires Mon Dec 12 19:57:37 2016
A linux.com 151.101.1.5 Expires Mon Dec 12 19:57:37 2016
A linux.com 151.101.65.5 Expires Mon Dec 12 19:57:37 2016
DNS: Found 4 results.
DNS: Final Results:
A linux.com 151.101.129.5 Expires Mon Dec 12 19:57:37 2016
A linux.com 151.101.193.5 Expires Mon Dec 12 19:57:37 2016
A linux.com 151.101.1.5 Expires Mon Dec 12 19:57:37 2016
A linux.com 151.101.65.5 Expires Mon Dec 12 19:57:37 2016
getaddrinfo: Name linux.com Service (null): No error.
151.101.129.5 Flags 0 SockType 4 Protocol 6 Addrlen 32.
151.101.193.5 Flags 0 SockType 4 Protocol 6 Addrlen 32.
151.101.1.5 Flags 0 SockType 4 Protocol 6 Addrlen 32.
151.101.65.5 Flags 0 SockType 4 Protocol 6 Addrlen 32.
151.101.129.5, 151.101.193.5, 151.101.1.5, ...
Connecting to linux.com|151.101.129.5|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://linux.com/ [following]
--2016-12-12 17:25:48--  https://linux.com/
vascocosta commented 7 years ago

This is what I see with ClDebugDns (looks like it ends up using a null value as DNS):

~# wget http://linux.com/
--2016-12-13 02:06:25--  http://linux.com/
Resolving linux.com... DNS: Lookup 'linux.com'.
SOA linux.com ns1.linux-foundation.org Expires Tue Dec 13 02:32:05 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:04 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:34:56 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:34:15 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:34:55 2016
Trying name server 'ns1.linux-foundation.org'.
Error: DNS recursion loop.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Trying name server '(null)'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:05 2016
Trying name server 'ns1.linux-foundation.org'.
Error: DNS recursion loop.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Out of DNS servers to try.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Trying name server '(null)'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:06 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:05 2016
Trying name server 'ns1.linux-foundation.org'.
Error: DNS recursion loop.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Trying name server '(null)'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:32:05 2016
Trying name server 'ns1.linux-foundation.org'.
Error: DNS recursion loop.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Out of DNS servers to try.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Out of DNS servers to try.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Trying name server '(null)'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:34:57 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:34:16 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:36:24 2016
Trying name server 'ns1.linux-foundation.org'.
Error: DNS recursion loop.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Trying name server '(null)'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Tue Dec 13 02:34:55 2016
Trying name server 'ns1.linux-foundation.org'.
Error: DNS recursion loop.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Out of DNS servers to try.
Error: Failed to get address of DNS server ns1.linux-foundation.org
Trying name server '(null)'.
vascocosta commented 7 years ago

I wonder if this is related, but this started happening when my router switched to IPv6 on its wlan interface. My ISP often goes back and forth between IPv4 and IPv6 addresses.

Here's the lookup failing, as expected, when all it gets is an IPv6 address:

wget http://namecheap.com/
--2016-12-13 02:41:15--  http://namecheap.com/
Resolving namecheap.com... DNS: Lookup 'namecheap.com'.
SOA namecheap.com a1.verisigndns.com Expires Tue Dec 13 03:11:14 2016
Trying name server 'a1.verisigndns.com'.
DNS: Lookup 'a1.verisigndns.com'.
AAAA a1.verisigndns.com 2001:500:7967::2:33 Expires Tue Dec 13 03:24:07 2016
DNS: Found 1 results.
DNS: Lookup 'namecheap.com'.
DNS: Failed to execute query: Address family not supported by protocol.
getaddrinfo: Name namecheap.com Service (null): System error.
failed: Address family not supported by protocol.
wget: unable to resolve host address namecheap.com

Shouldn't it try to get an A record instead of AAA though?

vascocosta commented 7 years ago

So, just to be sure I decided to checkout a commit right before your SOA change.

git checkout -b before-soa-change 83647ce4c25068aba0bc53be66c30234458e2408
make clean && make

After this, without any other change, the issue seems to be gone, like you can see here. Sorry for the screenshot instead of a paste but I couldn't use scp to extract the log of the output due to the interrupted system call issue that Chris fixed after this commit.

evangreen commented 7 years ago

Hi Vasco. I wasn't able to reproduce exactly what you were seeing, but I think I can guess at what's going on from your logs. I pushed change 52110731ae5344952867660b4fce3d4c05b4d211, can you give that a shot? We were getting stuck because it was telling us that to resolve linux-foundation.org we needed to go to ns1.linux-foundation.org, which is actually further away from our goal than we wanted to be. This is really because this particular site doesn't seem to be IPv6 ready.

The kernel doesn't actually support ipv6 yet, but the hope was that at least the C library would be ready for it when it gets flipped on. That's the reason it's querying for AAAA records.

vascocosta commented 7 years ago

Hi Evan. Your change fixed it on QEMU, nevertheless it's now segfaulting on VirtualBox as you can see below. If the log doesn't give you any obvious hint, feel free to switch your focus to other issues. I can try to figure it out myself and in the process learn how getaddrinfo and its helper functions work. This could be related to VirtualBox itself though, as sometimes my Linux VMs there don't resolve any hostname at all.

I suppose a user mode debugger like gdb would help us greatly here. Are there any plans for such a debugger?

Resolving linux.com... DNS: Lookup 'linux.com'.
SOA linux.com ns1.linux-foundation.org Expires Thu Dec 15 08:12:21 2016
Trying name server 'ns1.linux-foundation.org'.
DNS: Lookup 'ns1.linux-foundation.org'.
SOA linux-foundation.org ns1.linux-foundation.org Expires Thu Dec 15 08:17:04 2016
Skipping name server ns1.linux-foundation.org, subdomain of ns1.linux-foundation.org
wget terminated by signal 11: Segmentation fault
evangreen commented 7 years ago

Hi Vasco. Yeah, I'm having trouble reproducing this on my side. What are the IP addresses of your DNS servers? Maybe I can ping them myself.

Sure, if you wanted to take a look on your own it would be much appreciated. A couple of initial hints:

vascocosta commented 7 years ago

To eliminate one further variable I was able to force VirtualBox to use a Google name server on the DHCP lease:

Minoca debugger version 1.1. Protocol version 1.3.
Waiting to connect...
Connected to Minoca OS on x86
System Version 0.3.1.2029 development debug vasco-issue-86-c8ec3a2 Dec 15 2016 00:46:52
Built on Thu Dec 15, 2016 12:46 AM.
New Device: RawDisk, 0x801b7958
New Device: Partition1, 0x801ba460
New Device: Partition2, 0x801b85e8
New Device: Volume0, 0x801ba578
New Device: Volume1, 0x801b9fb0
NET: 08:00:27:B9:20:F8 up, Speed 100 mbps
DHCP Assignment:
           Server IP: 10.0.2.2
          Offered IP: 10.0.2.15
           Router IP: 10.0.2.2
       DNS Server IP: 8.8.8.8
          Lease Time: 1 day.

Then, as you mentioned, this segmentation fault was causing a break into the kernel debugger:

Debugger: 0xd687b2516, Target: 0xd867d4529
libminocaos.so!RtlStringLength+0x6:
777e491d cmp    [edx], 0x0  ; 803a00
kd>kn
No Frame    RetAddr  Call Site
 0 7fffe640 776b9fa9 libminocaos.so!RtlStringLength+0x6
 1 7fffe660 7769e158 libc.so!strlen+0x1a
 2 7fffe690 7769e528 libc.so!ClpIsNameSubdomain+0x1f
 3 7fffe730 7769f273 libc.so!ClpPerformDnsTranslation+0x396
 4 7fffe780 7769e5b7 libc.so!ClpGetNameServerAddress+0xbf
 5 7fffe830 7769eb25 libc.so!ClpPerformDnsTranslation+0x425
 6 7fffe970 0004c49a libc.so!getaddrinfo+0x4ae
 7 7fffeb68 0003d1ad sh+0x4b49a
 8 7fffeb98 00027eb5 sh+0x3c1ad
 9 7fffebf8 000292a9 sh+0x26eb5
10 7fffec68 000288a6 sh+0x282a9
11 7fffecb8 00028a23 sh+0x278a6
12 7fffed08 00036571 sh+0x27a23
13 7fffed68 0003d1ad sh+0x35571
14 7fffed98 0003d4e2 sh+0x3c1ad
15 7fffedf8 0005ae82 sh+0x3c4e2
16 7fffee34 777d6f6d sh+0x59e82
17 7fffee84 00000000 libminocaos.so!OsDynamicLoaderMain+0x298
kd>dv
[0x7fffe640] String              : 0x00000000

<const>      Length              : 0x0
kd>

From which was clear that either Query or Domain in ClpIsNameSubdomain were a NULL reference. Thus, once RtlStringLength did the actual pointer dereferencing to 0x00000000 the segmentation fault was triggered.

By adding the following workaround to the beginning of ClpIsNameSubdomain everything works nicely and it resolves. However, the real question is why ClpPerformDnsTranslation calls ClpIsNameSubdomain with a NULL NameServer->Value or Name. That's what I'll be trying to answer tomorrow.

if (Query == NULL || Domain == NULL) {
    return FALSE;
}
evangreen commented 7 years ago

Nice, thank you for tracking that down. That was my bonehead programming on that previous change. I forgot to consider that for the original DNS servers we get from the kernel, those are just addresses, and don't have names. I pushed 1e5228e80b42f9186cafc8eae47db060c96f1e15, give it a try.

vascocosta commented 7 years ago

The latest changes fix the issue on VirtualBox. Closing this one now.