robur-coop / happy-eyeballs

An implementation of happy eyeballs (RFC 8305) in OCaml with lwt
ISC License

DNS resolver failure mode #39

Open reynir opened 5 months ago

reynir commented 5 months ago

On my home network the router listens on TCP port 53, but when querying the DNS resolver over TCP the resolver does not respond. This is an (annoying) failure mode we currently don't handle.

$ nc -v 192.168.1.1 53
Connection to 192.168.1.1 53 port [tcp/domain] succeeded!
^C
$ dig +tcp @192.168.1.1 reyn.ir
;; communications error to 192.168.1.1#53: timed out
;; communications error to 192.168.1.1#53: timed out
;; communications error to 192.168.1.1#53: timed out

; <<>> DiG 9.18.24-1-Debian <<>> +tcp @192.168.1.1 reyn.ir
; (1 server found)
;; global options: +cmd
;; no servers could be reached
reynir commented 5 months ago

I observe this in http-lwt-client with a DNS timeout even if I have a more responsive name server as second entry in resolv.conf.

hannesm commented 5 months ago

Hmm, so there are two ways forward I guess:

I'm a fan of (b1). And still undecided about (b2) or (a) -- while (b2) has the advantage that we won't have to mess around with it anymore, the disadvantage is that the libc resolver is used (i.e. potential security issues; also, using dns-client less eventually leads to more bugs in it). The disadvantage of (a) is that it is rather complicated (when to use TCP / when to use UDP, and esp. in the scenarios described above, what the default should be and what the error behaviour should be).

I remember that the dns-resolver code has some parts about retransmitting queries and using TCP if truncated etc. -- would be nice to leverage (maybe first test it more and debug issues) that code for reuse between dns-client and dns-resolver eventually.
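To make the "retransmit over UDP, use TCP if truncated" idea concrete, here is a minimal sketch of that control flow as a pure function. It is not the dns-resolver or dns-client code: the `udp_reply` type, the `query_udp_then_tcp` name, and the `udp` / `tcp` callbacks are all hypothetical stand-ins for real network operations.

```ocaml
(* Hypothetical model of what a single UDP query can yield.
   These are illustration types, not dns-client's or dns-resolver's. *)
type udp_reply =
  | Full of string   (* the complete answer fit in the UDP reply *)
  | Truncated        (* the reply had the TC (truncated) bit set *)
  | Udp_timeout      (* nothing arrived before the timeout *)

(* Query over UDP first. Retransmit up to [retries] times on timeout,
   and switch to TCP as soon as a truncated reply is seen.
   [udp] and [tcp] are assumed callbacks supplied by the caller. *)
let query_udp_then_tcp ~udp ~tcp ~retries name =
  let rec attempt remaining =
    match udp name with
    | Full answer -> Ok answer
    | Truncated -> tcp name
    | Udp_timeout when remaining > 0 -> attempt (remaining - 1)
    | Udp_timeout -> Error "udp: no reply"
  in
  attempt retries

(* Simulated transport: two UDP timeouts, then a truncated reply. *)
let () =
  let udp_calls = ref 0 in
  let udp _name =
    incr udp_calls;
    if !udp_calls < 3 then Udp_timeout else Truncated
  in
  let tcp name = Ok ("tcp answer for " ^ name) in
  match query_udp_then_tcp ~udp ~tcp ~retries:2 "reyn.ir" with
  | Ok answer -> print_endline answer
  | Error e -> print_endline ("error: " ^ e)
```

What this sketch leaves open is the hard part of (a): what to do when the TCP fallback itself hangs, which is the situation in the original report.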

So, which path to take? Should we take a look at (b1) at least? From that point it'd be easier to move on, I suspect. (And both Unix.getaddrinfo and dns-client-lwt could be options, the only question is what to use as default -- and currently I lean towards getaddrinfo). WDYT?

gsportrix commented 5 months ago

I don't know if this is helpful at all, but I found that I can only connect to some servers at home using http-lwt-client. While www.ocaml.org shows a DNS request timeout, www.google.com works.

I've tried some more; it feels like two out of 10 work...

I don't see any differences in the output of dig.

gsportrix commented 5 months ago

One thing... yesterday evening it worked at home... oh, I am connected to my company's VPN.

Turning off VPN: DNS request timeout, dig query times in the 10 to 25 ms range.
Turning on VPN: everything is fine again, dig query times in the 5 to 15 ms range.

reynir commented 5 months ago

I think it is worthwhile to implement udp in dns-client-lwt either way.

My mental model of this is that the DNS happy-eyeballs observes a successful TCP handshake and considers it a done deal. Then my resolver doesn't reply and the request times out. It seems no other nameservers are attempted since doing the TCP handshake is considered a success?! So I don't know if we should try to communicate back to happy-eyeballs "don't try this nameserver+port next".
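The failure mode described above, and one possible remedy, can be sketched as a pure function: treat "handshake succeeded but no reply" the same as a failed connect, and move on to the next nameserver. This is only a model under assumptions, not the happy-eyeballs or dns-client API; the `outcome` type, `resolve_with_fallback`, and `simulated_query` are hypothetical names, and the addresses are taken from this thread.

```ocaml
(* Hypothetical outcome of talking to one nameserver over TCP.
   Illustration only; these are not happy-eyeballs types. *)
type outcome =
  | Answer of string   (* a reply arrived *)
  | Connect_failed     (* the TCP handshake itself failed *)
  | No_reply           (* handshake succeeded, but no reply before the timeout *)

(* Try each nameserver in order. A nameserver that accepts the
   connection but never replies is skipped just like one that
   refuses it. Returns the answer together with the skipped servers,
   or the list of all servers tried if none replied. *)
let resolve_with_fallback ~query nameservers =
  let rec go skipped = function
    | [] -> Error (List.rev skipped)
    | ns :: rest ->
      match query ns with
      | Answer a -> Ok (a, List.rev skipped)
      | Connect_failed | No_reply -> go (ns :: skipped) rest
  in
  go [] nameservers

(* Simulated network: the home router accepts TCP but stays silent. *)
let simulated_query = function
  | "192.168.1.1" -> No_reply
  | "1.1.1.1" -> Answer "51.159.83.169"
  | _ -> Connect_failed

let () =
  match resolve_with_fallback ~query:simulated_query [ "192.168.1.1"; "1.1.1.1" ] with
  | Ok (addr, skipped) ->
    Printf.printf "answer %s (skipped: %s)\n" addr (String.concat ", " skipped)
  | Error tried ->
    Printf.printf "no nameserver replied (tried: %s)\n" (String.concat ", " tried)
```

The list of skipped servers is the kind of information that could be passed back to happy-eyeballs as "don't try this nameserver+port next"; whether that feedback channel should exist is the open question here.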

@gsportrix Can you test with dig +tcp and see if that fails?

gsportrix commented 5 months ago

@reynir I have tried it: dig +tcp works in both cases. Only the DNS itself changed.

So I tried different DNS servers:

8.8.8.8 works with http-lwt-client
1.1.1.1 works with http-lwt-client
84.200.69.80 works sometimes with http-lwt-client, sometimes not:

Dig outputs...

➜  dig +tcp @84.200.69.80 www.ocaml.org

; <<>> DiG 9.10.6 <<>> +tcp @84.200.69.80 www.ocaml.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25497
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ocaml.org.         IN  A

;; ANSWER SECTION:
www.ocaml.org.      155 IN  A   51.159.83.169

;; Query time: 758 msec
;; SERVER: 84.200.69.80#53(84.200.69.80)
;; WHEN: Wed Mar 13 22:02:28 CET 2024
;; MSG SIZE  rcvd: 58

➜  dig +tcp @84.200.69.80 www.ocaml.org
;; Connection to 84.200.69.80#53(84.200.69.80) for www.ocaml.org failed: timed out.
;; Connection to 84.200.69.80#53(84.200.69.80) for www.ocaml.org failed: timed out.

; <<>> DiG 9.10.6 <<>> +tcp @84.200.69.80 www.ocaml.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45860
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ocaml.org.         IN  A

;; ANSWER SECTION:
www.ocaml.org.      134 IN  A   51.159.83.169

;; Query time: 14 msec
;; SERVER: 84.200.69.80#53(84.200.69.80)
;; WHEN: Wed Mar 13 22:02:50 CET 2024
;; MSG SIZE  rcvd: 58

So for now it's just my silly homebox that does not work without any hints in dig...

hannesm commented 3 months ago

While this issue has not been addressed, since happy-eyeballs 1.0.0, http-lwt-client will use the standard getaddrinfo() interface instead of the DNS stack developed in OCaml. So, if you upgrade to happy-eyeballs >= 1.0.0, you shouldn't encounter this issue anymore with http-lwt-client.

I will leave this issue open, since it seems we still should improve the failure mode of the DNS client.