
haproxy DNS resolution fails with kubernetes generated resolv.conf #1421

Closed: telmich closed this issue 2 years ago

telmich commented 3 years ago

### Detailed Description of the Problem

Running haproxy under kubernetes using a resolver fails with an "unspecified DNS error".

The resolv.conf in question is:

search default.svc.p10.k8s.ooo svc.p10.k8s.ooo p10.k8s.ooo place10.ungleich.ch
nameserver 2a0a:e5c0:10:3::a
options ndots:5

### Expected Behavior

It parses the resolv.conf and does DNS resolution.

### Steps to Reproduce the Behavior

  1. Use the above resolv.conf
  2. Use the following haproxy.cfg:
global
    log stdout format raw local0

defaults
    retries                 3
    log global
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s

# Add resolver support so that we can ignore resolving issues
resolvers mydns
    parse-resolv-conf

frontend f_http
    bind ipv6@:80
    mode http
    use_backend http_http.ungleich.ch if { hdr(host) -i http.ungleich.ch }
    use_backend http_www.ungleich.ch if { hdr(host) -i www.ungleich.ch }

backend http_http.ungleich.ch
    mode http
    server http.ungleich.ch ipv6@http.ungleich.ch resolvers mydns

backend http_www.ungleich.ch
    mode http
    server www.ungleich.ch ipv6@www.ungleich.ch resolvers mydns

Observe the following:

Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

### Do you have any idea what may have caused this?

Probably the length of the search directive. Deleting that line makes it work.

### Do you have an idea how to solve the issue?

See above

### What is your configuration?

nb3:~# haproxy -v
HAProxy version 2.4.7-b5e51a5 2021/10/04 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.7.html
Running on: Linux 5.14.13-0-edge #1-Alpine SMP PREEMPT Mon, 18 Oct 2021 08:09:50 +0000 x86_64

Tested both natively on Alpine and in containers with Alpine and Debian. This is not a musl problem.


### Output of `haproxy -vv`

```plain
nb3:~# haproxy -vv
HAProxy version 2.4.7-b5e51a5 2021/10/04 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.7.html
Running on: Linux 5.14.13-0-edge #1-Alpine SMP PREEMPT Mon, 18 Oct 2021 08:09:50 +0000 x86_64
Build options :
  TARGET  = linux-musl
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_NS=1 USE_PROMEX=1
  DEBUG   = 

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED -BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT -QUIC +PROMEX -MEMORY_PROFILING

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1l  24 Aug 2021
Running on OpenSSL version : OpenSSL 1.1.1l  24 Aug 2021
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.4.3
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.45 2021-06-15
Running on PCRE version : 8.45 2021-06-15
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.3.1 20210921

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2       flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
            fcgi : mode=HTTP       side=BE        mux=FCGI     flags=HTX|HOL_RISK|NO_UPG
       <default> : mode=HTTP       side=FE|BE     mux=H1       flags=HTX
              h1 : mode=HTTP       side=FE|BE     mux=H1       flags=HTX|NO_UPG
       <default> : mode=TCP        side=FE|BE     mux=PASS     flags=
            none : mode=TCP        side=FE|BE     mux=PASS     flags=NO_UPG

Available services : prometheus-exporter
Available filters :
    [SPOE] spoe
    [CACHE] cache
    [FCGI] fcgi-app
    [COMP] compression
    [TRACE] trace
```

Note: I am using the 2.4.7 and 2.4.7-alpine images for testing.



### Last Outputs and Backtraces

_No response_

### Additional Information

_No response_
wtarreau commented 3 years ago

OK, I was worried that IPv6 would be broken. It could indeed be the parser that chokes on "search". However when reading the parser, it seems to ignore any line but those starting with "nameserver".

There could be another explanation. It reads the config in lines of LINESIZE characters max. LINESIZE is defined to 2048 only when not already defined. It could be that in your environment it's already defined to a much lower value and that it breaks the parser. Well, looking at the code more closely, even then it should not do that, given that there's no "nameserver" word on your "search" line that could be matched by accident. Are you certain this is the only difference?
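
For illustration only, here is a rough shell sketch of the behavior described above (the nameserver-only parsing; this is not HAProxy's actual resolvers.c code):

```
# Rough illustration of the described behavior, not HAProxy code:
# only lines whose first word is "nameserver" contribute anything,
# so a long "search" line should simply be skipped.
awk '$1 == "nameserver" { print $2 }' /etc/resolv.conf
```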

nicoschottelius-lf commented 3 years ago

If it was 2048, it should not be a problem, as the line is only 78 characters, excluding the line break:

% echo -n  search default.svc.p10.k8s.ooo svc.p10.k8s.ooo p10.k8s.ooo place10.ungleich.ch  | wc -c
78

I am now testing again, because I also don't spot an obvious bug in resolvers.c. Using the "broken" resolv.conf from above:

[16:36] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg; date
Wed Oct 20 16:36:29 CEST 2021
search default.svc.p10.k8s.ooo svc.p10.k8s.ooo p10.k8s.ooo place10.ungleich.ch
nameserver 2a0a:e5c0:10:3::a
options ndots:5

[WARNING]  (25815) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
[WARNING]  (25815) : Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[NOTICE]   (25815) : haproxy version is 2.4.7-b5e51a5
[ALERT]    (25815) : backend 'http_www.ungleich.ch' has no server available!
backend http_www.ungleich.ch has no server available!
^C
[16:37] bridge:~% date
Wed Oct 20 16:37:06 CEST 2021
[16:37] bridge:~% 

It fails to resolve www.ungleich.ch after 10s, when the resolver kicks in. Testing manual resolution, it works perfectly fine:

[16:37] bridge:~% dig +short www.ungleich.ch a
dynamicweb-production.ungleich.ch.
185.203.112.17
[16:37] bridge:~% dig +short www.ungleich.ch aaaa
dynamicweb-production.ungleich.ch.
2a0a:e5c0:0:2:400:b3ff:fe39:795c
[16:37] bridge:~% 

Now deleting the search line, waiting for 10+ seconds, the problem STILL exists!

[16:38] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg
Wed Oct 20 16:38:30 CEST 2021
nameserver 2a0a:e5c0:10:3::a
options ndots:5

[WARNING]  (26239) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
[WARNING]  (26239) : Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[NOTICE]   (26239) : haproxy version is 2.4.7-b5e51a5
[ALERT]    (26239) : backend 'http_www.ungleich.ch' has no server available!
backend http_www.ungleich.ch has no server available!
^C
[16:39] bridge:~% date
Wed Oct 20 16:39:10 CEST 2021
[16:39] bridge:~% 

So my first finding seems to have been incorrect! Now testing with the options ndots:5 line removed as well, it still fails:

[16:40] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg      
Wed Oct 20 16:40:24 CEST 2021
nameserver 2a0a:e5c0:10:3::a

[WARNING]  (26636) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
[WARNING]  (26636) : Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[NOTICE]   (26636) : haproxy version is 2.4.7-b5e51a5
[ALERT]    (26636) : backend 'http_www.ungleich.ch' has no server available!
backend http_www.ungleich.ch has no server available!
^C
[16:41] bridge:~% date
Wed Oct 20 16:41:04 CEST 2021
[16:41] bridge:~% 

So the only line left in /etc/resolv.conf is

nameserver 2a0a:e5c0:10:3::a                                                  

Triple checking that the resolv.conf works for dig and ping:

[16:42] bridge:~% dig +short www.ungleich.ch a 
dynamicweb-production.ungleich.ch.
185.203.112.17
[16:42] bridge:~% dig +short www.ungleich.ch aaaa
dynamicweb-production.ungleich.ch.
2a0a:e5c0:0:2:400:b3ff:fe39:795c
[16:42] bridge:~% cat /etc/resolv.conf
nameserver 2a0a:e5c0:10:3::a

[16:42] bridge:~% ping -c2 -6 www.ungleich.ch
PING www.ungleich.ch (2a0a:e5c0:0:2:400:b3ff:fe39:795c): 56 data bytes
64 bytes from 2a0a:e5c0:0:2:400:b3ff:fe39:795c: seq=0 ttl=62 time=1.740 ms
64 bytes from 2a0a:e5c0:0:2:400:b3ff:fe39:795c: seq=1 ttl=62 time=1.546 ms

--- www.ungleich.ch ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.546/1.643/1.740 ms
[16:42] bridge:~% ping -c2 -4 www.ungleich.ch 
PING www.ungleich.ch (185.203.112.17): 56 data bytes

--- www.ungleich.ch ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
[16:42] bridge:~% 

(ignore the ping failure for IPv4; these are IPv6-only networks. What matters is that the resolution works.)

Now trying with a different nameserver, it seems to work, even 30s after startup:

[16:45] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg
Wed Oct 20 16:45:59 CEST 2021
nameserver 2a0a:e5c0:10:a::a

[WARNING]  (27801) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
Connect from ::ffff:127.0.0.1:49018 to ::ffff:127.0.0.1:80 (f_http/HTTP)
^C
[16:47] bridge:~% date
Wed Oct 20 16:47:25 CEST 2021
[16:47] bridge:~% 

So what are the differences?

I have taken two pcap dumps and they both look pretty much the same to me, so I wonder what makes haproxy work in one case and not in the other.
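
(For reference, a hypothetical way to take such a capture; the exact command used for these dumps is not part of the report:)

```
# Hypothetical capture command, not the one actually used here:
# record the DNS traffic to/from the resolver for later comparison.
tcpdump -ni any -w haproxy-dns.pcap port 53
```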

haproxy-bind.pcap.gz haproxy-coredns.pcap.gz

Does that make any sense to you?

wtarreau commented 3 years ago

Strange, I cannot open your pcaps with either tcpdump or wireshark, while the gunzip phase is OK. Did you transfer them over FTP before compressing them, maybe? I had a quick look at the binary contents but can't spot anything obvious. All seem to advertise all records (I'm probably wrong, as I don't speak DNS fluently in hex); however, I'm seeing coredns advertising root servers while bind does not. Maybe this could be related to what you're seeing? I confess at this point I'm a bit lost. I'm adding @bedis and @EmericBr since they're the ones who speak DNS better than me and know better how this is supposed to work.

lukastribus commented 3 years ago

The CoreDNS response, with all the verbose data (root servers in the additional section, etc.), exceeds the default requested UDP payload size of 512 bytes; therefore the response is TRUNCATED and, as such, ignored by haproxy.
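
One way to double-check this (a hypothetical verification, not something run in this thread) is to repeat the query with the 512-byte UDP limit and keep the truncated answer:

```
# Hypothetical check, not run in this thread: disable EDNS so the 512-byte
# UDP limit applies, and keep the truncated reply instead of retrying over
# TCP. A "tc" flag in the response header would confirm the truncation.
dig +noedns +ignore www.ungleich.ch a @2a0a:e5c0:10:3::a
```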

Raise accepted_payload_size in the resolvers section to the recommended 4096 bytes.

http://cbonte.github.io/haproxy-dconv/2.4/configuration.html#5.3.2-accepted_payload_size
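
As a sketch (assuming the rest of the configuration from the report stays the same), the resolvers section would become:

```
# Sketch of the suggested fix, applied to the config from the report:
resolvers mydns
    parse-resolv-conf
    # accept DNS responses up to 4096 bytes so verbose CoreDNS answers
    # are no longer truncated at the 512-byte default
    accepted_payload_size 4096
```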

wtarreau commented 3 years ago

I didn't notice, thanks Lukas! I was pretty sure I had seen the parameter here, but I confused it with another issue. I think the DNS error reporting needs a serious facelift so that such issues can be detected more easily by users. For example, we could imagine keeping a counter of truncated responses (if there isn't one yet), and when a server goes down due to a resolution error, if the counter is not zero, we display it and suggest that maybe this parameter ought to be adjusted.

capflam commented 2 years ago

Can we close this issue ?

telmich commented 2 years ago

Yes!

capflam commented 2 years ago

Thanks !

jonaz commented 2 years ago

This worked without accepted_payload_size for us in 2.4.4, but in 2.4.12 we had to set it to get the backends up.