telmich closed this issue 2 years ago
OK, I was worried that IPv6 would be broken. It could indeed be the parser that chokes on "search". However when reading the parser, it seems to ignore any line but those starting with "nameserver".
There could be another explanation. It reads the config in lines of at most LINESIZE characters. LINESIZE is defined as 2048 only when it is not already defined. It is possible that in your environment it is already defined to a much lower value and that this breaks the parser. Well, looking at the code more closely, even then it should not break, given that there is no "nameserver" word on your "search" line that could be matched by accident. Are you certain this is the only difference?
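To make the reasoning above concrete, here is a minimal Python sketch of a parser that keeps only "nameserver" lines and reads input in bounded chunks. It is an illustration under stated assumptions, not the actual code in resolvers.c: the function name `parse_nameservers` and the chunking model (an over-long line arriving as several pieces, fgets-style) are hypothetical.

```python
LINESIZE = 2048  # assumed cap, mirroring the default value mentioned above


def parse_nameservers(text, linesize=LINESIZE):
    """Return the addresses found on lines whose first word is 'nameserver'."""
    servers = []
    step = linesize - 1  # fgets-style: at most linesize-1 characters per read
    for raw in text.splitlines():
        # An over-long line is seen as several consecutive chunks.
        chunks = [raw[i:i + step] for i in range(0, len(raw), step)] or [""]
        for chunk in chunks:
            fields = chunk.split()
            if len(fields) >= 2 and fields[0] == "nameserver":
                servers.append(fields[1])
    return servers


RESOLV_CONF = """\
search default.svc.p10.k8s.ooo svc.p10.k8s.ooo p10.k8s.ooo place10.ungleich.ch
nameserver 2a0a:e5c0:10:3::a
options ndots:5
"""

print(parse_nameservers(RESOLV_CONF))
```

Even with a very small chunk size, the "search" line can never produce a match in this model, because the word "nameserver" does not occur anywhere in it. A small cap would instead cut the nameserver address itself short.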
If it was 2048, it should not be a problem, as the line is only 78 characters, excluding the line break:
% echo -n search default.svc.p10.k8s.ooo svc.p10.k8s.ooo p10.k8s.ooo place10.ungleich.ch | wc -c
78
I am testing now again, because I also don't spot the obvious bug in resolvers.c. Using the "broken" resolv.conf from above:
[16:36] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg; date
Wed Oct 20 16:36:29 CEST 2021
search default.svc.p10.k8s.ooo svc.p10.k8s.ooo p10.k8s.ooo place10.ungleich.ch
nameserver 2a0a:e5c0:10:3::a
options ndots:5
[WARNING] (25815) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
[WARNING] (25815) : Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[NOTICE] (25815) : haproxy version is 2.4.7-b5e51a5
[ALERT] (25815) : backend 'http_www.ungleich.ch' has no server available!
backend http_www.ungleich.ch has no server available!
^C
[16:37] bridge:~% date
Wed Oct 20 16:37:06 CEST 2021
[16:37] bridge:~%
It fails to resolve www.ungleich.ch after 10s, when the resolver kicks in. Testing manual resolution, it works perfectly fine:
[16:37] bridge:~% dig +short www.ungleich.ch a
dynamicweb-production.ungleich.ch.
185.203.112.17
[16:37] bridge:~% dig +short www.ungleich.ch aaaa
dynamicweb-production.ungleich.ch.
2a0a:e5c0:0:2:400:b3ff:fe39:795c
[16:37] bridge:~%
Now deleting the search line, waiting for 10+ seconds, the problem STILL exists!
[16:38] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg
Wed Oct 20 16:38:30 CEST 2021
nameserver 2a0a:e5c0:10:3::a
options ndots:5
[WARNING] (26239) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
[WARNING] (26239) : Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[NOTICE] (26239) : haproxy version is 2.4.7-b5e51a5
[ALERT] (26239) : backend 'http_www.ungleich.ch' has no server available!
backend http_www.ungleich.ch has no server available!
^C
[16:39] bridge:~% date
Wed Oct 20 16:39:10 CEST 2021
[16:39] bridge:~%
So my first finding seems to have been incorrect! Now testing with the options ndots:5 line also removed, it also fails:
[16:40] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg
Wed Oct 20 16:40:24 CEST 2021
nameserver 2a0a:e5c0:10:3::a
[WARNING] (26636) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
[WARNING] (26636) : Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Server http_www.ungleich.ch/www.ungleich.ch is going DOWN for maintenance (unspecified DNS error). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[NOTICE] (26636) : haproxy version is 2.4.7-b5e51a5
[ALERT] (26636) : backend 'http_www.ungleich.ch' has no server available!
backend http_www.ungleich.ch has no server available!
^C
[16:41] bridge:~% date
Wed Oct 20 16:41:04 CEST 2021
[16:41] bridge:~%
So the only line left in /etc/resolv.conf is
nameserver 2a0a:e5c0:10:3::a
Triple checking that the resolv.conf works for dig and ping:
[16:42] bridge:~% dig +short www.ungleich.ch a
dynamicweb-production.ungleich.ch.
185.203.112.17
[16:42] bridge:~% dig +short www.ungleich.ch aaaa
dynamicweb-production.ungleich.ch.
2a0a:e5c0:0:2:400:b3ff:fe39:795c
[16:42] bridge:~% cat /etc/resolv.conf
nameserver 2a0a:e5c0:10:3::a
[16:42] bridge:~% ping -c2 -6 www.ungleich.ch
PING www.ungleich.ch (2a0a:e5c0:0:2:400:b3ff:fe39:795c): 56 data bytes
64 bytes from 2a0a:e5c0:0:2:400:b3ff:fe39:795c: seq=0 ttl=62 time=1.740 ms
64 bytes from 2a0a:e5c0:0:2:400:b3ff:fe39:795c: seq=1 ttl=62 time=1.546 ms
--- www.ungleich.ch ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.546/1.643/1.740 ms
[16:42] bridge:~% ping -c2 -4 www.ungleich.ch
PING www.ungleich.ch (185.203.112.17): 56 data bytes
--- www.ungleich.ch ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
[16:42] bridge:~%
(Ignore the ping failure for IPv4; these are IPv6-only networks. What matters is that the resolving works.)
Now trying with a different nameserver, it seems to work, even 30s after startup:
[16:45] bridge:~% date; cat /etc/resolv.conf; sudo haproxy -f ./haproxy.cfg
Wed Oct 20 16:45:59 CEST 2021
nameserver 2a0a:e5c0:10:a::a
[WARNING] (27801) : parsing [./haproxy.cfg:26] : 'server http.ungleich.ch' : could not resolve address 'http.ungleich.ch', disabling server.
Connect from ::ffff:127.0.0.1:49018 to ::ffff:127.0.0.1:80 (f_http/HTTP)
^C
[16:47] bridge:~% date
Wed Oct 20 16:47:25 CEST 2021
[16:47] bridge:~%
So what are the differences?
I have taken two pcap dumps and they both look pretty much the same to me, so I wonder what makes haproxy work in one case and not in the other.
haproxy-bind.pcap.gz haproxy-coredns.pcap.gz
Does that make any sense to you?
Strange, I cannot open your pcaps with either tcpdump or wireshark, although the gunzip phase is OK. Did you maybe transfer them over FTP before compressing them? I had a quick look at the binary contents but can't spot anything obvious. All seem to advertise all records (I'm probably wrong, as I don't speak DNS fluently in hex); however, I'm seeing coredns advertising root servers while bind does not. Maybe this could be related to what you're seeing? I confess that at this point I'm a bit lost. I'm adding @bedis and @EmericBr since they speak DNS better than I do and know better how this is supposed to work.
The CoreDNS response with all the verbose data (root servers in the additional section, etc.) exceeds the default requested UDP payload size of 512 bytes; therefore the response is TRUNCATED and, as such, ignored by haproxy.
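As a rough illustration of the mechanism described above, the following Python sketch checks the TC (truncated) bit in a DNS message header and builds the EDNS0 OPT pseudo-record through which a client advertises the UDP payload size it accepts. The helper names `is_truncated` and `edns0_opt_record` are hypothetical, and this is not how haproxy implements it internally; the sketch only follows the wire formats from RFC 1035 and RFC 6891.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the DNS header flags word (RFC 1035)


def is_truncated(message: bytes) -> bool:
    """Return True if the DNS message header has the TC bit set."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    _msg_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_FLAG)


def edns0_opt_record(payload_size: int) -> bytes:
    """Build an OPT pseudo-RR (RFC 6891) advertising the accepted UDP size."""
    # NAME = root (one zero byte), TYPE = 41 (OPT), CLASS = payload size,
    # TTL = 0 (extended RCODE, version, flags), RDLENGTH = 0 (no options).
    return b"\x00" + struct.pack("!HHIH", 41, payload_size, 0, 0)


# Header only: id, flags (QR + TC + RD + RA), then four zeroed count fields.
truncated_reply = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 0)
print(is_truncated(truncated_reply))
print(edns0_opt_record(4096).hex())
```

Without an OPT record in the query, a server has to assume the classic 512-byte limit and sets TC when the answer does not fit.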
Raise accepted_payload_size in the resolver section to the recommended 4096 bytes.
http://cbonte.github.io/haproxy-dconv/2.4/configuration.html#5.3.2-accepted_payload_size
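For reference, a resolvers section with this setting could look roughly like the fragment below. The haproxy.cfg used in this issue is not shown, so the section name `mydns`, the use of `parse-resolv-conf`, and the commented server line (including the port) are assumptions made for illustration only.

```
resolvers mydns
    # take the nameserver entries from /etc/resolv.conf (assumed setup)
    parse-resolv-conf
    # announce a larger UDP payload size than the 512-byte default
    accepted_payload_size 4096

# hypothetical server line referencing the section above:
# server www.ungleich.ch www.ungleich.ch:80 resolvers mydns
```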
I didn't notice, thanks Lukas! I was pretty sure I had seen the parameter here, but I confused it with another issue. I think the DNS error reporting needs a serious overhaul so that users can detect such issues more easily. For example, we could keep a counter of truncated responses (if there isn't one yet), and when a server goes down due to resolution errors, if the counter is not zero, we display it and suggest that this parameter may need to be adjusted.
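The reporting idea above could be sketched as follows in Python. This is purely illustrative of the proposal and not haproxy code: `ResolverStats` and `down_message` are hypothetical names, and the base message text is modeled on the warnings quoted earlier in this thread.

```python
from dataclasses import dataclass


@dataclass
class ResolverStats:
    truncated: int = 0  # responses received with the TC bit set


def down_message(server: str, stats: ResolverStats) -> str:
    """Build the 'going DOWN' message, adding a hint if truncation was seen."""
    msg = f"Server {server} is going DOWN for maintenance (unspecified DNS error)."
    if stats.truncated:
        msg += (
            f" {stats.truncated} truncated DNS response(s) seen;"
            " consider raising accepted_payload_size."
        )
    return msg


stats = ResolverStats()
stats.truncated += 1  # would be incremented whenever a truncated reply arrives
print(down_message("http_www.ungleich.ch/www.ungleich.ch", stats))
```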
Can we close this issue ?
Yes!
Thanks !
This worked without accepted_payload_size for us in 2.4.4, but now in 2.4.12 we had to set it to get the backends up.
Detailed Description of the Problem
Running haproxy under kubernetes using a resolver fails with an "unspecified DNS error".
The resolv.conf in question is:
Expected Behavior
It parses the resolv.conf and does DNS resolution.
Steps to Reproduce the Behavior
Observe the following:
Do you have any idea what may have caused this?
Probably the length of the search directive. Deleting that line makes it work.
Do you have an idea how to solve the issue?
See above
What is your configuration?
Tested both natively on Alpine and in containers with Alpine and Debian. This is not a musl problem.
Note: I am using the 2.4.7 and 2.4.7-alpine images for testing.