Open CVTJNII opened 7 years ago
We changed Consul's SOA and NS responses in 0.9.1; the gist is in this comment: https://github.com/hashicorp/consul/pull/3353#issuecomment-320934137
However, that code contains a panic, which is triggered when you query for the SOA record directly. The fix is in https://github.com/hashicorp/consul/pull/3408, which is only on master right now. We should have a 0.9.3 release out soon.
Also, I agree that the SOA fields should be configurable and I'm going to pick this up after the config refactoring I'm working on.
We have the same scenario and behaviour as described by @CVTJNII. We are using BIND, and I'm having trouble finding a workaround.
Do you guys have any ideas?
Consul version: v1.3.0
check_monitors = 1
check_ttls = 0
checks = 6
services = 1
build:
prerelease =
revision = e8757838
version = 1.3.0
consul:
bootstrap = false
known_datacenters = 2
leader = false
leader_addr = 10.94.120.18:8300
server = true
raft:
applied_index = 5419372
commit_index = 5419372
fsm_pending = 0
last_contact = 40.821696ms
last_log_index = 5419372
last_log_term = 1555
last_snapshot_index = 5409030
last_snapshot_term = 1555
latest_configuration = [{Suffrage:Voter ID:8034b686-0ed8-750e-4b29-da34f35efb44 Address:10.94.120.6:8300} {Suffrage:Voter ID:fb05db25-880b-820a-6824-83bb18642377 Address:10.94.120.18:8300} {Suffrage:Voter ID:27f328ce-fc79-026e-c9f2-23055610d461 Address:10.94.120.17:8300}]
latest_configuration_index = 3639368
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 1555
runtime:
arch = amd64
cpu_count = 4
goroutines = 199
max_procs = 4
os = linux
version = go1.11.1
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 155
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 568
members = 36
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 253
members = 6
query_queue = 0
query_time = 1
Same issue with Consul 1.6.2 :(
@CVTJNII, @magiconair did you find a workaround?
Hi @olivierHa,
Would you mind providing a bit more detail about your environment, and the exact error you're seeing in the Consul & DNS server logs?
I was able to successfully configure BIND to forward queries to Consul using standard DNS delegation as well as using a static-stub zone type. It seems subdomain delegation & lookups should work. It would be helpful to have more details about your environment to troubleshoot.
Thanks.
The same thing is happening to me, but only with the alt-domain; the main domain works properly.
For instance, the primary domain is ha.domain.net (it works), and the alt-domain is ha.domain.org (it doesn't work).
The logs show errors for MX and AAAA records:
Mar 23 13:47:16 prod-consul02.domain.net named[32637]: FORMERR resolving 'puppet7.service.ha.domain.org/AAAA/IN': 127.0.0.1#8600
Mar 23 13:47:16 prod-consul02.domain.net named[32637]: FORMERR resolving 'puppet7.service.ha.domain.org/MX/IN': 127.0.0.1#8600
Please let me know how I can help.
Any progress on this?
We have Consul running on a Linux box with BIND in front. Windows DNS uses this as a conditional forwarder for, amongst other things, the service.consul domain.
So:
- Windows DNS has a conditional forwarder service.consul -> BIND
- BIND has a zone service.consul of type forward, forwarding to Consul
- Consul provides service.consul resolution.
Then we have a Linux box running Docker that needs to pull images via proxy.service.consul.
Now what happens is that the Docker host queries for proxy.service.consul. It (nearly instantly) gets an A record, but it also wants AAAA due to v6 stack preference. AAAA doesn't exist, however. It attempts to get this twice before falling back. Unfortunately, by that time Docker has killed the process, as it's been waiting for over 20s for any kind of response.
In this situation proxy.service.consul is forwarded to Windows DNS, which forwards to BIND, which forwards to consul.
BIND sees this response:
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @a.b.c.d proxy.service.consul aaaa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33198
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;proxy.service.consul. IN AAAA

;; AUTHORITY SECTION:
consul. 0 IN SOA ns.consul. hostmaster.consul. 1644910405 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: a.b.c.d
;; WHEN: di feb 15 08:33:25 CET 2022
;; MSG SIZE rcvd: 99
Which logs these lines in named query logs:
2022-02-15_07:34:43.13423 15-Feb-2022 08:34:43.133 DNS format error from a.b.c.d resolving proxy.service.consul/AAAA for client 127.0.0.1#35439: Name consul (SOA) not subdomain of zone service.consul -- invalid response
2022-02-15_07:34:43.13424 15-Feb-2022 08:34:43.133 FORMERR resolving 'proxy.service.consul/AAAA/IN': a.b.c.d
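The BIND message above says that the SOA owner name (consul) is not at or below the forwarded zone (service.consul). A minimal Python sketch of that kind of name comparison follows. It is an illustration only, not BIND's actual implementation, and the helper names `labels_reversed` and `is_within_zone` are hypothetical.

```python
def labels_reversed(name: str) -> tuple:
    """Split a DNS name into lower-cased labels, ordered from the root down."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return tuple(reversed(labels))


def is_within_zone(owner: str, zone: str) -> bool:
    """Return True if `owner` equals `zone` or is a subdomain of it."""
    owner_labels = labels_reversed(owner)
    zone_labels = labels_reversed(zone)
    # The owner is inside the zone when the zone's labels are a prefix
    # of the owner's labels (both counted from the root).
    return owner_labels[:len(zone_labels)] == zone_labels


if __name__ == "__main__":
    # SOA owner from the Consul response vs. the zone BIND forwards.
    print(is_within_zone("consul.", "service.consul"))               # False
    print(is_within_zone("proxy.service.consul", "service.consul"))  # True
```

In this sketch the SOA owner `consul.` fails the comparison against `service.consul`, which matches the "not subdomain of zone" wording in the log.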
Windows DNS gets this response:
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @localhost proxy.service.consul aaaa
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 431
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;proxy.service.consul. IN AAAA

;; Query time: 17 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: di feb 15 08:34:43 CET 2022
;; MSG SIZE rcvd: 49
Because BIND returns SERVFAIL rather than an empty NOERROR response (on both BIND -> Consul hosts; they're redundant), Windows DNS at that point falls back to using root hints. That traffic is dropped by the firewall, causing the huge delays, but it probably wouldn't help much if it were allowed, as .consul isn't available externally.
Best solution would be to get BIND to return a NOERROR instead of SERVFAIL imho. Easiest way to get that working seems to be to get a NOERROR back with a SOA pointing to a NS that resolves.
Forwarding the entire consul. zone to consul doesn't help by the way, as it doesn't resolve ns.consul.
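For reference, the SOA line in the authority section of the Consul response can be split into its fields. The sketch below parses one dig-style SOA line and computes the negative-caching TTL as the smaller of the record TTL and the SOA MINIMUM field (the rule from RFC 2308). The names `SOA`, `parse_soa_line` and `negative_ttl` are hypothetical helpers, not part of Consul or BIND.

```python
from typing import NamedTuple


class SOA(NamedTuple):
    owner: str    # zone the SOA claims authority for
    ttl: int      # TTL of the SOA record itself
    mname: str    # primary nameserver
    rname: str    # responsible mailbox
    serial: int
    refresh: int
    retry: int
    expire: int
    minimum: int  # SOA MINIMUM field


def parse_soa_line(line: str) -> SOA:
    """Parse a dig-style line: '<owner> <ttl> IN SOA <mname> <rname> <5 ints>'."""
    parts = line.split()
    if len(parts) != 11 or parts[2] != "IN" or parts[3] != "SOA":
        raise ValueError(f"not an IN SOA record line: {line!r}")
    numbers = [int(value) for value in parts[6:11]]
    return SOA(parts[0], int(parts[1]), parts[4], parts[5], *numbers)


def negative_ttl(soa: SOA) -> int:
    """Negative-caching TTL: min(SOA record TTL, SOA MINIMUM)."""
    return min(soa.ttl, soa.minimum)


if __name__ == "__main__":
    soa = parse_soa_line(
        "consul. 0 IN SOA ns.consul. hostmaster.consul. 1644910405 3600 600 86400 0"
    )
    print(soa.owner, soa.mname, negative_ttl(soa))
```

For the line from the response shown earlier, the owner is `consul.`, the primary nameserver is `ns.consul.`, and both the record TTL and MINIMUM are 0.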
Found some other references to this: https://github.com/hashicorp/consul/pull/1798 https://github.com/hashicorp/consul/issues/1755
Any progress on this issue?
Consul version: v0.7.2
Server information:
Description of the Issue (and unexpected/desired result)
We are delegating a subdomain from Bind9 to Consul for service discovery. We have configured the datacenter and domain values properly, and delegation of A records works. However, lookups of AAAA records for valid services fail. The problem is limited to AAAA lookups of valid services: the service has valid A records for IPv4 but no AAAA records, as the backing servers currently do not have IPv6 addresses. Lookups of invalid services pass, as Consul returns NXDOMAIN for them.
In troubleshooting on the Bind9 side I see Bind is reporting FORMERR. Note the following log snippet is sanitized:
In reading about similar issues on the Bind9 user list, I believe this is due to the SOA record being incorrect. See the following post with the comment "This one fails to return the CNAME to content.sjc1.site.voxcdn.net when the query type is AAAA so you get a unrelated SOA record." https://groups.google.com/forum/#!topic/comp.protocols.dns.bind/B-9RPmaJdjQ This makes sense for my error, as I see the empty NOERROR response for the AAAA lookup returns a SOA record with ns.domain as the authoritative nameserver, which is wrong.
Looking over the Consul docs, I do not see how to configure the SOA record for the delegated domain in Consul. Based on the docs at https://www.consul.io/docs/agent/options.html#dns_config I am under the impression that ns.domain and hostmaster.domain are hardcoded defaults. I see PR #1798 was opened to allow this record to be settable, but the author closed the PR without it being merged.
This is a nuisance problem: while the A record lookup works, Bind passes SERVFAIL to clients that try to look up AAAA records first, because it rejects the response from Consul and so cannot produce a response itself. The clients retry on SERVFAIL until they time out and fall back to the A record, adding about 10s to all API requests to services using Consul DNS in our environment.
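To see what Consul itself returns, independent of Bind, the AAAA query can be sent straight to the Consul DNS port. The following is a minimal standard-library Python sketch that builds a single-question AAAA query and decodes the response header. The function names are hypothetical, and the server address and port in the usage note (127.0.0.1, 8600) are assumptions taken from the logs earlier in this thread; adjust them to your environment.

```python
import socket
import struct

QTYPE_AAAA = 28  # AAAA record type
QCLASS_IN = 1    # Internet class


def build_query(name: str, qtype: int = QTYPE_AAAA, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query (no EDNS, recursion not requested)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


def parse_header(response: bytes) -> dict:
    """Decode the 12-byte DNS header into the fields dig summarises."""
    _id, flags, _qd, answer, authority, additional = struct.unpack(
        "!HHHHHH", response[:12]
    )
    return {
        "rcode": flags & 0x000F,               # 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
        "authoritative": bool(flags & 0x0400),  # AA flag
        "answer": answer,
        "authority": authority,
        "additional": additional,
    }


def query_udp(server: str, port: int, name: str, timeout: float = 2.0) -> dict:
    """Send the AAAA query over UDP and return the decoded response header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        data, _addr = sock.recvfrom(4096)
    return parse_header(data)
```

Calling, for example, `query_udp("127.0.0.1", 8600, "proxy.service.consul")` against a reachable Consul agent returns the rcode and section counts, so an empty NOERROR answer with one authority record can be distinguished from the SERVFAIL that Bind hands to clients.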