hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Bind9 FORMERR on AAAA record lookups when delegating subdomain to Consul #3439

Open CVTJNII opened 7 years ago

CVTJNII commented 7 years ago

Consul version: v0.7.2

Server information:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease = 
        revision = 'a9afa0c
        version = 0.7.2
consul:
        bootstrap = false
        known_datacenters = 3
        leader = false
        leader_addr = [internal ip]:8300
        server = true
raft:
        applied_index = 57746118
        commit_index = 57746118
        fsm_pending = 0
        last_contact = 7.289316ms
        last_log_index = 57746118
        last_log_term = 3528
        last_snapshot_index = 57744807
        last_snapshot_term = 3528
        latest_configuration = [{Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300} {Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300} {Suffrage:Voter ID:[internal ip]:8300 Address:[internal ip]:8300}]
        latest_configuration_index = 1
        num_peers = 2
        protocol_version = 1
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 3528
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 224
        max_procs = 2
        os = linux
        version = go1.7.3
serf_lan:
        encrypted = true
        event_queue = 0
        event_time = 1368
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 31
        members = 6
        query_queue = 0
        query_time = 1
serf_wan:
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 50
        members = 9
        query_queue = 0
        query_time = 1

Description of the Issue (and unexpected/desired result)

We are delegating a subdomain from Bind9 to Consul for service discovery. The datacenter and domain values are configured correctly, and delegation of A records works. However, AAAA lookups for valid services fail. The problem is limited to valid services: they have A records for IPv4 but no AAAA records, because the backing servers currently have no IPv6 addresses. Lookups of invalid services succeed, because Consul returns NXDOMAIN for them.

In troubleshooting on the Bind9 side I see Bind is reporting FORMERR. Note the following log snippet is sanitized:

31-Aug-2017 21:52:14.490 resolver: debug 3: resquery 0x7fa7a8229010 (fctx 0x7fa7a8223010(consul.service.dc.domain/AAAA)): response
31-Aug-2017 21:52:14.490 resolver: debug 10: received packet:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  41509
;; flags: qr aa; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;consul.service.dc.domain. IN AAAA

;; AUTHORITY SECTION:
domain.       0       IN      SOA     ns.domain. postmaster.domain. 1504216334 3600 600 86400 0

31-Aug-2017 21:52:14.490 resolver: debug 3: fctx 0x7fa7a8223010(consul.service.dc.domain/AAAA'): noanswer_response
31-Aug-2017 21:52:14.490 resolver: debug 10: log_ns_ttl: fctx 0x7fa7a8223010: noanswer_response: consul.service.dc.domain (in 'dc.domain'?): 1 30
31-Aug-2017 21:52:14.490 resolver: notice: DNS format error from [internal ip]#53 resolving consul.service.dc.domain/AAAA for client 127.0.0.1#56536: invalid response

From reading about similar issues on the Bind9 user list, I believe this is due to the SOA record being incorrect. See the following post, with the comment "This one fails to return the CNAME to content.sjc1.site.voxcdn.net when the query type is AAAA so you get a unrelated SOA record.": https://groups.google.com/forum/#!topic/comp.protocols.dns.bind/B-9RPmaJdjQ This matches my error: the empty NOERROR response to the AAAA lookup carries a SOA record with ns.domain as the authoritative nameserver, which is wrong.
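To make "empty NOERROR response" concrete, here is a minimal Python sketch that pulls the status and section counts out of dig-style header lines like the ones in the log above. The function names `parse_response_header` and `is_nodata` are made up for illustration; this is not how Bind or Consul classify responses internally.

```python
import re


def parse_response_header(text):
    """Extract status and section counts from dig-style header lines."""
    status = re.search(r"status:\s*(\w+)", text)
    answer = re.search(r"ANSWER:\s*(\d+)", text)
    authority = re.search(r"AUTHORITY:\s*(\d+)", text)
    if not (status and answer and authority):
        raise ValueError("no dig-style header found")
    return {
        "status": status.group(1),
        "answer": int(answer.group(1)),
        "authority": int(authority.group(1)),
    }


def is_nodata(header):
    """NOERROR with an empty answer section: the name exists, the record type does not."""
    return header["status"] == "NOERROR" and header["answer"] == 0


# Header lines copied from the sanitized Bind9 debug log above.
packet = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  41509
;; flags: qr aa; QUESTION: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
"""
header = parse_response_header(packet)
print(header, is_nodata(header))
```

In such a response the single authority record is the SOA, which is the part Bind appears to be objecting to here.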

Looking over the Consul docs, I do not see how to configure the SOA record for the delegated domain in Consul. Based on the docs at https://www.consul.io/docs/agent/options.html#dns_config, I am under the impression that ns.domain and postmaster.domain are hardcoded defaults. PR #1798 was opened to make this record configurable, but the author closed it without it being merged.

This is a nuisance problem. The A record lookup works, but Bind rejects Consul's response to the AAAA query and, having no usable response of its own, passes SERVFAIL to clients that try AAAA first. The clients retry on SERVFAIL until they time out and fall back to the A record, adding about 10s to all API requests to services using Consul DNS in our environment.
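For clients whose code we control, one possible way to sidestep the delay is to ask the resolver for IPv4 addresses only. This is a sketch, not a fix for the underlying problem. It assumes the platform resolver skips the AAAA query when only `AF_INET` is requested, which is typical but not guaranteed. The helper name `resolve_ipv4_only` and the port number are made up for illustration.

```python
import socket


def resolve_ipv4_only(host, port):
    """Return the IPv4 addresses for host, requesting AF_INET results only."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (ip, port) tuple.
    return sorted({info[4][0] for info in infos})


# A numeric host keeps this demo independent of any DNS server.
# The port is an arbitrary placeholder.
print(resolve_ipv4_only("127.0.0.1", 8500))
```

In a real client the first argument would be the Consul DNS name, for example the sanitized consul.service.dc.domain from the log above.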

CVTJNII commented 7 years ago

#1301 may also resolve this: if my understanding is correct, once Consul properly returns records for ns.domain, the SOA in the AAAA response will be valid.

magiconair commented 7 years ago

We made some changes to Consul's SOA and NS responses in 0.9.1. The gist is here: https://github.com/hashicorp/consul/pull/3353#issuecomment-320934137

However, there is a panic in the code that is fixed here: https://github.com/hashicorp/consul/pull/3408 which is only on master right now. The panic is triggered when you query for the SOA record directly. We should have a 0.9.3 release out soon.

Also, I agree that the SOA fields should be configurable and I'm going to pick this up after the config refactoring I'm working on.

pxior commented 5 years ago

We have the same scenario and behaviour as described by @CVTJNII. We are using Bind, and I'm having trouble finding a workaround.

Do you guys have any ideas?

Consul version: v1.3.0

agent:
    check_monitors = 1
    check_ttls = 0
    checks = 6
    services = 1
build:
    prerelease =
    revision = e8757838
    version = 1.3.0
consul:
    bootstrap = false
    known_datacenters = 2
    leader = false
    leader_addr = 10.94.120.18:8300
    server = true
raft:
    applied_index = 5419372
    commit_index = 5419372
    fsm_pending = 0
    last_contact = 40.821696ms
    last_log_index = 5419372
    last_log_term = 1555
    last_snapshot_index = 5409030
    last_snapshot_term = 1555
    latest_configuration = [{Suffrage:Voter ID:8034b686-0ed8-750e-4b29-da34f35efb44 Address:10.94.120.6:8300} {Suffrage:Voter ID:fb05db25-880b-820a-6824-83bb18642377 Address:10.94.120.18:8300} {Suffrage:Voter ID:27f328ce-fc79-026e-c9f2-23055610d461 Address:10.94.120.17:8300}]
    latest_configuration_index = 3639368
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 1555
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 199
    max_procs = 4
    os = linux
    version = go1.11.1
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 155
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 568
    members = 36
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 253
    members = 6
    query_queue = 0
    query_time = 1

olivierHa commented 4 years ago

Same issue with Consul 1.6.2 :(

@CVTJNII, @magiconair, did you find a workaround?

blake commented 4 years ago

Hi @olivierHa,

Would you mind providing a bit more detail about your environment, and the exact error you're seeing in the Consul & DNS server logs?

I was able to successfully configure BIND to forward queries to Consul using standard DNS delegation as well as using a static-stub zone type. It seems subdomain delegation & lookups should work. It would be helpful to have more details about your environment to troubleshoot.

Thanks.

maxadamo commented 3 years ago

The same is happening to me, but only with the alt-domain; the main domain works properly. For instance, the first domain, ha.domain.net, works, and the alt-domain, ha.domain.org, doesn't. The logs show errors on MX and AAAA records:

Mar 23 13:47:16 prod-consul02.domain.net named[32637]: FORMERR resolving 'puppet7.service.ha.domain.org/AAAA/IN': 127.0.0.1#8600
Mar 23 13:47:16 prod-consul02.domain.net named[32637]: FORMERR resolving 'puppet7.service.ha.domain.org/MX/IN': 127.0.0.1#8600

Please let me know how I can help.

freakynl commented 2 years ago

Any progress on this?

We have Consul running on a Linux box with BIND in front. Windows DNS uses this as a conditional forwarder for, among other things, the service.consul domain.

So:
Windows DNS has a conditional forwarder service.consul -> BIND
BIND has a zone service.consul of type forward, forwarding to Consul
Consul provides service.consul resolution

Then we have a Linux box running Docker that needs to pull images via proxy.service.consul.

Now what happens is that the docker host queries for proxy.service.consul. It (nearly instantly) gets an A record, but it also wants AAAA due to v6 stack preference. AAAA doesn't exist however. It attempts to get this twice before falling back. Unfortunately at that time docker has killed the process as it's been waiting for over 20s for any kind of response.

In this situation proxy.service.consul is forwarded to Windows DNS, which forwards to BIND, which forwards to consul.

BIND sees this response:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @a.b.c.d proxy.service.consul aaaa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33198
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;proxy.service.consul. IN AAAA

;; AUTHORITY SECTION:
consul. 0 IN SOA ns.consul. hostmaster.consul. 1644910405 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: a.b.c.d
;; WHEN: di feb 15 08:33:25 CET 2022
;; MSG SIZE rcvd: 99

Which logs these lines in named query logs:

2022-02-15_07:34:43.13423 15-Feb-2022 08:34:43.133 DNS format error from a.b.c.d resolving proxy.service.consul/AAAA for client 127.0.0.1#35439: Name consul (SOA) not subdomain of zone service.consul -- invalid response
2022-02-15_07:34:43.13424 15-Feb-2022 08:34:43.133 FORMERR resolving 'proxy.service.consul/AAAA/IN': a.b.c.d
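The rule named in that log message ("Name consul (SOA) not subdomain of zone service.consul") can be illustrated with a label-wise suffix comparison. This is a rough sketch of the idea only, not BIND's actual code, and the helper names `labels` and `is_at_or_below` are made up:

```python
def labels(name):
    """Split a DNS name into lowercase labels, ignoring the trailing root dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_at_or_below(name, zone):
    """True if name equals zone or is a subdomain of it (label-wise suffix match)."""
    n, z = labels(name), labels(zone)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


# The SOA owner Consul returned vs. the zone BIND forwards.
print(is_at_or_below("consul.", "service.consul."))
# A SOA owned by the forwarded zone itself would pass the same check.
print(is_at_or_below("service.consul.", "service.consul."))
```

With the names from the log, the first check fails because consul. sits above service.consul., which matches the reason given in the FORMERR message.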

Windows DNS gets this response:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @localhost proxy.service.consul aaaa
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 431
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;proxy.service.consul. IN AAAA

;; Query time: 17 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: di feb 15 08:34:43 CET 2022
;; MSG SIZE rcvd: 49

Due to the SERVFAIL and not an empty NOERROR response (on both bind -> consul hosts - they're redundant), Windows DNS at that point falls back to using root hints. That traffic is dropped by firewall causing the huge delays, but it probably wouldn't help much if it were allowed as .consul isn't available externally.

Best solution would be to get BIND to return a NOERROR instead of SERVFAIL imho. Easiest way to get that working seems to be to get a NOERROR back with a SOA pointing to a NS that resolves.

Forwarding the entire consul. zone to consul doesn't help by the way, as it doesn't resolve ns.consul.

Found some other references to this:
https://github.com/hashicorp/consul/pull/1798
https://github.com/hashicorp/consul/issues/1755

Crapshit commented 7 months ago

Any progress on this issue?