hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

[bug] vault + consul +dnsmasq = unresolvable ping #3710

Open Justin-DynamicD opened 6 years ago

Justin-DynamicD commented 6 years ago

vault 0.9.0 consul 1.0.0

This behavior is being observed after attempting to emulate the "c1m" or "container 1 million" challenge configuration. I'm posting it under consul as the vault team feels it's a consul problem (I'm not convinced of this as the issue seems unique to vault registration, but see issue #3604)

Setting the Stage

Using Ubuntu 16.04 servers, use packer to deploy servers using the scripts from c1m (or take existing boxes and just run the various .sh scripts). This will result in not only consul being locally installed, but also dnsmasq bound to 127.0.0.1 and redirecting /consul/127.0.0.1#8600

Once done, introduce an HA Vault solution in the same Datacenter (not the same server). This will, of course, result in the "vault" service being registered in Consul.

Demonstrating the Issue

The easiest way to demonstrate this issue is to break things down into steps, so bear with me:

  1. ping randomservice.service.consul <-- success!
  2. ping active.vault.service.consul <-- fails!
  3. ping vault.service.consul <-- fails!
  4. service stop dnsmasq
  5. ping randomservice.service.consul <-- success!
  6. ping active.vault.service.consul <-- success!
  7. ping vault.service.consul <-- fails!
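
The steps above can be scripted. This is a minimal sketch, assuming that ping resolves names through the system resolver (getaddrinfo), while dig and nslookup send their own queries to the DNS server, which would explain why they can disagree. The helper names `resolves` and `check` are made up for illustration; the service names are the ones from the steps above.

```python
import socket

# Names taken from the steps above; swap in services from your own cluster.
NAMES = [
    "randomservice.service.consul",
    "active.vault.service.consul",
    "vault.service.consul",
]


def resolves(name):
    """Return True if the system resolver (the path ping takes) finds an address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def check(names):
    """Resolve every name and report success or failure per name."""
    return {name: resolves(name) for name in names}

# Usage: run check(NAMES) once with dnsmasq running and once with it stopped,
# then compare the two result dictionaries.
```

Comparing the two runs should show the same pattern as the ping results if the resolver path is what differs.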

So the oddity here is that the "active.vault" ping fails when dnsmasq is running but succeeds when it is stopped. It's also worth mentioning that nslookup and dig both work perfectly fine, and that when you dig vault.service.consul, you get CNAMEs instead of A records:

; <<>> DiG 9.10.3-P4-Ubuntu <<>> vault.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13482
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;vault.service.consul.       IN      A

;; ANSWER SECTION:
vault.service.consul. 0      IN      CNAME   lv-pd-sdsc-02[redacted].
vault.service.consul. 0      IN      CNAME   lv-pd-sdsc-01[redacted].
vault.service.consul. 0      IN      CNAME   lv-pd-sdsc-03[redacted].
lv-pd-sdsc-02[redacted]. 1200 IN A   10.10.32.10

;; Query time: 0 msec
;; SERVER: 10.10.12.10#53(10.10.12.10)
;; WHEN: Wed Nov 22 21:06:08 UTC 2017
;; MSG SIZE  rcvd: 172
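
To make the oddity in this answer explicit, here is a small sketch that scans answer-section lines for names that own more than one CNAME. As far as I understand it, RFC 1034 (section 3.6.2) and RFC 2181 (section 10.1) allow only a single CNAME per alias name. The function name `cname_conflicts` is hypothetical, and the parsing assumes the five-column layout dig prints (name, TTL, class, type, data).

```python
from collections import defaultdict


def cname_conflicts(answer_lines):
    """Map each owner name to its CNAME targets when it has more than one."""
    targets = defaultdict(list)
    for line in answer_lines:
        fields = line.split()
        # Assumed dig layout: name, TTL, class, type, data
        if len(fields) >= 5 and fields[3] == "CNAME":
            targets[fields[0]].append(fields[4])
    return {name: found for name, found in targets.items() if len(found) > 1}


# Answer section copied from the dig output above.
ANSWER = [
    "vault.service.consul. 0      IN      CNAME   lv-pd-sdsc-02[redacted].",
    "vault.service.consul. 0      IN      CNAME   lv-pd-sdsc-01[redacted].",
    "vault.service.consul. 0      IN      CNAME   lv-pd-sdsc-03[redacted].",
    "lv-pd-sdsc-02[redacted]. 1200 IN A   10.10.32.10",
]

conflicts = cname_conflicts(ANSWER)
```

Here `conflicts` holds one entry, `vault.service.consul.`, with three CNAME targets.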

I'm asking in consul support because it looks like the culprit might be that vault is somehow registering CNAMEs instead of A records, which is technically "bad" (doesn't follow the RFC). I suspect this is why dnsmasq refuses to resolve "active.vault", as it's a sub-record.

Again, service resolution for anything OTHER THAN vault will work, unless I change the service name of the vault cluster ... then that new name won't work (basically, the vault service fails to ping regardless of name chosen).

Before closing and claiming it's a vault problem (old api calls or some such), please be aware they have already closed this and called it a consul problem. There's some finger pointing going on here ...

Justin-DynamicD commented 6 years ago

Some googling seems to reveal I'm not the only person to notice this behavior:

https://groups.google.com/forum/m/#!topic/consul-tool/IUp5LvUrGDA

Justin-DynamicD commented 6 years ago

Update:

Worked with the vault team and made a discovery: if I set the vault variables to advertise only its IP address to Consul, Consul then appropriately uses A records and all problems are solved. It seems that Consul will indiscriminately return CNAMEs for DNS names even when that violates the DNS RFC rule that you should never return more than a single record in the case of a CNAME.

It seems there should be a behavior update from Consul, especially as forwarders have been introduced with the clear intent to use Consul as a full-purpose DNS server.

slackpad commented 6 years ago

This does look like we need to limit the number of CNAME responses.

codyja commented 6 years ago

We hit this bug as well.

lokesp11 commented 4 years ago

Hello Team,

For us, ping and curl do not work at all for any *.service.consul, though nslookup and dig work fine. Please suggest how I can fix it: https://github.com/hashicorp/consul/issues/7587