hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Cannot resolve domain/FQDN while joining an existing cluster #1507

Open subchen opened 8 years ago

subchen commented 8 years ago

I use the following commands to start a Consul server and join an existing cluster.

$ ping node1
PING node1 (192.168.1.21): 56 data bytes
64 bytes from 192.168.1.21: icmp_seq=0 ttl=64 time=0.053 ms
64 bytes from 192.168.1.21: icmp_seq=0 ttl=64 time=0.053 ms
...

$ consul agent -server -node node2 -retry-join node1

Error logs (the resolved IP address is wrong):

2015/12/14 06:45:23 [INFO] agent: (LAN) joined: 0 Err: dial tcp 220.250.64.225:8301: i/o timeout
2015/12/14 06:45:23 [WARN] agent: Join failed: dial tcp 220.250.64.225:8301: i/o timeout, retrying in 30s

So node2 fails to join the node1 cluster. If I change node1 to 192.168.1.21, it works.

$ consul agent -server -node node2 -retry-join 192.168.1.21

My consul version is 0.5.2

Also, consul join <FQDN> does not work.

slackpad commented 8 years ago

Hi @subchen, the DNS resolution is being done down in here: https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L218-L234. Is it possible that the host has multiple IPs registered with DNS and Go is picking a different one to try?

subchen commented 8 years ago

Hi @slackpad, I only added some records in /etc/hosts to resolve the domain/FQDN.

192.168.1.21    node1
192.168.1.22    node2

I don't know whether the DNS resolution library only queries real DNS and skips /etc/hosts.

slackpad commented 8 years ago

Will have to dig into Go a little bit to see what it does.

kaskavalci commented 7 years ago

@slackpad We are experiencing the same issue. It seems tcpLookupIP goes to the DNS server directly and bypasses /etc/hosts. IMO this is wrong because you should respect resolv.conf in host lookups. This behavior makes Consul fail to connect to other agents in an Azure environment. (In our case, Azure FQDNs are resolved to public IP addresses by the DNS server, while the internal IPs are stored in /etc/hosts. When Consul resolves an FQDN from DNS, it gets a public IP that is firewalled and that Consul is not bound to. Hence, Consul fails to join.)

Is there a real benefit to performing tcpLookupIP instead of net.LookupIP? Can we simply drop that logic and use Go's net package?

slackpad commented 7 years ago

@kaskavalci ok this makes sense now. We want to keep the behavior of using TCP to get the largest possible list of hosts, but you are right that it breaks /etc/hosts. I think the best thing here would be to use Go's lookup and then tcpLookupIP and then merge + dedup the lists.

kaskavalci commented 7 years ago

@slackpad hmm, wouldn't that include multiple IP addresses for the same host? Assume the following /etc/hosts file:

127.0.0.1 google.com

With Go's lookup alone we would get only the loopback address, but tcpLookupIP will also return Google's real address, which will cause Join errors.

slackpad commented 7 years ago

That's true for that example, though you'd get both addresses so the join would still work. Maybe we just need a way to turn off this TCP behavior.

kaskavalci commented 7 years ago

Yes, Join will work, but with error messages because of this line: https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L190 . Maybe errors for the same host should not be printed as long as one working IP is found? Or we could just go back to the Go implementation.

kaskavalci commented 7 years ago

Hi @slackpad, are you OK with using only the Go implementation? I can send a PR for that as well.

slackpad commented 7 years ago

We added the TCP feature in response to folks who needed the full list of servers to join, so I don't think we want to take that away. I think if we changed the code in resolveAddr() to skip this clause (https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L308-L310) and then added a dedup pass, it would work. It's OK if there are join errors as long as any of the joins worked, so the pathological Google example should still be OK.

kaskavalci commented 7 years ago

Sounds OK to me. Is there an ETA for this?

AlexLov commented 7 years ago

Hi, any news about fixing this issue?

kaskavalci commented 7 years ago

I made the following change myself as an easy fix: https://github.com/kaskavalci/memberlist. Confirmed to work.