Open subchen opened 8 years ago
Hi @subchen the DNS resolution is being done down in here - https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L218-L234. Is it possible that host has multiple IPs registered with DNS and maybe Go is picking a different one to try?
Hi @slackpad, I only added some records in /etc/hosts to resolve the domain/FQDN:
192.168.1.21 node1
192.168.1.22 node2
I don't know whether the DNS resolution library only uses real DNS and skips /etc/hosts.
Will have to dig into Go a little bit to see what it does.
@slackpad We are experiencing the same issue. It seems tcpLookupIP goes to the DNS server directly and bypasses /etc/hosts. IMO this is wrong because you should respect resolv.conf in host lookups. This behavior makes Consul fail to connect to other agents in an Azure environment. (In our case, Azure FQDNs are resolved to public IP addresses by the DNS server, but internal IPs are saved in /etc/hosts. When Consul resolves the FQDN from DNS, it gets a public IP that is firewalled and that Consul is not bound to. Hence, Consul fails to join.)
Is there a true benefit from performing tcpLookupIP instead of net.LookupIP? Can we simply drop that logic and use Go's net package?
@kaskavalci ok this makes sense now. We want to keep the behavior of using TCP to get the largest possible list of hosts, but you are right that it breaks /etc/hosts. I think the best thing here would be to use Go's lookup and then tcpLookupIP and then merge + dedup the lists.
@slackpad hmm, wouldn't that include multiple IP addresses for the same host? Assume the following /etc/hosts file:
127.0.0.1 google.com
With Go's lookup alone we would get only the loopback address, but tcpLookupIP will return Google's address too, which will cause Join errors.
That's true for that example, though you'd get both addresses so the join would still work. Maybe we just need a way to turn off this TCP behavior.
Yes Join will work but with error messages because of this line https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L190 . Maybe errors for the same host will not be printed as long as one working IP is found? Or just go back to Go implementation.
Hi @slackpad, are you OK with using only the Go implementation? I can send a PR for that as well.
We had added the TCP feature in response to folks who needed the full list of servers to join, so I don't think we want to take that away. I think if we changed the code in resolveAddr() to skip this clause - https://github.com/hashicorp/memberlist/blob/master/memberlist.go#L308-L310, and then add a dedup pass, it will work. It's ok if there are join errors as long as any of the joins worked, so the pathological Google example should still be ok.
Sounds OK to me. Is there an ETA for this?
Hi, any news about fixing this issue?
I made the following change myself as an easy fix: https://github.com/kaskavalci/memberlist. Confirmed to work.
I use the following CLI to start a Consul server that joins an existing cluster.
Error logs (the resolved IP address is wrong)
node2 fails to join the node1 cluster. If I change node1 to 192.168.1.21, it works.
My Consul version is 0.5.2.
Also, consul join <FQDN> does not work.