Open iceman91176 opened 7 years ago
Hi @iceman91176, when you do the force-leave, are you doing it by name? A common mistake is to use the address there, but it requires the node name, ceph-dc-1-01-osd-05. That should work to kick it. I suspect that what's happening is that the errors are so high up that memberlist isn't registering this as a failed node; since no node in the cluster can probe it, it never gets marked as suspect/failed, which would be a bug.
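For reference, the by-name form on the command line is `consul force-leave ceph-dc-1-01-osd-05`. The same operation is exposed on the agent HTTP API under `/v1/agent/force-leave/<node>`. The snippet below is only a minimal sketch that builds (and does not send) such a request. The helper name `build_force_leave_request` and the agent URL `http://127.0.0.1:8500` are assumptions for illustration, and no ACL token is included.

```python
import urllib.parse
import urllib.request


def build_force_leave_request(node_name, agent_url="http://127.0.0.1:8500"):
    """Build (but do not send) a force-leave request for a node *name*.

    Hypothetical helper: agent_url is an assumed local agent address.
    """
    # The node name, not its IP address, goes into the URL path.
    path = "/v1/agent/force-leave/" + urllib.parse.quote(node_name, safe="")
    return urllib.request.Request(agent_url.rstrip("/") + path, method="PUT")


req = build_force_leave_request("ceph-dc-1-01-osd-05")
print(req.get_method(), req.full_url)
# To actually send it against a running agent:
# urllib.request.urlopen(req, timeout=5)
```

Passing `10.60.0.36` instead of the node name would build a syntactically valid request, but it would not match the member entry, which is the mistake described above.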
consul version for both Client and Server
Client: 0.8.3
Server: 0.8.3
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
RedHat EL 7.3
Description of the Issue (and unexpected/desired result)
We had to reinstall one of the servers that was running as a Consul client. We did not leave/force-leave the client beforehand. During installation we configured a wrong bind address (IPv4) for the reinstalled agent. Somehow the agent was able to connect to the IPv6 Consul servers anyway. Since then, all of the cluster agents show this wrong cluster member with the state alive. But it isn't alive: all agents are complaining about not being able to connect to the IPv4 address. We are not able to leave/force-leave that wrong host (the bind address has been corrected in the meantime). When trying a force-leave on some other agent B, the faulty agent is stuck in state leaving on agent B (but not on agent C, for example, where it is still alive) until agent B is restarted. Then the faulty agent goes back to alive on agent B.
As a workaround we assigned a different name/id to the faulty node. It was able to join the cluster without problems.
Since the state of the orphaned agent is 'alive', I don't think it will be reaped automatically. Any chance to remove it manually?
Reproduction steps
Log Fragments or Link to gist
Memberlist (the first entry is wrong and definitely not alive):
ceph-dc-1-01-osd-05 10.60.0.36:8301 alive client 0.8.3 2 dc-witcom-cloud
ceph-dc-1-01-osd-05-temp-02 [2aXY:1f08:900:1::ZZZ]:8301 alive client 0.8.3 2 dc-witcom-cloud
Log from same host:
2017/07/11 23:04:54 [ERR] memberlist: Failed to send ping: write udp [2aXY:1f08:900:1::ZZY]:8301->10.60.0.36:8301: sendto: network is unreachable
2017/07/11 23:04:58 [ERR] memberlist: Push/Pull with ceph-dc-1-01-osd-05 failed: dial tcp 10.60.0.36:8301: getsockopt: connection refused
2017/07/11 23:05:03 [ERR] memberlist: Failed to send ping: write udp [2aXY:1f08:900:1::ZZY]:8301->10.60.0.36:8301: sendto: network is unreachable
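To make the mismatch between the member list and the log explicit, here is a rough sketch that cross-references the two: it counts memberlist errors per destination address and reports members that are still listed as alive despite those errors. The sample data is copied from this report. The function name `unreachable_alive_members` and the regular expression are illustrative assumptions based only on the three log lines shown, not on a full specification of the log format.

```python
import re
from collections import Counter

MEMBERS = """\
ceph-dc-1-01-osd-05 10.60.0.36:8301 alive client 0.8.3 2 dc-witcom-cloud
ceph-dc-1-01-osd-05-temp-02 [2aXY:1f08:900:1::ZZZ]:8301 alive client 0.8.3 2 dc-witcom-cloud
"""

LOG = """\
2017/07/11 23:04:54 [ERR] memberlist: Failed to send ping: write udp [2aXY:1f08:900:1::ZZY]:8301->10.60.0.36:8301: sendto: network is unreachable
2017/07/11 23:04:58 [ERR] memberlist: Push/Pull with ceph-dc-1-01-osd-05 failed: dial tcp 10.60.0.36:8301: getsockopt: connection refused
2017/07/11 23:05:03 [ERR] memberlist: Failed to send ping: write udp [2aXY:1f08:900:1::ZZY]:8301->10.60.0.36:8301: sendto: network is unreachable
"""

# Destination of a failed ping ("src->dst: ...") or of a failed TCP dial.
TARGET = re.compile(r"(?:->|dial tcp )(\S+?):\s")


def unreachable_alive_members(members_text, log_text):
    """Return (name, address, error_count) for alive members seen in errors."""
    errors = Counter()
    for line in log_text.splitlines():
        if "[ERR] memberlist:" not in line:
            continue
        match = TARGET.search(line)
        if match:
            errors[match.group(1)] += 1

    suspects = []
    for line in members_text.splitlines():
        fields = line.split()
        # Expected columns: name, address, status, ...
        if len(fields) >= 3 and fields[2] == "alive" and errors[fields[1]]:
            suspects.append((fields[0], fields[1], errors[fields[1]]))
    return suspects


suspects = unreachable_alive_members(MEMBERS, LOG)
print(suspects)
```

On the data above this flags only ceph-dc-1-01-osd-05 at 10.60.0.36:8301, the IPv4 entry that every agent fails to reach.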