tonistiigi opened this issue 8 years ago
@tonistiigi You launched the manager using the same state directory but changed the port. In this case you should use `swarmctl manager rm` to remove the reference to the old manager (with the old port) before joining new members. Otherwise you'll end up with no leader: when you join back with a different port, the node sends its vote to what looks like another member, leaving the cluster stuck.
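A rough sketch of that workaround as a CLI session (the node ID is a placeholder, and the exact argument form of `swarmctl manager rm` may differ; check `swarmctl manager rm --help`):

```shell
# List managers; the stale entry still shows the old port.
swarmctl manager ls

# Remove the stale manager entry so the re-joined manager
# with the new port can receive votes again.
swarmctl manager rm <old-manager-id>
```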
/cc @aaronlehmann
We can maybe detect if we are restarting a manager with a different address while still pointing to the same state directory and handle that case by forcing a ConfChange remove on the old address/member registered? WDYT?
I don't think we should ever force configuration changes except when the user explicitly asks us to with `--force-new-cluster`. Maybe we could just record the original address/port used for that state directory, and refuse to start with a different one.
Managers shouldn't change addresses or ports, since the raft cluster needs to know where to reach them. It's possible to use hostnames instead of IPs though, which may be useful in some situations.
Most people will probably use the default 0.0.0.0, meaning that this happens automatically if the IP should change?
We print a warning recommending that people not use the default: https://github.com/docker/swarm-v2/blob/master/cmd/swarmd/manager.go#L28
There isn't really a good solution to this. If you are part of a raft cluster, other members need to know how to reach you. We make a best effort guess of the IP address if you don't specify one, but it's better to specify an IP or hostname that you know is stable. (And if you don't have a stable IP or hostname, you shouldn't be part of a multinode raft cluster anyway).
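For example, with the Docker engine integration this stable address can be pinned explicitly using the documented `docker swarm init` flags (the hostname here is a placeholder):

```shell
# Listen on all interfaces, but advertise a stable hostname
# so other raft members always know how to reach this manager.
docker swarm init --listen-addr 0.0.0.0:2377 --advertise-addr manager1.example.com:2377
```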
@aaronlehmann Is this still relevant given the listen/advertise improvements of 1.12?
Also, there's a second issue in this ticket: changing advertised address. Should we log an issue for that? Should we attempt a fix for 1.13 or is that low priority?
I think it is still relevant.
Changing the advertise address is important to support but nontrivial. It probably needs more discussion before we choose a milestone.
Related to #2198.
I recently had this issue with a cluster of Raspberry Pi's I wanted to move somewhat smoothly between networks. I found that adding iptables rules redirecting connections destined to the old manager addresses via NAT to the current addresses would get the swarm online, after which manager nodes can be removed and re-added one by one to update the raft addresses and such.
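The iptables trick described above can be sketched roughly as follows; the old/new addresses are placeholders, and this is a simplified version of what the script automates, not its exact rules:

```shell
OLD=192.168.1.10   # manager's address on the old network (placeholder)
NEW=10.0.0.10      # the same manager's address on the new network (placeholder)

# Redirect locally originated traffic destined for the old address to the new one.
iptables -t nat -A OUTPUT -d "$OLD" -p tcp -j DNAT --to-destination "$NEW"

# Once the swarm is reachable again and managers have been removed and
# re-added one by one (so raft records the new addresses), drop the rule.
iptables -t nat -D OUTPUT -d "$OLD" -p tcp -j DNAT --to-destination "$NEW"
```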
I kept fiddling around with this and came up with a script to do it automatically, and then proceeded to wrap the whole catastrophe in a Docker container.
It's far from perfect, but it should keep any cluster running on Debian, or a derivative like Raspbian, intact shortly after changing all the IP addresses.
In short, the script:
- runs on all manager nodes,
- generates one SSH key per manager and distributes them to the other nodes using services,
- relies on Avahi to find the new addresses,
- adds iptables NAT rules to get the swarm online,
- uses temporary services as distributed locks to re-join nodes one by one,
- removes the iptables rules when they're no longer required,
- proceeds to re-invite any non-privileged workers,
- and then sits happily watching for swarm changes.
Feel free to fiddle around with it, draw inspiration or whatnot, and if you make improvements or other customizations, submit a pull request and I'll happily include it.
Oh, and there's no reason to tell me this is insane, or stupid. I know. I happened to have a hammer, and the problem looked like a nail.
EDIT: I was waffling on too long to include the links... https://github.com/b01t/swarm-glue https://hub.docker.com/r/b01t/swarm-glue
Stopping the manager and starting it again with a different port still shows the old address in `swarmctl manager ls`. Joining a second manager to that node shows it as reachable, but it can't be used (probably because the connection is only one-way). After switching back the address:
@abronan