jcmcote opened this issue 6 years ago
Related to #1765
@jcmcote, what Docker CE version do you use? I don't see version information here.
I am using 18.02 to reproduce this issue. When I try to reproduce it, I see the error below:

```
failed to create service x_serva: Error response from daemon: network x_mynet not found
Creating service x_serva
```

I think the script needs modification. There is no delay between `docker stack down x` and `docker stack deploy -c docker-stack.yml x`. In general we need to wait until all cleanup is done before redeploying. Log messages like the ones below indicate that cleanup takes time:

```
Feb 20 14:54:06 ELANGO-CE18-2-ubuntu-0 dockerd[21841]: time="2018-02-20T14:54:06.877864376-08:00" level=debug msg="Sending kill signal 15 to container 70060435bcaa63195e5b36051eee7da01c7005676832d9f6747c219acdf08f43"
Feb 20 14:54:08 ELANGO-CE18-2-ubuntu-0 dockerd[21841]: time="2018-02-20T14:54
```
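A minimal sketch of such a wait (my own suggestion, not part of the original script): poll until the previous stack's services and network are gone before redeploying. `wait_for_removal` is a hypothetical helper, and `x_mynet` is the network name from the error above.

```shell
#!/bin/sh
# Sketch: wait for stack-removal cleanup to finish before redeploying.
# wait_for_removal polls until the stack's services and network have both
# disappeared, or gives up after $2 attempts (default 30, one per second).
wait_for_removal() {
  stack="$1"; tries="${2:-30}"; i=0
  while [ "$i" -lt "$tries" ]; do
    if ! docker stack ps "$stack" >/dev/null 2>&1 \
       && ! docker network inspect "${stack}_mynet" >/dev/null 2>&1; then
      return 0   # everything gone, safe to redeploy
    fi
    sleep 1; i=$((i + 1))
  done
  return 1       # cleanup did not finish in time
}

# Usage:
#   docker stack rm x
#   wait_for_removal x 60 && docker stack deploy -c docker-stack.yml x
```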
@mavenugo mentioned in some issue how writing the script the right way avoids these kinds of issues. I am digging through old issues trying to find it. Will update again soon.
@selansen I'm using Docker version 18.02.0-ce, the latest version used by docker-machine with the VirtualBox driver.
When I deploy, there is an error saying the network is not yet created. That's fine: it should fail if it can't deploy just yet (right after a teardown). However, it will eventually deploy with no errors, and you'll be in a state where the overlay networks are not cleaned up correctly.
I have reproduced this issue by deploying aggressively (not waiting for the stack to come down), but my hunch is that this race condition is what we've been experiencing occasionally. Sometimes after an update to our stack (some services or networks changed), we get into a situation where some services can't ping or resolve each other's IP addresses.
I'm hoping someone will be able to use this scenario to explore potential race conditions in the Docker network code that might show up occasionally under normal (less aggressive) situations.
The point is: when we deploy aggressively after a teardown, it reports an error, which again is fine. But then the system thinks all is OK and the deploy returns successfully, leaving the overlay network in an inconsistent state. Why does the system report a successful deploy if it's not ready to deploy?
May I know how long, or how many iterations, it takes for you to get into this state?
I have been running the same script for almost 45 minutes, and I am still able to ping between the two containers.
It does not take too long (about 10 min), but you have to monitor the releases and stop the script as soon as a release is missing. If you don't, the script will bring things down and back up again.
However, if you stop when you see a missing release of the overlay network, you'll notice you can't ping and never will be able to (the two nodes will not have the same overlay network ID).
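That mismatched state can be checked for directly. Here is a sketch (my own helper, not part of the attached scripts; it assumes docker-machine nodes named `manager` and `worker`, and makes the ssh command overridable via `SSH_CMD` for illustration):

```shell
#!/bin/sh
# Sketch: compare the overlay network ID as seen from the two nodes.
# SSH_CMD defaults to "docker-machine ssh"; override it to test locally.
SSH_CMD="${SSH_CMD:-docker-machine ssh}"

check_net_id() {
  net="$1"
  mgr=$($SSH_CMD manager "docker network inspect -f '{{.Id}}' $net")
  wrk=$($SSH_CMD worker "docker network inspect -f '{{.Id}}' $net")
  if [ "$mgr" = "$wrk" ]; then
    echo "ok: both nodes see network ID $mgr"
  else
    echo "MISMATCH: manager=$mgr worker=$wrk"
    return 1
  fi
}

# Usage: check_net_id x_mynet
```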
I've modified the up-and-down script. It now counts the number of releases on the manager and the worker. If the counts are not equal, it stops.
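The counting logic might look roughly like this (my reconstruction, not the actual script; it assumes the debug logging set up earlier in the thread, and uses an overridable `SSH_CMD` so the helpers can be exercised without a cluster):

```shell
#!/bin/sh
# Sketch: count 'releasing IPv4 pools' log lines on each node and
# compare; a diverging count means a release went missing.
SSH_CMD="${SSH_CMD:-docker-machine ssh}"

count_releases() {
  # grep -c prints 0 and exits nonzero when there are no matches,
  # so swallow the exit status and keep the count.
  $SSH_CMD "$1" "grep -c 'releasing IPv4 pools' /var/log/docker.log" || true
}

releases_match() {
  m=$(count_releases manager)
  w=$(count_releases worker)
  [ "$m" = "$w" ]
}

# Usage inside the up-and-down loop:
#   releases_match || { echo "release count mismatch, stopping"; exit 1; }
```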
I'm at 22 iterations and it has not happened yet... It was much easier to reproduce a couple of days ago. I'll keep at it...
Also, I added an init-swarm script which includes the steps I use to create my 2-node swarm cluster.
Having this issue right now. Tried restarting Docker, created a new swarm, re-created the network; the issue still exists.
Using Docker version:

```
$ docker --version
Docker version 17.12.0-ce, build c97c6d6
```

OS: Ubuntu 16.04

```
"Error": "subnet sandbox join failed for \"10.0.0.0/24\": error creating vxlan interface: file exists",
```

I can't even remove the netns file:

```
rm: cannot remove '/var/run/docker/netns/1-bbosggv6eg': Device or resource busy
```
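For what it's worth, that netns entry is a mounted network namespace, which is why plain `rm` reports "Device or resource busy"; unmounting it first usually allows removal. A sketch (run as root; `cleanup_netns` is a hypothetical helper, and this is a workaround for the stale file, not a fix for the underlying bug):

```shell
#!/bin/sh
# Sketch: remove a stale netns entry by unmounting the namespace first.
cleanup_netns() {
  path="$1"
  umount "$path" && rm -f "$path"
}

# Usage (as root): cleanup_netns /var/run/docker/netns/1-bbosggv6eg
```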
Following these steps you can reproduce the issue in a matter of minutes. All you need is to bring up a cluster of 2 nodes.
Create a manager node:

```
docker-machine create --driver virtualbox manager
docker-machine ssh manager
```
Add the debug setting:

```
echo '{ "debug": true }' > /etc/docker/daemon.json
```
Get dockerd to reload the config:

```
kill -HUP $(pidof dockerd)
```
Check the log for releasing of the overlay network:

```
tail -f /var/log/docker.log | grep 'releasing IPv4 pools'
```
Start another terminal and do the steps above for a worker node.
Start another terminal and init the swarm manager:

```
eval $(docker-machine env manager)
docker swarm init --advertise-addr 192.168.99.103
```
Make the worker join the swarm:

```
eval $(docker-machine env worker)
docker swarm join --token SWMTKN-1-2duh1guir5ywynuyz2p4w2 192.168.99.103:2377
```
You should now have a 2-node cluster:

```
eval $(docker-machine env manager)
docker node ls
```
Run this until the worker log indicates it did not release the overlay network as it should:

```
./up-and-down.sh
```
Monitor the nodes' dockerd logs:

```
tail -f /var/log/docker.log | grep 'releasing IPv4 pools'
```
You'll notice both nodes release the overlay network, but sometimes (after a few cycles) the worker node does not release it, and then you're in a state where the two nodes do not use the same overlay network ID. At this point the services are unable to ping each other.
Files needed:

- `up-and-down.sh`: brings the stack up and down
- `ping.sh`: used to ping the other service in the overlay network
- `Dockerfile`: creates an image and puts the `ping.sh` script into it
- `docker-stack.yml`: the services to deploy to the swarm
files.tar.gz
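For readers who don't want to download the attachment, the up-and-down loop presumably looks something like this (my reconstruction from the descriptions above, not the actual script contents; the iteration count and sleep durations are made up for illustration):

```shell
#!/bin/sh
# Sketch of an up-and-down loop: repeatedly deploy and remove the stack
# to provoke the overlay-network release race described in this issue.
up_and_down() {
  n="$1"; i=0
  while [ "$i" -lt "$n" ]; do
    docker stack deploy -c docker-stack.yml x
    sleep 20     # let the services come up
    docker stack rm x
    sleep 10     # give cleanup some time (often not enough, hence the race)
    i=$((i + 1))
  done
}

# Usage: up_and_down 50
```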