Happens to us pretty randomly too (not in any known/reproducible way). docker version:
Client:
Version: 1.10.1
API version: 1.22
Go version: go1.5.3
Git commit: 9e83765
Built: Thu Feb 11 19:27:08 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.1
API version: 1.22
Go version: go1.5.3
Git commit: 9e83765
Built: Thu Feb 11 19:27:08 2016
OS/Arch: linux/amd64
Docker info from the third node of our swarm cluster, the only one that has the problem at the moment.
Containers: 12
Running: 2
Paused: 0
Stopped: 10
Images: 15
Server Version: 1.10.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 297
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: overlay host bridge null
Kernel Version: 4.2.0-27-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67 GiB
Name: host3
ID: V3UE:GFTF:TZ2B:2IOU:Q7YW:FD6A:PXOK:NHUT:V6DT:ZT6T:5H4W:22D2
WARNING: No swap limit support
Cluster store: etcd://<host1>:2379,<host2>:2379,<host3>:2379/_pa
Cluster advertise:
EDIT: I just copy-pasted the commands from #751, but Ubuntu does not have a SysV init script for iptables, so I just restarted docker and the problem is gone now:
[root@<host3> ~]# service docker stop; service iptables restart; service docker start
docker stop/waiting
iptables: unrecognized service
docker start/running, process 13824
Saw this again today, after destroying and re-creating my --driver generic docker machines. No meaningful changes in the docker info output; just the UpdatedAt and Containers counts.
Oh, it now says Server Version: swarm/1.1.2 (Client and server docker version still 1.10.0).
Restarting docker didn't work for me like it did for @arteal, so I ended up rebooting all the hosts again.
Happened to us again. Restarting docker didn't help; stopping docker, flushing iptables, and starting docker did not help either. We had to reboot that machine.
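For reference, the flush sequence we tried was roughly the following (a sketch assuming Upstart-style service commands on Ubuntu 14.04; it still did not clear the stale vxlan state for us):
sudo service docker stop
sudo iptables -F          # flush rules in all chains of the filter table
sudo iptables -t nat -F   # flush the nat table docker uses for masquerading
sudo iptables -X          # delete user-defined chains such as DOCKER
sudo service docker start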
Same as @arteal for me. Unable to identify the root cause, so unable to reproduce on purpose; it seems to happen randomly.
I'll try to get more info next time...
docker version:
Client:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:54:52 2016
OS/Arch: linux/amd64
Server:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:54:52 2016
OS/Arch: linux/amd64
docker info:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 9
Server Version: 1.10.3
Storage Driver: aufs
Root Dir: /data/.graph/var/lib/releases/20160316_1458121754/aufs
Backing Filesystem: extfs
Dirs: 210
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: overlay null host bridge
Kernel Version: 3.16.0-67-generic
Operating System: Ubuntu 14.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.421 GiB
Name: xxxxxxx
ID: YPJJ:UE7N:FYIZ:AOXA:EXVL:7B7I:67U7:YUPK:NOZB:WMLJ:J4YH:GWOV
WARNING: No swap limit support
Labels:
project=xxxxxx
env=test
type=bastion
Cluster store: consul://x.x.x.x:8502/network
Cluster advertise: z.z.z.z:2375
uname -a:
Linux xxxxxxxx 3.16.0-67-generic #87~14.04.1-Ubuntu SMP Fri Mar 11 00:26:02 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Environment details (AWS, VirtualBox, physical, etc.):
Physical
I just experienced this issue.
First I got: 500 Internal Server Error: subnet sandbox join failed for "10.0.0.0/24": vxlan interface creation failed for subnet "10.0.0.0/24": failed in prefunc: failed to set namespace on link "vx-000104-52662": invalid argument
Then after a few tries it switched to saying: 500 Internal Server Error: subnet sandbox join failed for "10.0.4.0/24": error creating vxlan interface: file exists
Doing a "service docker stop; service docker start" helped.
Containers: 34
Running: 27
Paused: 0
Stopped: 7
Images: 35
Server Version: swarm/1.1.3
Role: replica
Primary: 10.42.0.232:4000
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
swarm04: 10.42.0.250:2375
└ Status: Healthy
└ Containers: 12
└ Reserved CPUs: 0 / 8
└ Reserved Memory: 4 GiB / 32.91 GiB
└ Labels: executiondriver=native-0.2, kernelversion=4.4.4-std-3, operatingsystem=Ubuntu 14.04.4 LTS, storagedriver=btrfs
└ Error: (none)
└ UpdatedAt: 2016-03-20T19:04:44Z
swarm05: 10.42.0.232:2375
└ Status: Healthy
└ Containers: 11
└ Reserved CPUs: 0 / 8
└ Reserved Memory: 4 GiB / 32.91 GiB
└ Labels: executiondriver=native-0.2, kernelversion=4.4.4-std-3, operatingsystem=Ubuntu 14.04.4 LTS, storagedriver=btrfs
└ Error: (none)
└ UpdatedAt: 2016-03-20T19:05:10Z
swarm06: 10.42.0.148:2375
└ Status: Healthy
└ Containers: 11
└ Reserved CPUs: 0 / 8
└ Reserved Memory: 4 GiB / 16.4 GiB
└ Labels: executiondriver=native-0.2, kernelversion=4.4.4-std-3, operatingsystem=Ubuntu 14.04.4 LTS, storagedriver=btrfs
└ Error: (none)
└ UpdatedAt: 2016-03-20T19:04:40Z
Plugins:
Volume:
Network:
Kernel Version: 4.4.4-std-3
Operating System: linux
Architecture: amd64
CPUs: 24
Total Memory: 82.22 GiB
Name: swarm04
Client:
Version: 1.10.3
API version: 1.22
Go version: go1.5.3
Git commit: 20f81dd
Built: Thu Mar 10 15:54:52 2016
OS/Arch: linux/amd64
Server:
Version: swarm/1.1.3
API version: 1.22
Go version: go1.5.3
Git commit: 7e9c6bd
Built: Wed Mar 2 00:15:12 UTC 2016
OS/Arch: linux/amd64
I keep having this issue. Now even restarting the docker daemon doesn't help. Please let me know what more information you need in order to find a solution for this.
@larsla invalid argument typically suggests it's a kernel-specific issue. I see that you are using Kernel Version: 4.4.4-std-3; most likely it's to do with that. Please share more details on your environment.
For those hitting the file exists problem, it would be of great help if someone can share some reproduction steps.
Closed via https://github.com/docker/libnetwork/pull/1065, which is vendored into docker/docker.
Can someone try docker/docker master and confirm the fix?
@mavenugo Unfortunately, once the problem has manifested, it's too late. I saw this today after upgrading my swarm hosts to docker 1.10.3, and if I'm reading the PR correctly, it prevents the problem from occurring, but it won't help for me to upgrade docker and try again. So I ended up rebooting the hosts again.
I greatly look forward to the release of 1.11.0, though, so this can hopefully be in the past. :-)
@brettdh I noticed the same. Only a reboot of the host fixes it, but this is not an option for production. The process can take hours in an enterprise IT environment where the vendor (us) has no direct control over the hypervisor.
Same here.
Containers: 55
Running: 0
Paused: 0
Stopped: 55
Images: 55
Server Version: 1.11.0
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 282
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay bridge null host
Kernel Version: 4.2.0-34-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.792 GiB
Name: xxxxx
ID: 2R52:ZXPR:UD56:KTME:JNOW:2X76:Y6LM:3UCU:4PH6:7ONV:PT6X:ULIL
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Username: mvarga
Registry: https://index.docker.io/v1/
Cluster store: consul://xxxx:8500
Cluster advertise: xxxx:2376
This is still occurring on 1.11.1 (the file exists problem, specifically).
And once again today.
ping @mavenugo
@thaJeztah @mavenugo This should be reopened, as it can be reproduced using the script below. The idea is to simulate ungraceful docker daemon shutdowns while containers with a restart policy are connected to an overlay network.
#!/usr/bin/env groovy

// Tail the daemon log in the background and print any line containing "error".
Thread.start {
    watchOutput("error")
}

// Start from a clean slate, create an overlay network, and attach 100
// containers with a restart policy to it.
deleteContainers()
c = "sudo docker network create -d overlay test".execute().text
print c
for (int i = 0; i < 100; i++) {
    def out = "sudo docker run -td --restart unless-stopped --net test --name t$i ubuntu:14.04.2 bash".execute().text
    print out
}
def resp = 'sudo docker ps'.execute().pipeTo("wc -l".execute()).text.trim()
assert resp == "101"   // 100 containers plus the header line

// Repeatedly kill the daemon ungracefully and verify that all containers
// come back once it is respawned.
while (true) {
    println 'killing the daemon'
    killDockerDaemon()
    sleep(5000)
    resp = 'sudo docker ps'.execute().pipeTo("wc -l".execute()).text.trim()
    assert resp == "101"
}

//----------

def killDockerDaemon() {
    def c = ['bash', '-c', "sudo kill -9 \$(sudo service docker status|cut -d ' ' -f4)"].execute().text
    println c
}

def deleteContainers() {
    def c = ['bash', '-c', "sudo docker rm -f \$(sudo docker ps -a -q )"].execute().text
    println c
}

def watchOutput(text, file = "/var/log/upstart/docker.log") {
    Process p = "sudo tail -f $file".execute().pipeTo(["grep", "$text"].execute())
    p.in.eachLine { line ->
        println line
    }
}
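If anyone wants to run the reproduction themselves, a minimal invocation, assuming Groovy is installed, the user has passwordless sudo, and the script is saved as repro.groovy (the filename is just an example):
groovy repro.groovy
The script loops forever killing the daemon; stop it with Ctrl-C and clean up the test containers with docker rm -f $(docker ps -aq) and docker network rm test.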
My env is:
$ sudo docker info
Containers: 100
Running: 0
Paused: 0
Stopped: 100
Images: 377
Server Version: 1.11.2
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay bridge null host
Kernel Version: 4.3.0-040300-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.16 GiB
Name: seb
ID: 3IIK:AWIX:PLOR:BPQ4:XNEL:SSXQ:2GUL:VEKX:OVCQ:SCCX:MN2U:DTWH
Docker Root Dir: /home/seb/hgdata/deployments/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Cluster store: consul://localhost:8500
Cluster advertise: 192.168.123.209:2375
$ sudo docker version
Client:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 21:47:50 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 21:47:50 2016
OS/Arch: linux/amd64
@sebi-hgdata Can you try this same script with 1.12-rc2? We have added fixes in it which should address this. If the problem is still present, then at least there will be some opportunity to fix it completely before the 1.12 release is out.
@mrjana It seems the issue does not reproduce with 1.12-rc2; all containers restart properly during testing... but I see a lot of errors like
time="2016-06-23T23:06:54.858306166+03:00" level=error msg="Failed to start container c0ecaca06cb75c2ff5e0950f5554053b593752f946201a5104c2e0d6c07ba44d: could not add veth pair inside the network sandbox: error setting interface \"vethc1dc8a1\" master to \"br\": Device does not exist"
after letting the script kill the daemon a couple of times, then killing the script, stopping the docker daemon, and starting it up again. After this, no container is restarted, because they all failed with the above error. I can create another issue if necessary.
@sebi-hgdata Thanks for testing. Yes, please create another issue for the new bug you are seeing. We will make sure it is fixed before the 1.12 release.
@mrjana added #1290
I've just experienced it in a Swarm cluster running 1.12.3.
subnet sandbox join failed for \"10.0.2.0/24\": error creating vxlan interface: file exists
It was produced when I did:
docker service rm logspout
docker service create --mode global --name logspout ...
Still seeing this on swarm 1.2.5 + Docker 12.2-rc1
@mavenugo we're also seeing this quite often in PWD (play-with-docker) using 1.13rc1
Had the file exists error with
Client:
Version: 1.13.0-rc2
API version: 1.25
Go version: go1.7.3
Git commit: 1f9b3ef
Built: Wed Nov 23 06:17:45 2016
OS/Arch: linux/amd64
Server:
Version: 1.13.0-rc2
API version: 1.25
Minimum API version: 1.12
Go version: go1.7.3
Git commit: 1f9b3ef
Built: Wed Nov 23 06:17:45 2016
OS/Arch: linux/amd64
Experimental: false
with the docker engine in swarm mode. After restarting the docker service, everything worked.
@Michael-Hamburger / all this will be fixed by https://github.com/docker/libnetwork/pull/1574
I've applied this patch manually and haven't had any issues so far.
I'm running into this problem on my 1.12.3 swarm. Is there a workaround for when it happens? I've tried removing and adding the problematic network(s) without success. Is there something short of a full restart of the swarm hosts that can alleviate this?
Is there something short of a full restart of the swarm hosts that can alleviate this?
Yes, if you restart the daemon the problem should be gone (cc @mavenugo). But if you keep removing/creating networks you'll come across it again.
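For completeness, the daemon-restart workaround is just the following (the exact command depends on the init system; both forms assume the service is named docker):
sudo service docker restart      # Upstart / SysV init (e.g. Ubuntu 14.04)
sudo systemctl restart docker    # systemd (e.g. Ubuntu 16.04+)
Containers attached to the affected overlay network may need to be started again afterwards unless they have a restart policy.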
I saw this now.
Docker version 17.05.0-ce, build 89658be
Restarting the docker daemon is not fixing this.
"starting container failed: subnet sandbox join failed for "10.0.2.0/24": error creating vxlan interface: file exists"
@alexanderkjeldaas can you open a new issue with details, possibly it's different from the one that's resolved by https://github.com/docker/libnetwork/pull/1574
I took the liberty of opening a new thread as I, too, am encountering this issue on 17.05-ce (and have on earlier releases).
See #1765 for info.
@arteal We rebooted our machine; after that, the machine did not boot 😆
Same issue: subnet sandbox join failed for "10.255.0.0/16": error creating vxl...
A similar problem is reproducing now, June 2019:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 18.09.6
Storage Driver: zfs
Zpool: rpool
Zpool Health: ONLINE
Parent Dataset: rpool/ROOT/pve-1
Space Used By Parent: 1047266021376
Space Available: 227269107712
Parent Quota: no
Compression: on
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: oo0dscwkoviwtnvocttkg9932
Is Manager: true
ClusterID: 3gfm66fjmcvn6q9k2mssqwcip
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.16.1.49
Manager Addresses:
172.16.1.49:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.18-9-pve
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.31GiB
Name: centroit-pve1
ID: 6INQ:DFKN:3V2I:XBIZ:VDMR:7HJH:VKEY:OMNY:7YIT:YFK7:KA74:MZLO
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
# iptables-save
# Generated by iptables-save v1.6.0 on Wed Jun 12 16:21:32 2019
*mangle
:PREROUTING ACCEPT [525538:905157812]
:INPUT ACCEPT [492687:891964017]
:FORWARD ACCEPT [162666:115681384]
:OUTPUT ACCEPT [312922:803390219]
:POSTROUTING ACCEPT [475588:919071603]
COMMIT
# Completed on Wed Jun 12 16:21:32 2019
# Generated by iptables-save v1.6.0 on Wed Jun 12 16:21:32 2019
*nat
:PREROUTING ACCEPT [1659:93104]
:INPUT ACCEPT [393:22550]
:OUTPUT ACCEPT [3862:570483]
:POSTROUTING ACCEPT [5128:641037]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Wed Jun 12 16:21:32 2019
# Generated by iptables-save v1.6.0 on Wed Jun 12 16:21:32 2019
*filter
:INPUT ACCEPT [326852:574844648]
:FORWARD ACCEPT [104813:74490157]
:OUTPUT ACCEPT [209763:530418909]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Wed Jun 12 16:21:32 2019
Still happening with docker swarm on Ubuntu 18.04. The solution that worked for me was to remove the stack and redeploy it:
docker stack rm jenkins
docker stack deploy -c docker-compose.yml jenkins
Client:
Debug Mode: false
Server:
Containers: 50
Running: 30
Paused: 0
Stopped: 20
Images: 114
Server Version: 19.03.5
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: sywquid85sgsma3puf4gw6u68
Is Manager: true
ClusterID: zh6x2htbsuevss97ytz2kbkn5
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: true
Root Rotation In Progress: false
Node Address: 51.77.42.145
Manager Addresses:
51.77.42.145:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.0-74-generic
Operating System: Ubuntu 18.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.2GiB
Name: ns3145583
ID: O3MT:IN6V:IFUN:MMQ4:77FH:H7A2:CUUP:3ZIU:3FSS:JKBW:JADU:SKQ3
Docker Root Dir: /home/dockerd
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
This error still occurs occasionally in Docker 20.x; a VM restart is typically required.
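Before falling back to a VM restart, it can be worth checking for a leftover kernel vxlan device. This is only a sketch, assuming the stale interface is visible in the host network namespace (it may instead live in one of the overlay namespaces under /var/run/docker/netns, in which case this will not find it); the interface name below is just an example, taken from the failed in prefunc error earlier in this thread:
ip -d link show type vxlan            # list vxlan devices in the host namespace
sudo ip link delete vx-000104-52662   # remove a stale vx-* device if one is found
Deleting the stale device has sometimes let the sandbox join succeed again without a full reboot.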
(Similar, but closed, issues: #562, #751)
Very occasionally, I see this error message when starting a container in my swarm:
This error persists until I reboot the docker hosts. A comment on #751 suggested that restarting iptables would suffice; I have not tried this yet. I also have tried the solution mentioned in #562 previously, and I believe that worked as well, but I cannot remember for sure.
docker version:
docker info:
Note: this was not captured when the error was occurring. If it happens again, I will comment with the info.
uname -a:
Darwin <hostname> 15.3.0 Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-3248.30.4~1/RELEASE_X86_64 x86_64
Linux <hostname> 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Environment details (AWS, VirtualBox, physical, etc.):
How reproducible: