omerh closed this issue 6 years ago.
It seems that order: start-first in the stack deploy causes the issue. I ran a simple update:
deploy@swarm-as-prod-01:~/swarm$ sudo docker service update production_bb_service --image service:v1.192 --with-registry-auth --detach=false
production_bb-service
overall progress: rolling back update: 2 out of 3 tasks
1/3: Failed joining production_bb_default-endpoint to sandbox production_bb_def…
2/3: running [> ]
3/3: running [> ]
rollback: update rolled back due to failure or early termination of task 091qpjl7vq7vukfgrpfl5b2rk
service rollback paused: update paused due to failure or early termination of task px7jm8rdtpg0f3g63nyf6wmml
After removing order: start-first and defaulting to stop-first, stack deploy and docker service update started to work.
ping @aaronlehmann @dnephin PTAL
The names in the error message don't seem to match the names in your Compose file; I guess you changed them?
Seems like more of an issue with networking/swarmkit, don't think it's related to stacks.
Yes, I deleted them; you can ignore them. And yes, not related to stacks.
It happens to a service whose update config order was set to start-first, on both stack deploy and service update.
Hi, I'm seeing the same issue when not using order: start-first. It seems like this could be related to something truncating names (specifically, something truncating a value it thinks is an ID to 12 characters, but it's not actually an ID, so we're getting collisions). In the initial report above, note the error message:
"Failed joining staging-endpoint to sandbox staging_backend_sbox: container staging_backend_sbox: endpoint create on GW Network failed: endpoint with name gateway_staging_back already exists in network docker_gwbridge"
It looks like "staging_backend_sbox" was truncated to "staging_back" (12 characters). In my case, when I do a "network inspect docker_gwbridge" I end up with similarly truncated results in 17.10 (the 12-character names being prod-swarm_d and prod-swarm_t in the output below). This is causing tasks to throw errors for me similar to the one above when there are other networks (containers?) with similar names... I don't have a full example of a config and error message that's causing problems at this point, because we've rolled back to 17.09 (which does not have those truncated names; it only shows "ingress-sbox" as a non-ID container name). Hopefully this is helpful enough for folks who understand how these names are generated, and what may have changed between 17.09 and 17.10, to investigate.
[
{
"Name": "docker_gwbridge",
"Id": "c9edd6bab9f1f74be2bf77b36a5b5f0cb0da287ad6493667b04eb4944a83a221",
"Created": "2017-10-05T21:37:13.969399402Z",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.18.0.0/16",
"Gateway": "172.18.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"2b4ec0cf5755e095b69e250540cf98d702fdbabcbbf245130fc313bc23ea731b": {
"Name": "gateway_2b4ec0cf5755",
"EndpointID": "426e729258f2f1180a70f0bfe8dd087b02d6f2e5db501e0a5739fa24d9184984",
"MacAddress": "02:42:ac:12:00:08",
"IPv4Address": "172.18.0.8/16",
"IPv6Address": ""
},
"3e26fadfbec8b336b7e3863945243b27d6d1557877a0cba3483ea5aff56225dc": {
"Name": "gateway_3e26fadfbec8",
"EndpointID": "7a29b4238693e3ee40c74185e003c55c5a1535d753ddb1009e3af682c8aa5570",
"MacAddress": "02:42:ac:12:00:05",
"IPv4Address": "172.18.0.5/16",
"IPv6Address": ""
},
"8b082b47dd6cf5eb87aacc8320bb67fca2e2da8f6faeb303831bdee7d00c384c": {
"Name": "gateway_8b082b47dd6c",
"EndpointID": "87264c87a6825b8b5832879d3084cc21cbcab8f9fc9320ff5b97788c65a41d9a",
"MacAddress": "02:42:ac:12:00:07",
"IPv4Address": "172.18.0.7/16",
"IPv6Address": ""
},
"dd051896127ba0cf4b8da85fc505e697b953e047f32ad733e22f945403f8e69e": {
"Name": "gateway_dd051896127b",
"EndpointID": "2008f75d6a3c8aa25e69386720842d020e1bbc285bfd873480c994660aa34b6b",
"MacAddress": "02:42:ac:12:00:04",
"IPv4Address": "172.18.0.4/16",
"IPv6Address": ""
},
"ingress-sbox": {
"Name": "gateway_ingress-sbox",
"EndpointID": "0929b42a8ccd906a197786142d6a3edefa818dcfd88394ca6b4ac87d748a83d3",
"MacAddress": "02:42:ac:12:00:02",
"IPv4Address": "172.18.0.2/16",
"IPv6Address": ""
},
"prod-swarm_default-sbox": {
"Name": "gateway_prod-swarm_d",
"EndpointID": "e6e84cd23010772e60cf7b0c33848c1e4cc0573200987fc0c5697a00d34d5bf6",
"MacAddress": "02:42:ac:12:00:03",
"IPv4Address": "172.18.0.3/16",
"IPv6Address": ""
},
"prod-swarm_traefikext-sbox": {
"Name": "gateway_prod-swarm_t",
"EndpointID": "5f02e77af1afde21aacd0f96ae2736ca11cfdcff121d057dd7171ca88ebe9057",
"MacAddress": "02:42:ac:12:00:06",
"IPv4Address": "172.18.0.6/16",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.bridge.enable_icc": "false",
"com.docker.network.bridge.enable_ip_masquerade": "true",
"com.docker.network.bridge.name": "docker_gwbridge"
},
"Labels": {}
}
]
Just thinking... might this be related to our containers running without a STOPSIGNAL? The engine kills the container after the grace period fails, so it restarts the container, and thus the stack deploy fails.
@mcodd, do you have a STOPSIGNAL in your Dockerfile?
We do not have STOPSIGNAL in our Dockerfile... For the tasks failing to start, the error message really suggests something at the networking layer. I don't know enough swarm internals to understand why those -sbox containers are being spawned on the docker_gwbridge network, but their names are obviously colliding with similarly named things already on the network...
I see the gwEPlen constant set to 12 in https://github.com/docker/libnetwork/blob/master/default_gateway.go, and I think that's what's producing the truncated Names in my network inspect above. There also seems to be an implicit assumption that these things will be named with UUID-type IDs rather than the names I'm using for my networks... Again, this is different in 17.10; 17.09 doesn't show this issue at all in my environment.
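The collision described above can be illustrated without Docker at all. This is a sketch only, assuming gwEPlen-style truncation of sandbox names to 12 characters; the first two names come from the inspect output above, and "staging_backup_sbox" is a hypothetical second sandbox added to show a clash:

```shell
# Illustrative only: mimic a 12-character truncation of sandbox names,
# as the gwEPlen constant in libnetwork's default_gateway.go suggests.
truncate12() { printf '%.12s\n' "$1"; }

truncate12 "prod-swarm_default-sbox"     # -> prod-swarm_d
truncate12 "prod-swarm_traefikext-sbox"  # -> prod-swarm_t

# Two sandboxes whose names share their first 12 characters truncate to
# the same gateway endpoint name -- the collision in the error above:
truncate12 "staging_backend_sbox"        # -> staging_back
truncate12 "staging_backup_sbox"         # -> staging_back (hypothetical)
```

With UUID-type IDs a 12-character prefix is effectively unique, which would explain why the truncation only bites when human-readable names are used instead.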
I've been seeing this happening lately as well
docker version
Client:
Version: 17.10.0-ce
API version: 1.33
Go version: go1.8.3
Git commit: f4ffd25
Built: Tue Oct 17 19:05:05 2017
OS/Arch: linux/amd64
Server:
Version: 17.10.0-ce
API version: 1.33 (minimum version 1.12)
Go version: go1.8.3
Git commit: f4ffd25
Built: Tue Oct 17 19:03:46 2017
OS/Arch: linux/amd64
Experimental: true
and yeah, we're testing it using some network names longer than 12 characters:
docker network ls
NETWORK ID NAME DRIVER SCOPE
6ea857481951 bridge bridge local
a7a12hv1g9gi com-somethin-network-256187400872361471 overlay swarm
cpljozi32k4k com-somethin-network-256203857689669114 overlay swarm
qozdck9ltsbg com-somethin-network-256214232243926931 overlay swarm
lu2et741iimz com-somethin-network-256216521800418289 overlay swarm
k8a6seby57qh com-somethin-network-256219163851389904 overlay swarm
ruz80ohxo7zz com-somethin-network-256225798573509330 overlay swarm
0yx5uf8dt143 com-somethin-network-256226274643705108 overlay swarm
kx04bzu88ezm com-somethin-network-256227170786275798 overlay swarm
mdbw2xsyhk0j com-somethin-network-256227215782655894 overlay swarm
v447udw8f6gw com-somethin-network-256229288461242460 overlay swarm
o6ruzir8f3hg com-somethin-network-256232319955037343 overlay swarm
w1r347rjcb3w com-somethin-network-256232538421673948 overlay swarm
wlycyfhwo593 com-somethin-network-256233842189380880 overlay swarm
5lu6w9l7n5kn com-somethin-network-256234462216996878 overlay swarm
fwu9mkc043x0 com-somethin-network-256239854449434693 overlay swarm
2ca3cdb78dc7 docker_gwbridge bridge local
44c3f7593757 host host local
r6ligfm7chw9 ingress overlay swarm
ccb59f75d8a2 none null local
Docker 17.11
Rejected:
Failed joining xxxxxx-dev-yyyyyy-indexer-elastic-endpoint to sandbox xxxxxx-dev-yyyyyy-indexer-elastic-sbox: container xxxxxx-dev-yyyyyy-indexer-elastic-sbox: endpoint create on GW Network failed: endpoint with name gateway_xxxxxx-dev-s already exists in network docker_gwbridge
I'm trying to rename my networks to shorter names as a workaround, but it is not easy: I use constants to concatenate the network names, and it seems I have only one character left before I'd have to restructure the whole deploy system...
Additional information: this problem does not always occur when the truncated network names collide. Recreating the network helps.
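As a quick way to spot names at risk before deploying, one could check whether any network names share a 12-character prefix. This is a sketch under the truncation assumption discussed above; in practice you would pipe `docker network ls --format '{{.Name}}'` into the pipeline instead of the hard-coded sample names (taken from the network list above):

```shell
# Print any 12-character prefix that occurs more than once among the
# given network names -- such names are candidates for the collision.
# In practice, feed this from: docker network ls --format '{{.Name}}'
printf '%s\n' \
  "com-somethin-network-256187400872361471" \
  "com-somethin-network-256203857689669114" \
  "docker_gwbridge" \
  "ingress" \
  | cut -c1-12 | sort | uniq -d
# -> com-somethin
```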
I'm curious whether any progress has been made on this issue. I see that #35310 could have been related, but @DBLaci's comment suggests the truncation at 12 characters is still occurring. This is currently keeping us from moving off 17.09 to versions that could fix some other network-related issues we're seeing. Trying to understand the nature of the problem a little better: it seems like the containers on the docker_gwbridge network are given keys in 17.10+ that are no longer long UUID-type strings, but instead names related to services, truncated at 12 characters? I'm surprised more folks aren't hitting this, and I wonder if there's something unique in my environment that means I'd see it when others don't. I can't imagine that everyone is using really short service names?
ping @pradipd @msabansal @madhanrm @mavenugo PTAL ^^
I'm also running into this issue with 17.10.0-ce. Is there a workaround? I haven't been able to figure out why it affects some services/networks and not others.
Failed joining wte-93-faas_perimeter-endpoint to sandbox wte-93-faas_perimeter-sbox: container wte-93-faas_perimeter-sbox: endpoint create on GW Network failed: endpoint with name gateway_wte-93-faas_ already exists in network docker_gwbridge
ping @pradipd has this been addressed?
I think so, but I'm not 100% certain, because I can't tell whether there are multiple issues in this thread or they are all the same. The issue with the truncation to 12 characters that @mcodd describes above (https://github.com/moby/moby/issues/35281#issuecomment-339984017) should be addressed by https://github.com/moby/moby/pull/35422. My understanding is that the fix should be in 17.12.
If there are other issues beyond the truncation, then we will have to investigate those issues.
I am working on this issue. I tested 17.11 and reproduced it. If it is already fixed in 17.12, I will move on to a different issue.
If you could give me the repro steps, I'll validate it is fixed in 17.12
I have the setup now. Will try it with 17.12 and let you know.
I have tested the same scenario with the 17.12 image and the problem doesn't happen in 17.12. Looks like this issue was caused by a regression, and @pradipd's commit fixed it in 17.12. We are good to close this issue. The output below from the 17.12 image shows that the extra sandbox creation which caused the issue (17.10 onwards) is not present in 17.12.
docker@ELANGO-CE-EDGE-ubuntu-0:~$ docker network inspect -v docker_gwbridge
[
    {
        "Name": "docker_gwbridge",
        "Id": "ce7e6941145345011d1c07f776b9657430897cb16c26ee16c84aa5d045bf5f51",
        "Created": "2017-12-08T10:04:00.39018803-08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "2deb67dcd42cfe7464cb0f8d2e66ec3d0a70053aaa1e33f718a8ac89c1a4a0f3": {
                "Name": "gateway_2deb67dcd42c",
                "EndpointID": "f88e2379020d38da16435120c28e141feb74187ccd4c3f87cf6ad26dba5afdaf",
                "MacAddress": "02:42:ac:12:00:04",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            },
            "82ee91db89bfd0fd35f1b1cefc381517ab0683d34ad83e6da736ff0b7015b0e0": {
                "Name": "gateway_82ee91db89bf",
                "EndpointID": "36fc341c5433c199d91b4163adb4a3e4288de609e8d9885df72028817a810fc0",
                "MacAddress": "02:42:ac:12:00:05",
                "IPv4Address": "172.18.0.5/16",
                "IPv6Address": ""
            },
            "96032523a574afe4c51effdfeb92fcacfffc77ad5bc5099c253edd14ab284de4": {
                "Name": "gateway_96032523a574",
                "EndpointID": "2637a5b1222296967d609fb241d1ebe65cfc09826490829665cb2328dde6e316",
                "MacAddress": "02:42:ac:12:00:06",
                "IPv4Address": "172.18.0.6/16",
                "IPv6Address": ""
            },
            "f7bafd5b6e5487a34b248e69413771d0cf8b4102316a41b0e25139e766163843": {
                "Name": "gateway_f7bafd5b6e54",
                "EndpointID": "fa90f22b9609a4088882739cd4513715ae74ec47d44f342958fc2b2dd7160b34",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "ingress-sbox": {
                "Name": "gateway_ingress-sbox",
                "EndpointID": "d45b2d18dea148be00ad416e3ab64fc690df4a97b6ce240076759226f6924d6d",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_icc": "false",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.name": "docker_gwbridge"
        },
        "Labels": {}
    }
]
Thanks for confirming the fix!
I am on docker 17.10-ce doing stack deploy.
Our stack deploy:
Sometimes, with no reliable way to reproduce it, stack deploy is marked as a failure with this error message:
On the swarm manager there are no errors in the logs. This is the info:
On the worker node, this is the only thing I see:
This is the info on the worker: