moby / moby

Stack deploy failed with rejected status endpoint create on GW Network failed: endpoint with name gateway_stack_name already exists in network docker_gwbridge #35281

Closed omerh closed 6 years ago

omerh commented 7 years ago

I am running Docker 17.10.0-ce and deploying with docker stack deploy.

Our stack file:

version: '3.4'

services:
  poc-web:
    image: web-service-poc
    networks:
      - poc-web
    ports:
      - "8791:8791"
    deploy:
      replicas: 1
      placement:
        constraints: [engine.labels.staging == backend]
      restart_policy:
        condition: any
        delay: 5s
      update_config:
        parallelism: 1
        failure_action: rollback
        delay: 5s
        order: start-first
    healthcheck:
      test: ["CMD-SHELL", "wget -q -O - http://localhost:8791/healthcheck | grep OK"]
      interval: 30s
      timeout: 10s
      retries: 2
      start_period: 40s

networks:
  poc-web:

Sometimes, with no reliable way to reproduce it, a stack deploy is marked as failed with this error message:

vlyym3nv74nbfw2sw7y2s1vug    \_ staging_backend-service.1   service-:v1.192@sha256:2fca677757b295ae2d2a8f4b616ef8f07dfb5267a72abe3c9edf8834762e2feb   ip-10-30-2-39       Shutdown            Rejected 3 hours ago      "Failed joining staging-endpoint to sandbox staging_backend_sbox: container staging_backend_sbox: endpoint create on GW Network failed: endpoint with name gateway_staging_back already exists in network docker_gwbridge"

The swarm manager shows no errors in its logs. Here is its docker version and docker info output:

Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:04:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:02:56 2017
 OS/Arch:      linux/amd64
 Experimental: false
Containers: 6
 Running: 3
 Paused: 0
 Stopped: 3
Images: 4
Server Version: 17.10.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: umt8pt2n5ofn46zbri9a5wvpu
 Is Manager: true
 ClusterID: uhzhyo7gk7donziklx84cuiyz
 Managers: 1
 Nodes: 22
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.30.1.88
 Manager Addresses:
  10.30.1.88:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1022-aws
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: swarm-stg-01.naturalint.com
ID: X253:UUOZ:343J:UFOY:OO74:OO4W:WUES:MIXY:YGA3:VVY2:QISV:DEDP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

On the worker node, this is the only thing I see in the logs:

Oct 24 10:28:38 ip-10-30-3-172 dockerd[9549]: time="2017-10-24T10:28:38.809205816Z" level=info msg="NetworkDB stats ip-10-30-3-172(702a7b3ef9fb) - netID:b3toaouf7cexem007vz21j2ti leaving:false netPeers:22 entries:92 Queue qLen:0 netMsg/s:0"
Oct 24 10:28:38 ip-10-30-3-172 dockerd[9549]: time="2017-10-24T10:28:38.809300345Z" level=info msg="NetworkDB stats ip-10-30-3-172(702a7b3ef9fb) - netID:zhrycnbdt53jofkfzac8iveub leaving:true netPeers:2 entries:3 Queue qLen:0 netMsg/s:0"
Oct 24 10:28:38 ip-10-30-3-172 dockerd[9549]: time="2017-10-24T10:28:38.809328126Z" level=info msg="NetworkDB stats ip-10-30-3-172(702a7b3ef9fb) - netID:jpnm7m44qrmrn9dmtn03eh5np leaving:false netPeers:22 entries:148 Queue qLen:0 netMsg/s:0"
Oct 24 10:28:38 ip-10-30-3-172 dockerd[9549]: time="2017-10-24T10:28:38.809351545Z" level=info msg="NetworkDB stats ip-10-30-3-172(702a7b3ef9fb) - netID:ioyje49xpwlphng25relbwxly leaving:false netPeers:15 entries:114 Queue qLen:0 netMsg/s:0"

This is the info on the worker:

root@ip-10-30-3-172:~# docker version
Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:04:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:02:56 2017
 OS/Arch:      linux/amd64
 Experimental: false
root@ip-10-30-3-172:~# docker info
Containers: 7
 Running: 4
 Paused: 0
 Stopped: 3
Images: 11
Server Version: 17.10.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: tc18wb549eycwa7e8hz9ytwox
 Is Manager: false
 Node Address: 10.30.3.172
 Manager Addresses:
  10.30.1.88:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1022-aws
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.795GiB
Name: ip-10-30-3-172
ID: JYSX:SM4Y:XMUE:GB6Y:J4X3:L4GL:2X6S:K57M:QWDO:ZJXY:RPS2:CSAF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 staging=backend
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

omerh commented 7 years ago

It seems that order: start-first in the stack file causes the issue. I ran a simple service update:

deploy@swarm-as-prod-01:~/swarm$ sudo docker service update production_bb_service --image service:v1.192 --with-registry-auth --detach=false
production_bb-service
overall progress: rolling back update: 2 out of 3 tasks
1/3: Failed joining production_bb_default-endpoint to sandbox production_bb_def…
2/3: running   [>                                                  ]
3/3: running   [>                                                  ]
rollback: update rolled back due to failure or early termination of task 091qpjl7vq7vukfgrpfl5b2rk
service rollback paused: update paused due to failure or early termination of task px7jm8rdtpg0f3g63nyf6wmml

After removing order: start-first and falling back to the default stop-first, both stack deploy and docker service update started to work.

thaJeztah commented 7 years ago

ping @aaronlehmann @dnephin PTAL

dnephin commented 7 years ago

The names in the error message don't seem to match the names in your Compose file, I guess you changed them?

Seems like more of an issue with networking/swarmkit, don't think it's related to stacks.

omerh commented 7 years ago

Yes, I removed the real names, so you can ignore the mismatch. And yes, this is not related to stacks.

It happens to a service whose update_config order is set to start-first, both on stack deploy and on service update.

mcodd commented 7 years ago

Hi, I'm seeing the same issue when not using order: start-first. It seems like this could be related to something truncating names (specifically something truncating a thing it thinks is an ID to 12 characters, but it's not actually an ID and so we're getting collisions). In the initial report above, note the error message:

"Failed joining staging-endpoint to sandbox staging_backend_sbox: container staging_backend_sbox: endpoint create on GW Network failed: endpoint with name gateway_staging_back already exists in network docker_gwbridge"

It looks like "staging_backend_sbox" was truncated to "staging_back" (12 characters). In my case when I do a "network inspect docker_gwbridge" I end up with some similarly truncated results in 17.10 (the 12 characters being prod-swarm_d and prod-swarm_t in the output below). This is causing tasks to throw errors for me similar to the one above when there are other networks (containers?) with similar names... I don't have a full example of a config and error message at this point that's causing problems because we've rolled back to 17.09 (which does not have those truncated names, it only seems to show the "ingress-sbox" as a non-ID container name). Hopefully this is helpful enough for folks who understand more about how these names are generated, and what may have changed between 17.09 and 17.10, to investigate?

[
    {
        "Name": "docker_gwbridge",
        "Id": "c9edd6bab9f1f74be2bf77b36a5b5f0cb0da287ad6493667b04eb4944a83a221",
        "Created": "2017-10-05T21:37:13.969399402Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "2b4ec0cf5755e095b69e250540cf98d702fdbabcbbf245130fc313bc23ea731b": {
                "Name": "gateway_2b4ec0cf5755",
                "EndpointID": "426e729258f2f1180a70f0bfe8dd087b02d6f2e5db501e0a5739fa24d9184984",
                "MacAddress": "02:42:ac:12:00:08",  
                "IPv4Address": "172.18.0.8/16",
                "IPv6Address": ""
            },
            "3e26fadfbec8b336b7e3863945243b27d6d1557877a0cba3483ea5aff56225dc": {
                "Name": "gateway_3e26fadfbec8",
                "EndpointID": "7a29b4238693e3ee40c74185e003c55c5a1535d753ddb1009e3af682c8aa5570",
                "MacAddress": "02:42:ac:12:00:05",  
                "IPv4Address": "172.18.0.5/16",
                "IPv6Address": ""
            },
            "8b082b47dd6cf5eb87aacc8320bb67fca2e2da8f6faeb303831bdee7d00c384c": {
                "Name": "gateway_8b082b47dd6c",
                "EndpointID": "87264c87a6825b8b5832879d3084cc21cbcab8f9fc9320ff5b97788c65a41d9a",
                "MacAddress": "02:42:ac:12:00:07",  
                "IPv4Address": "172.18.0.7/16",
                "IPv6Address": ""
            },
            "dd051896127ba0cf4b8da85fc505e697b953e047f32ad733e22f945403f8e69e": {
                "Name": "gateway_dd051896127b",
                "EndpointID": "2008f75d6a3c8aa25e69386720842d020e1bbc285bfd873480c994660aa34b6b",
                "MacAddress": "02:42:ac:12:00:04",  
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            },
            "ingress-sbox": {
                "Name": "gateway_ingress-sbox",
                "EndpointID": "0929b42a8ccd906a197786142d6a3edefa818dcfd88394ca6b4ac87d748a83d3",
                "MacAddress": "02:42:ac:12:00:02",  
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            },
            "prod-swarm_default-sbox": {
                "Name": "gateway_prod-swarm_d",
                "EndpointID": "e6e84cd23010772e60cf7b0c33848c1e4cc0573200987fc0c5697a00d34d5bf6",
                "MacAddress": "02:42:ac:12:00:03",  
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "prod-swarm_traefikext-sbox": {
                "Name": "gateway_prod-swarm_t",
                "EndpointID": "5f02e77af1afde21aacd0f96ae2736ca11cfdcff121d057dd7171ca88ebe9057",
                "MacAddress": "02:42:ac:12:00:06",  
                "IPv4Address": "172.18.0.6/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_icc": "false",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.name": "docker_gwbridge"
        },
        "Labels": {}
    }
]

omerh commented 7 years ago

Just thinking... could this be related to the fact that our containers run without a STOPSIGNAL? The engine kills the container after the grace period expires and then restarts it, so the stack deploy fails.

@mcodd do you have a STOPSIGNAL in your Dockerfile?

mcodd commented 7 years ago

We do not have STOPSIGNAL in our Dockerfile... For the tasks failing to start, the error message really suggests something at the networking layer. I don't know enough swarm internals to understand why those -sbox containers are being spawned on the docker_gwbridge network, but their names are obviously colliding with similarly named things already on the network...

I see the gwEPlen constant being set to 12 at https://github.com/docker/libnetwork/blob/master/default_gateway.go, and I think that's what's being used to come up with the truncated names in my network inspect above. I also think there is an implicit assumption that these things are going to be named with UUID-type IDs and not the names that I'm using for my networks... Again, this seems to be something that changed in 17.10, because 17.09 doesn't show this issue at all in my environment.
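To make the suspected mechanism concrete, here is a minimal Python sketch. It is not the libnetwork code; it only assumes, from the names visible in this thread, that the gateway endpoint name is the prefix gateway_ followed by the first 12 characters of the sandbox key. The function name gateway_endpoint_name and the second sandbox name are made up for illustration.

```python
# Illustrative model only: assumes the endpoint name is "gateway_" plus the
# first 12 characters of the sandbox key, as the names in this thread suggest.
GW_EP_LEN = 12  # mirrors the gwEPlen value mentioned above


def gateway_endpoint_name(sandbox_key: str) -> str:
    """Return the docker_gwbridge endpoint name under the assumed rule."""
    return "gateway_" + sandbox_key[:GW_EP_LEN]


# A 64-character container ID keeps a distinctive 12-character prefix.
print(gateway_endpoint_name(
    "2b4ec0cf5755e095b69e250540cf98d702fdbabcbbf245130fc313bc23ea731b"))
# gateway_2b4ec0cf5755

# Name-based sandbox keys that share their first 12 characters collide.
a = gateway_endpoint_name("staging_backend_sbox")
b = gateway_endpoint_name("staging_backoffice_sbox")  # hypothetical second name
print(a, b, a == b)
# gateway_staging_back gateway_staging_back True
```

Under this assumption, a hex container ID almost never clashes after truncation, while two human-chosen names with a common 12-character prefix always do.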

cirocosta commented 6 years ago

I've been seeing this happening lately as well

docker version
Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:05:05 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:03:46 2017
 OS/Arch:      linux/amd64
 Experimental: true

And yes, we're testing with network names longer than 12 characters:

docker network ls
NETWORK ID          NAME                                      DRIVER              SCOPE
6ea857481951        bridge                                    bridge              local
a7a12hv1g9gi        com-somethin-network-256187400872361471   overlay             swarm
cpljozi32k4k        com-somethin-network-256203857689669114   overlay             swarm
qozdck9ltsbg        com-somethin-network-256214232243926931   overlay             swarm
lu2et741iimz        com-somethin-network-256216521800418289   overlay             swarm
k8a6seby57qh        com-somethin-network-256219163851389904   overlay             swarm
ruz80ohxo7zz        com-somethin-network-256225798573509330   overlay             swarm
0yx5uf8dt143        com-somethin-network-256226274643705108   overlay             swarm
kx04bzu88ezm        com-somethin-network-256227170786275798   overlay             swarm
mdbw2xsyhk0j        com-somethin-network-256227215782655894   overlay             swarm
v447udw8f6gw        com-somethin-network-256229288461242460   overlay             swarm
o6ruzir8f3hg        com-somethin-network-256232319955037343   overlay             swarm
w1r347rjcb3w        com-somethin-network-256232538421673948   overlay             swarm
wlycyfhwo593        com-somethin-network-256233842189380880   overlay             swarm
5lu6w9l7n5kn        com-somethin-network-256234462216996878   overlay             swarm
fwu9mkc043x0        com-somethin-network-256239854449434693   overlay             swarm
2ca3cdb78dc7        docker_gwbridge                           bridge              local
44c3f7593757        host                                      host                local
r6ligfm7chw9        ingress                                   overlay             swarm
ccb59f75d8a2        none                                      null                local

DBLaci commented 6 years ago

Docker 17.11

Rejected:

Failed joining xxxxxx-dev-yyyyyy-indexer-elastic-endpoint to sandbox xxxxxx-dev-yyyyyy-indexer-elastic-sbox: container xxxxxx-dev-yyyyyy-indexer-elastic-sbox: endpoint create on GW Network failed: endpoint with name gateway_xxxxxx-dev-s already exists in network docker_gwbridge

I am trying to shorten my network names as a workaround, but it is not easy: I build the names by concatenating constants, and it seems I have only one character to spare before I would have to restructure the whole deploy system...
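A quick way to see which network names are at risk is to group them by their first 12 characters. The Python sketch below does this. It assumes the sandbox key is the network name followed by -sbox (as in the error messages in this thread) and that only the first 12 characters survive in the gateway endpoint name. The helper name colliding_networks and the sample list are illustrative.

```python
from collections import defaultdict

PREFIX_LEN = 12  # assumed truncation length, taken from this thread


def colliding_networks(network_names):
    """Group network names whose assumed sandbox key shares a 12-char prefix."""
    groups = defaultdict(list)
    for name in network_names:
        sandbox_key = name + "-sbox"  # assumption: sandbox key is "<network>-sbox"
        groups[sandbox_key[:PREFIX_LEN]].append(name)
    # Keep only prefixes claimed by more than one network.
    return {p: names for p, names in groups.items() if len(names) > 1}


names = [
    "com-somethin-network-256187400872361471",
    "com-somethin-network-256203857689669114",
    "ingress",
    "poc-web",
]
for prefix, group in colliding_networks(names).items():
    print(prefix, "->", group)
```

Any group this prints is a set of networks that could end up with the same gateway endpoint name on a node, so those are the names worth shortening or making distinct within the first 12 characters.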

DBLaci commented 6 years ago

Additional information: this problem does not always occur when the truncated network names collide. Recreating the network helps.

mcodd commented 6 years ago

I'm curious whether any progress has been made on this issue. I see that #35310 could have been related, but @DBLaci's comment suggests the truncation at 12 characters is still occurring. This is currently keeping us from moving off 17.09 to versions that could fix some other network-related issues we're seeing. Trying to understand the nature of the problem a little better: it seems like the containers on the docker_gwbridge network have been given keys in 17.10+ that are no longer those long UUID-type strings but instead are names related to services, truncated at 12 characters? I'm surprised that more folks aren't hitting this, and I'm wondering whether something unique in my environment makes me see it when others don't. I can't imagine that everyone is using really short service names.
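One way to check a node for such name-keyed entries is to scan the docker network inspect output for container keys that are not 64-character hex IDs. The Python sketch below does that. The function name extra_sandboxes and the trimmed sample JSON are illustrative; in practice the text would come from running docker network inspect docker_gwbridge on the node.

```python
import json
import re

HEX_ID = re.compile(r"[0-9a-f]{64}")


def extra_sandboxes(inspect_json: str):
    """Return container keys that are not full hex IDs or ingress-sbox."""
    networks = json.loads(inspect_json)
    extras = []
    for net in networks:
        # "Containers" may be missing or null on an idle network.
        for key in net.get("Containers") or {}:
            if key != "ingress-sbox" and not HEX_ID.fullmatch(key):
                extras.append(key)
    return extras


# Trimmed sample shaped like the inspect output in this thread.
sample = json.dumps([{
    "Name": "docker_gwbridge",
    "Containers": {
        "2deb67dcd42cfe7464cb0f8d2e66ec3d0a70053aaa1e33f718a8ac89c1a4a0f3": {
            "Name": "gateway_2deb67dcd42c"},
        "ingress-sbox": {"Name": "gateway_ingress-sbox"},
        "prod-swarm_default-sbox": {"Name": "gateway_prod-swarm_d"},
    },
}])
print(extra_sandboxes(sample))  # ['prod-swarm_default-sbox']
```

An empty list would match what 17.09 shows in my environment, where ingress-sbox is the only non-ID key.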

thaJeztah commented 6 years ago

ping @pradipd @msabansal @madhanrm @mavenugo PTAL ^^

kinghuang commented 6 years ago

I'm also running into this issue with 17.10.0-ce. Is there a workaround? I haven't been able to figure out why it affects some services/networks and not others.

Failed joining wte-93-faas_perimeter-endpoint to sandbox wte-93-faas_perimeter-sbox: container wte-93-faas_perimeter-sbox: endpoint create on GW Network failed: endpoint with name gateway_wte-93-faas_ already exists in network docker_gwbridge

thaJeztah commented 6 years ago

ping @pradipd has this been addressed?

pradipd commented 6 years ago

I think so, but I'm not 100% certain, because I can't tell whether this thread covers several issues or a single one. The 12-character truncation issue that @mcodd describes above (https://github.com/moby/moby/issues/35281#issuecomment-339984017) should be addressed by https://github.com/moby/moby/pull/35422. My understanding is that the fix should be in 17.12.

If there are other issues beyond the truncation, then we will have to investigate those issues.

selansen commented 6 years ago

I am working on this issue. I tested 17.11 and reproduced it. If it is already fixed in 17.12, I will move on to a different issue.

pradipd commented 6 years ago

If you could give me the repro steps, I'll validate it is fixed in 17.12

selansen commented 6 years ago

I have the setup now. Will try it with 17.12 and let you know.

selansen commented 6 years ago

I tested the same scenario with a 17.12 image and the problem does not occur there. It looks like this issue was caused by a regression, and @pradipd's commit fixed it in 17.12. We are good to close this issue. The output below, from the 17.12 image, shows that the extra sandbox creation that caused the issue (from 17.10 onwards) is no longer present in 17.12.

docker@ELANGO-CE-EDGE-ubuntu-0:~$ docker network inspect -v docker_gwbridge
[
    {
        "Name": "docker_gwbridge",
        "Id": "ce7e6941145345011d1c07f776b9657430897cb16c26ee16c84aa5d045bf5f51",
        "Created": "2017-12-08T10:04:00.39018803-08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "2deb67dcd42cfe7464cb0f8d2e66ec3d0a70053aaa1e33f718a8ac89c1a4a0f3": {
                "Name": "gateway_2deb67dcd42c",
                "EndpointID": "f88e2379020d38da16435120c28e141feb74187ccd4c3f87cf6ad26dba5afdaf",
                "MacAddress": "02:42:ac:12:00:04",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            },
            "82ee91db89bfd0fd35f1b1cefc381517ab0683d34ad83e6da736ff0b7015b0e0": {
                "Name": "gateway_82ee91db89bf",
                "EndpointID": "36fc341c5433c199d91b4163adb4a3e4288de609e8d9885df72028817a810fc0",
                "MacAddress": "02:42:ac:12:00:05",
                "IPv4Address": "172.18.0.5/16",
                "IPv6Address": ""
            },
            "96032523a574afe4c51effdfeb92fcacfffc77ad5bc5099c253edd14ab284de4": {
                "Name": "gateway_96032523a574",
                "EndpointID": "2637a5b1222296967d609fb241d1ebe65cfc09826490829665cb2328dde6e316",
                "MacAddress": "02:42:ac:12:00:06",
                "IPv4Address": "172.18.0.6/16",
                "IPv6Address": ""
            },
            "f7bafd5b6e5487a34b248e69413771d0cf8b4102316a41b0e25139e766163843": {
                "Name": "gateway_f7bafd5b6e54",
                "EndpointID": "fa90f22b9609a4088882739cd4513715ae74ec47d44f342958fc2b2dd7160b34",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "ingress-sbox": {
                "Name": "gateway_ingress-sbox",
                "EndpointID": "d45b2d18dea148be00ad416e3ab64fc690df4a97b6ce240076759226f6924d6d",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_icc": "false",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.name": "docker_gwbridge"
        },
        "Labels": {}
    }
]

dnephin commented 6 years ago

Thanks for confirming the fix!