moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0
3.37k stars 616 forks source link

Container left in Created state #2273

Open fcrisciani opened 7 years ago

fcrisciani commented 7 years ago

Steps to reproduce: Run this script a couple of times:

#!/bin/bash
docker network create -d overlay --subnet 192.168.23.0/24 --opt com.docker.network.driver.mtu=931 lab-network
docker stack deploy -c docker-compose-bad.yml test
sleep 300
docker stack deploy -c docker-compose.yml test

Results:

7c9cd869d90b        alpine:latest       "op"                17 minutes ago       Created                                 test_srv2.12.tdqwor9o8v8sl6eda2hvxj5kj                                                                                                                 
ebad82ebb9b5        alpine:latest       "op"                17 minutes ago       Created                                 test_srv2.5.an8y551b5c99e1u7kdcumeguz                                                                                                                  
2394cc396fdc        alpine:latest       "op"                17 minutes ago       Created                                 test_srv2.19.xsnwe66tb942wsd0h1z2vgclx                                                                                                                 
830875b700b9        alpine:latest       "op"                17 minutes ago       Created                                 test_srv2.13.y63yxihwy76qdlzyuvnewffrj                                                                                                                 
bab7046702e8        alpine:latest       "op"                17 minutes ago       Created                                 test_srv1.10.qrg8h25aoh9cj71hiuqb2xwce                                                                                                                 
60060c99e02d        alpine:latest       "op"                17 minutes ago       Created                                 test_srv1.9.hplrcfayt7mes8k2jxosnifnl                                                                                                                  
0113d15d3900        alpine:latest       "op"                17 minutes ago       Created                                 test_srv1.21.it7nhycc4jkplox60700zzepu                                                                                                                 
868e7c243b80        alpine:latest       "op"                17 minutes ago       Created                                 test_srv2.8.wywzsd4o6nu8yzg2yk0vpmkla                                                                                                                  
62ecd798d1e5        alpine:latest       "op"                17 minutes ago       Created                                 test_srv2.9.g1ygzxgs8wd5gwbv474zwll5g

These containers are never deleted

files can be found here: https://github.com/bzcnsh/temprepo01

aaronlehmann commented 7 years ago

@dperny: Do I remember correctly that you found some kind of race that could leave containers in the Created state when a service is deleted? I wonder if this is similar.

@fcrisciani: Could you please attach the compose files so this can be reproduced?

fcrisciani commented 7 years ago

I also put the repo above, anyway to make it easier here they are. I had to rename them for the upload, they are .yml

docker-compose-bad.txt docker-compose.txt

aaronlehmann commented 7 years ago

Reproduced this, and at first thought something was wrong, but actually I think it's expected behavior.

We keep around the containers until the tasks get purged from the task list. This lets the administrator look at old containers to see why they failed. By default, we keep 4 old tasks for each slot in addition to the current running task.

When I ran the script, test_srv1 and test_srv2 had two replicas each, and each had the default 4 historic tasks preserved. This is a total of 16 dead tasks still tracked by swarmkit. When I ran docker ps -a, I saw exactly 16 containers in the "Created" state. The IDs seem to match, as well.

fcrisciani commented 7 years ago

So if I run this in a loop every iteration will leave 4 dead containers correct? Can you confirm that all the resources of that containers are released? Like ip address etc, with the inspect them I can see still ip reported.

On Jun 20, 2017 5:58 PM, "Aaron Lehmann" notifications@github.com wrote:

Reproduced this, and at first thought something was wrong, but actually I think it's expected behavior.

We keep around the containers until the tasks get purged from the task list. This lets the administrator look at old containers to see why they failed. By default, we keep 4 old tasks for each slot in addition to the current running task.

When I ran the script, test_srv1 and test_srv2 had two replicas each, and each had the default 4 historic tasks preserved. This is a total of 16 dead tasks still tracked by swarmkit. When I ran docker ps -a, I saw exactly 16 containers in the "Created" state. The IDs seem to match, as well.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/docker/swarmkit/issues/2273#issuecomment-309932528, or mute the thread https://github.com/notifications/unsubscribe-auth/ABvTiVRrpaxx7j-OBstGFCHyGmDj1KUQks5sGGrEgaJpZM4OASe6 .

aaronlehmann commented 7 years ago

So if I run this in a loop every iteration will leave 4 dead containers correct?

No, the total should not change after each iteration. There should be up to 4 dead containers for every running service replica.

Can you confirm that all the resources of that containers are released? Like ip address etc, with the inspect them I can see still ip reported.

They should be released when the agent reports that the task has failed.

fcrisciani commented 7 years ago

ok then the issue is there still:

These is a snapshot that I took now from my setup after 17 iterations:

b040f0762dde        alpine:latest       "op"                10 seconds ago      Created                                 test_srv2.23.kcsunycxq0wpf306coyjivacd
b645eee8ca4f        alpine:latest       "op"                12 seconds ago      Created                                 test_srv2.14.o30gkvuy0es5jf0gb5eigy3ae
04317b931348        alpine:latest       "op"                20 seconds ago      Created                                 test_srv1.20.z4q5jhcailmwj5gf45mc3tt9m
bca1197709e6        alpine:latest       "op"                23 seconds ago      Created                                 test_srv1.13.hgygjqh51cpnvoy2o5dm6wrey
dd2da9f5a6c4        alpine:latest       "op"                35 seconds ago      Created                                 test_srv2.18.z2r4cljdy760hwzp8rkv0a7x4
4c5463f774da        alpine:latest       "op"                About an hour ago   Created                                 test_srv2.12.y7fryiqwhl3lilenx6i8lc323
63ef78b41e8b        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.21.ply94ufp13foflk7450w60i18
dea789cc22ca        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.23.zjuso3q4vk2ngdwz6kg9tnm4y
a6955f20bd21        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.4.gt7ulveshuy2y0kuehgixzcam
e2d90577f469        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.5.5fqyb1elxudwfbputahte9n2k
3618c47b5ed6        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.17.lw9wdkq1561ptnwbqa9eosnfj
75720eb65881        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.9.tpdnrmm0r2tvcdcdug5armblo
e91c23aa1161        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.3.nlzcgpsh9ys6gjyjxjpqj6kzs
dfb387b40cfd        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.12.qap5ei6cya3i079ze7ztb7son
eaaec0b9d22a        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.5.kpv2mqgmcwsxaolu2mozhsf02
3f2a72d20d2c        alpine:latest       "op"                About an hour ago   Created                                 test_srv1.3.clvii1tcz434voxxr58xgp39e
3f4ba5d7d615        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.8.iqh8n46l2g7co9dn10mpjbdue
25f389b2d22d        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.5.hbr01p4tt97x15f8rdd7ttjxk
5ed0c001ad2e        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.10.rmfantg4ekapolfvhy8c2t48n
7447d1f785e2        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.12.v0n8jo4pjlhm7um300j5dsmoe
a4efe12e58d3        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.4.qv463nmpto60wneebx89fdyqy
8a54f7e621e3        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.25.kp6er35y6ta1q3gvfwssu1a0e
0259d7dbb307        alpine:latest       "op"                2 hours ago         Created                                 test_srv1.18.7ia6uu4j6jw4rgnt36i6ahuo9
1d58ea906cf8        alpine:latest       "op"                2 hours ago         Created                                 test_srv1.2.ozbelroftqnb2iv9y3uj1htmf
acc5df52e30f        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.12.mm5k836picomx35s9yz7csamf
60c1cbcb1e2f        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.4.negrw6id3qruistcav80xjzqu
e641093f5429        alpine:latest       "op"                2 hours ago         Created                                 test_srv1.20.m9no1913scrnr0hvbielv94ai
5f0f92698d37        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.1.kdlx4uyqhtv2ppy9aw2ociknd
e2a3216d6275        alpine:latest       "op"                2 hours ago         Created                                 test_srv1.8.v9yvwy1sb79hm3hwtux0bo05k
255175c0df36        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.21.ue6849gvl9bq6b2bwirpt7ipt
dd81382d6f3b        alpine:latest       "op"                2 hours ago         Created                                 test_srv1.16.lhhi8qg5lf8yevej98agohb6b
0ba65fa69d8e        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.24.tsw9h6aozpmdksm539nbfhjfw
308fdae83bc6        alpine:latest       "op"                2 hours ago         Created                                 test_srv2.10.bq66vs8b23xsket3n76lefi9t

17 entries are from 2h ago while 11 are from 1h ago

aaronlehmann commented 7 years ago

Okay, thanks, I'll look into that.

Can you paste the output of docker service ps for each service?

aaronlehmann commented 7 years ago

Discussed a bit with @fcrisciani.

There are errors like this in the log:

time="2017-06-21T00:27:41.053319465Z" level=error msg="remove task failed" error="unknown network lab-network id 91mx5g37wnmr6kjbvvmk6w2af" module="node/agent" node.id=s4cnwq5hopzv8cxl8k40i45wf task.id=y7fryiqwhl3lilenx6i8lc323 

It appears that (*network).delete is returning a UnknownNetworkError when trying to delete a network:

https://github.com/moby/moby/blob/6868b8ef230fea7bc5758e2d252d4297033029fa/vendor/github.com/docker/libnetwork/network.go#L936-L939

This swallows the real error returned by getNetworkFromStore, so I'm wondering if it might actually be encountering an error trying to fetch the endpoint count.

Or could there be something wrong with reference counting, so that the network gets incorrectly deleted, even though at least one task is still referencing it?

ph-One commented 6 years ago

This error message exists on the [newly released] 17.12.0-ce as well... Makes CI/CD sort of a pain when we have to write scripts to clean-up "left-overs".

docker stack deploy -c ./mycompose.yml example docker stack rm example

(docker ps -a shows many containers that have exited, are part of the stack, but are not removed.)

dockerd[5638]: time="2018-01-03T15:59:03.983781615-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=45r0vxem02pobyy7cyz9a6go1
dockerd[5638]: time="2018-01-03T15:59:03.985594943-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:03.985634229-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=77gareatp6jxf7m0jbe9rvqbl
dockerd[5638]: time="2018-01-03T15:59:04.298354526-05:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/containers/delete type="*events.ContainerDelete"
dockerd[5638]: time="2018-01-03T15:59:04.982954577-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:04.983022362-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=zl512dke8x96ju8fcw3eiw7dk
dockerd[5638]: time="2018-01-03T15:59:04.984177429-05:00" level=error msg="network example_frontend remove failed: network example_frontend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:04.984252914-05:00" level=error msg="remove task failed" error="network example_frontend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=2vqqvgo9x9tu7ttfeuhbod8tr
dockerd[5638]: time="2018-01-03T15:59:05.884230284-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:05.884253861-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=bdmhf3zdzv4d9jcp90hh0uwi2
dockerd[5638]: time="2018-01-03T15:59:05.884528976-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:05.884544505-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=7fvmq4ahp28r1m7xct6ptbnww
dockerd[5638]: time="2018-01-03T15:59:06.688436988-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:06.688510920-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=f71pwgga6o2ebef59vfbojklt