fcrisciani opened this issue 7 years ago
@dperny: Do I remember correctly that you found some kind of race that could leave containers in the Created state when a service is deleted? I wonder if this is similar.
@fcrisciani: Could you please attach the compose files so this can be reproduced?
I also linked the repo above, but to make it easier, here they are. I had to rename them for the upload; they are .yml files.
Reproduced this, and at first thought something was wrong, but actually I think it's expected behavior.
We keep around the containers until the tasks get purged from the task list. This lets the administrator look at old containers to see why they failed. By default, we keep 4 old tasks for each slot in addition to the current running task.
When I ran the script, test_srv1 and test_srv2 had two replicas each, and each had the default 4 historic tasks preserved. This is a total of 16 dead tasks still tracked by swarmkit. When I ran docker ps -a, I saw exactly 16 containers in the "Created" state. The IDs seem to match, as well.
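For reference, the retention behavior described above is tunable: the CLI flag is docker swarm update --task-history-limit. Below is a minimal sketch of the same change through the Docker Go SDK, assuming a client configured from the environment; the limit of 1 is just an example value, not something suggested in the thread.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current swarm spec and version; SwarmUpdate needs both.
	sw, err := cli.SwarmInspect(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Keep only 1 historic task per slot instead of the default described above.
	limit := int64(1)
	sw.Spec.Orchestration.TaskHistoryRetentionLimit = &limit

	if err := cli.SwarmUpdate(ctx, sw.Version, sw.Spec, swarm.UpdateFlags{}); err != nil {
		log.Fatal(err)
	}
}
```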
So if I run this in a loop, every iteration will leave 4 dead containers, correct? Can you confirm that all the resources of those containers are released (IP addresses, etc.)? When I inspect them, I can still see an IP reported.
So if I run this in a loop, every iteration will leave 4 dead containers, correct?
No, the total should not change after each iteration. There should be up to 4 dead containers for every running service replica.
Can you confirm that all the resources of those containers are released (IP addresses, etc.)? When I inspect them, I can still see an IP reported.
They should be released when the agent reports that the task has failed.
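One way to check whether the addresses were in fact released is to inspect the leftover "Created" containers and report any IP still attached. This is a small sketch using the Docker Go SDK, not something from the original thread; it assumes a client configured from the environment.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}

	// Only look at containers stuck in the "Created" state.
	f := filters.NewArgs()
	f.Add("status", "created")

	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true, Filters: f})
	if err != nil {
		log.Fatal(err)
	}

	// Report any IP address still attached to these dead containers.
	for _, c := range containers {
		insp, err := cli.ContainerInspect(ctx, c.ID)
		if err != nil || insp.NetworkSettings == nil {
			continue
		}
		for netName, ep := range insp.NetworkSettings.Networks {
			if ep.IPAddress != "" {
				fmt.Printf("%s still holds %s on network %s\n", c.ID[:12], ep.IPAddress, netName)
			}
		}
	}
}
```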
OK, then the issue is still there. Here is a snapshot I took just now from my setup, after 17 iterations:
b040f0762dde alpine:latest "op" 10 seconds ago Created test_srv2.23.kcsunycxq0wpf306coyjivacd
b645eee8ca4f alpine:latest "op" 12 seconds ago Created test_srv2.14.o30gkvuy0es5jf0gb5eigy3ae
04317b931348 alpine:latest "op" 20 seconds ago Created test_srv1.20.z4q5jhcailmwj5gf45mc3tt9m
bca1197709e6 alpine:latest "op" 23 seconds ago Created test_srv1.13.hgygjqh51cpnvoy2o5dm6wrey
dd2da9f5a6c4 alpine:latest "op" 35 seconds ago Created test_srv2.18.z2r4cljdy760hwzp8rkv0a7x4
4c5463f774da alpine:latest "op" About an hour ago Created test_srv2.12.y7fryiqwhl3lilenx6i8lc323
63ef78b41e8b alpine:latest "op" About an hour ago Created test_srv1.21.ply94ufp13foflk7450w60i18
dea789cc22ca alpine:latest "op" About an hour ago Created test_srv1.23.zjuso3q4vk2ngdwz6kg9tnm4y
a6955f20bd21 alpine:latest "op" About an hour ago Created test_srv1.4.gt7ulveshuy2y0kuehgixzcam
e2d90577f469 alpine:latest "op" About an hour ago Created test_srv1.5.5fqyb1elxudwfbputahte9n2k
3618c47b5ed6 alpine:latest "op" About an hour ago Created test_srv1.17.lw9wdkq1561ptnwbqa9eosnfj
75720eb65881 alpine:latest "op" About an hour ago Created test_srv1.9.tpdnrmm0r2tvcdcdug5armblo
e91c23aa1161 alpine:latest "op" About an hour ago Created test_srv1.3.nlzcgpsh9ys6gjyjxjpqj6kzs
dfb387b40cfd alpine:latest "op" About an hour ago Created test_srv1.12.qap5ei6cya3i079ze7ztb7son
eaaec0b9d22a alpine:latest "op" About an hour ago Created test_srv1.5.kpv2mqgmcwsxaolu2mozhsf02
3f2a72d20d2c alpine:latest "op" About an hour ago Created test_srv1.3.clvii1tcz434voxxr58xgp39e
3f4ba5d7d615 alpine:latest "op" 2 hours ago Created test_srv2.8.iqh8n46l2g7co9dn10mpjbdue
25f389b2d22d alpine:latest "op" 2 hours ago Created test_srv2.5.hbr01p4tt97x15f8rdd7ttjxk
5ed0c001ad2e alpine:latest "op" 2 hours ago Created test_srv2.10.rmfantg4ekapolfvhy8c2t48n
7447d1f785e2 alpine:latest "op" 2 hours ago Created test_srv2.12.v0n8jo4pjlhm7um300j5dsmoe
a4efe12e58d3 alpine:latest "op" 2 hours ago Created test_srv2.4.qv463nmpto60wneebx89fdyqy
8a54f7e621e3 alpine:latest "op" 2 hours ago Created test_srv2.25.kp6er35y6ta1q3gvfwssu1a0e
0259d7dbb307 alpine:latest "op" 2 hours ago Created test_srv1.18.7ia6uu4j6jw4rgnt36i6ahuo9
1d58ea906cf8 alpine:latest "op" 2 hours ago Created test_srv1.2.ozbelroftqnb2iv9y3uj1htmf
acc5df52e30f alpine:latest "op" 2 hours ago Created test_srv2.12.mm5k836picomx35s9yz7csamf
60c1cbcb1e2f alpine:latest "op" 2 hours ago Created test_srv2.4.negrw6id3qruistcav80xjzqu
e641093f5429 alpine:latest "op" 2 hours ago Created test_srv1.20.m9no1913scrnr0hvbielv94ai
5f0f92698d37 alpine:latest "op" 2 hours ago Created test_srv2.1.kdlx4uyqhtv2ppy9aw2ociknd
e2a3216d6275 alpine:latest "op" 2 hours ago Created test_srv1.8.v9yvwy1sb79hm3hwtux0bo05k
255175c0df36 alpine:latest "op" 2 hours ago Created test_srv2.21.ue6849gvl9bq6b2bwirpt7ipt
dd81382d6f3b alpine:latest "op" 2 hours ago Created test_srv1.16.lhhi8qg5lf8yevej98agohb6b
0ba65fa69d8e alpine:latest "op" 2 hours ago Created test_srv2.24.tsw9h6aozpmdksm539nbfhjfw
308fdae83bc6 alpine:latest "op" 2 hours ago Created test_srv2.10.bq66vs8b23xsket3n76lefi9t
17 entries are from 2h ago while 11 are from 1h ago
Okay, thanks, I'll look into that.
Can you paste the output of docker service ps for each service?
Discussed a bit with @fcrisciani.
There are errors like this in the log:
time="2017-06-21T00:27:41.053319465Z" level=error msg="remove task failed" error="unknown network lab-network id 91mx5g37wnmr6kjbvvmk6w2af" module="node/agent" node.id=s4cnwq5hopzv8cxl8k40i45wf task.id=y7fryiqwhl3lilenx6i8lc323
It appears that (*network).delete is returning an UnknownNetworkError when trying to delete a network. This swallows the real error returned by getNetworkFromStore, so I'm wondering if it might actually be encountering an error while trying to fetch the endpoint count.
Or could there be something wrong with reference counting, so that the network gets incorrectly deleted, even though at least one task is still referencing it?
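To make the suspected failure mode concrete, here is an illustrative sketch of the error-swallowing pattern being described. It is not the actual libnetwork code; the type and function names are simplified stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// UnknownNetworkError stands in for the typed error mentioned above; the real
// libnetwork type is different, this only illustrates the pattern.
type UnknownNetworkError struct {
	Name, ID string
}

func (e UnknownNetworkError) Error() string {
	return fmt.Sprintf("unknown network %s id %s", e.Name, e.ID)
}

// getNetworkFromStore stands in for the store lookup; imagine it failing for a
// reason unrelated to the network being unknown (e.g. the endpoint-count fetch).
func getNetworkFromStore(id string) error {
	return errors.New("failed to get endpoint count for network " + id)
}

// deleteNetwork shows the swallowing: whatever the lookup actually failed with,
// the caller and the agent log only ever see "unknown network".
func deleteNetwork(name, id string) error {
	if err := getNetworkFromStore(id); err != nil {
		// The real cause (err) is dropped here.
		return UnknownNetworkError{Name: name, ID: id}
	}
	return nil
}

func main() {
	err := deleteNetwork("lab-network", "91mx5g37wnmr6kjbvvmk6w2af")
	fmt.Println(err) // prints "unknown network ...", hiding the real failure

	// Wrapping instead would keep the cause visible in the "remove task failed" log line:
	//   return fmt.Errorf("unknown network %s id %s: %w", name, id, err)
}
```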
This error message exists on the [newly released] 17.12.0-ce as well... It makes CI/CD sort of a pain when we have to write scripts to clean up the "leftovers" (a cleanup sketch is included below).
docker stack deploy -c ./mycompose.yml example
docker stack rm example
(docker ps -a shows many containers that have exited, are part of the stack, but are not removed.)
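A minimal sketch of the kind of cleanup script mentioned above, using the Docker Go SDK to remove containers left behind after docker stack rm. The stack name "example" matches the commands above; the label and status filters are standard Docker filters, but treat this as an illustration rather than a recommended workaround.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}

	// Match containers left behind by the removed stack: created or exited,
	// and labeled with the stack namespace that docker stack deploy sets.
	f := filters.NewArgs()
	f.Add("label", "com.docker.stack.namespace=example")
	f.Add("status", "created")
	f.Add("status", "exited")

	leftovers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true, Filters: f})
	if err != nil {
		log.Fatal(err)
	}

	for _, c := range leftovers {
		// Force-remove; these containers are already dead as far as swarm is concerned.
		if err := cli.ContainerRemove(ctx, c.ID, types.ContainerRemoveOptions{Force: true}); err != nil {
			log.Printf("failed to remove %s: %v", c.ID[:12], err)
		}
	}
}
```

The same filters also work directly from the CLI, e.g. docker ps -aq --filter label=com.docker.stack.namespace=example --filter status=created --filter status=exited | xargs -r docker rm -f.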
dockerd[5638]: time="2018-01-03T15:59:03.983781615-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=45r0vxem02pobyy7cyz9a6go1
dockerd[5638]: time="2018-01-03T15:59:03.985594943-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:03.985634229-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=77gareatp6jxf7m0jbe9rvqbl
dockerd[5638]: time="2018-01-03T15:59:04.298354526-05:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/containers/delete type="*events.ContainerDelete"
dockerd[5638]: time="2018-01-03T15:59:04.982954577-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:04.983022362-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=zl512dke8x96ju8fcw3eiw7dk
dockerd[5638]: time="2018-01-03T15:59:04.984177429-05:00" level=error msg="network example_frontend remove failed: network example_frontend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:04.984252914-05:00" level=error msg="remove task failed" error="network example_frontend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=2vqqvgo9x9tu7ttfeuhbod8tr
dockerd[5638]: time="2018-01-03T15:59:05.884230284-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:05.884253861-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=bdmhf3zdzv4d9jcp90hh0uwi2
dockerd[5638]: time="2018-01-03T15:59:05.884528976-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:05.884544505-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=7fvmq4ahp28r1m7xct6ptbnww
dockerd[5638]: time="2018-01-03T15:59:06.688436988-05:00" level=error msg="network example_backend remove failed: network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m
dockerd[5638]: time="2018-01-03T15:59:06.688510920-05:00" level=error msg="remove task failed" error="network example_backend not found" module=node/agent node.id=ps8gxl6j3htw9qalhvnfr3u9m task.id=f71pwgga6o2ebef59vfbojklt
Steps to reproduce: Run this script a couple of times:
Results:
These containers are never deleted. Files can be found here: https://github.com/bzcnsh/temprepo01