Closed itamarst closed 10 years ago
My use case, BTW, is running automated functional tests for our project's geard integration. But this is also worrying for eventual production use.
@mfojtik can you look into this?
@smarterclayton sure, I will be looking into this today
I can reproduce this as well, same as @itamarst describes; I'm looking at the queue @smarterclayton described in the linked issue.
@smarterclayton the problem I see is that when you do what @itamarst describes, geard gets stuck in a job START:
Jun 12 12:08:06 dev.origin.com gear[29383]: 2014/06/12 12:08:06 204 19.58ms PUT /container/busybox/started
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.stopContainer, uDKMHJGCBO2wmUASfq4new: &{StoppedContainerStateRe...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END uDKMHJGCBO2wmUASfq4new
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.73ms PUT /container/busybox/stopped
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.startContainer, TUVnGWCLizCWHGcT4qBlFA: &{StartedContainerStateR...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END TUVnGWCLizCWHGcT4qBlFA
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.66ms PUT /container/busybox/started
Jun 12 12:08:27 dev.origin.com gear[29383]: job START *linux.stopContainer, sxSSmdVKKX_GQ7hdtx40Cw: &{StoppedContainerStateRe...ce5f0}
.... no END
Here the stopContainer job hangs...
@smarterclayton fyi, setting Concurrent to 8 does not fix this :-( and you are right, it is the StopUnitJob that is hanging... I'm debugging it now.
UPDATE: It might be a locking problem... the go-systemd startJob method hangs when trying to acquire jobListener.Lock()
I'm on it :-) The jobComplete method seems to take a Lock and under some circumstances never releases it... It seems like StartContainer causes it.
Destructive job queuing maybe (start is using "replace"), maybe the previous systemd job never completes to return to the event subscription loop.
@smarterclayton yeah, that might be it... maybe adding a sleep to jobComplete to give systemd time to finish the job would fix this.
I don't see how sleeping helps - I just meant that there may be a flawed assumption that jobComplete is ever going to happen.
@smarterclayton actually both start and stop use 'replace'
@smarterclayton so adding:
time.Sleep(1 * time.Second)
to both stopContainer and startContainer after calling systemd.Connection().StopUnitJob fixes this.
Sure - we just wouldn't use that to fix it. Guess I'm asking what the underlying race condition in go-systemd is.
@smarterclayton sure, but it proves there is a race between stop/start in go-systemd, so I will do more digging :-)
Btw. I just found that the StartUnit action returns 'canceled' instead of 'done' when you execute it right after a StopUnit action.
@smarterclayton how about:
err := systemd.Connection().StopUnitJob(unitName, "replace-irreversibly")
seems to work perfectly fine :-)
If "replace-irreversibly" is specified, operate like "replace", but also mark the new jobs as irreversible. This prevents future conflicting transactions from replacing these jobs (or even being enqueued while the irreversible jobs are still pending). Irreversible jobs can still be cancelled using the cancel command.
Not what we want, if I'm reading correctly. Subsequent start/stop should cancel a stop/start.
@smarterclayton well, we can detect whether the start/stop failed with a 'Transaction is destructive' error. systemd will refuse the operation if another operation is currently being executed. In that case we can just wait and retry.
@smarterclayton nvmd, just got the 'cancel' part ;-)
We don't want to wait and retry - we want systemd to handle destructive transactions cleanly. There is a bug in go-systemd and it needs to be fixed.
EDIT: since I was the one who added the start job, I'm pretty sure this is just a bug in the event listener logic introduced by the new concept - go-systemd just never had to deal with a cancelled job before (an admin could cause this today by running systemctl at the same time).
@smarterclayton I think I have a fix, PR on the way
@mfojtik how's it going? I really want to take the sleep(1) out of my code :grin:
This was merged - let us know if it's fixed
My functional tests no longer lock up geard, so I think it's good. Thank you!
This works just fine:
If however I start and then stop the container in quick succession, often the second curl command just hangs, as do e.g. additional commands like deleting a unit.
This is Fedora 20, gear version alpha3, build 7516dab (built from checkout a couple hours ago using vendored dependencies).
Possibly a duplicate of #166.