openshift / geard

geard is no longer maintained - see OpenShift 3 and Kubernetes

geard freezes when unit is started/stopped in quick succession #187

Closed: itamarst closed this issue 10 years ago

itamarst commented 10 years ago

This works just fine:

$ curl -XPUT -d '{"Image": "busybox", "Started": true}' http://localhost:43273/container/testcontainer
$ sleep 1
$ curl -XPUT http://localhost:43273/container/testcontainer/stopped

If, however, I start and then stop the container in quick succession, the second curl command often just hangs, as do subsequent commands such as deleting a unit.

$ curl -XPUT -d '{"Image": "busybox", "Started": true}' http://localhost:43273/container/testcontainer; curl -XPUT http://localhost:43273/container/testcontainer/stopped
... hangs here ...

This is Fedora 20, gear version alpha3, build 7516dab (built from a checkout a couple of hours ago using vendored dependencies).

Possibly a duplicate of #166.

itamarst commented 10 years ago

My use case, BTW, is running automated functional tests for our project's geard integration. But this is also worrying for eventual production use.

smarterclayton commented 10 years ago

@mfojtik can you look into this?

mfojtik commented 10 years ago

@smarterclayton sure, I will be looking into this today

mfojtik commented 10 years ago

I can reproduce this as well, exactly as @itamarst describes. I'm looking at the queue @smarterclayton described in the linked issue.

mfojtik commented 10 years ago

@smarterclayton the problem I see is that when you do what @itamarst describes, geard seems to get stuck in a job START:

Jun 12 12:08:06 dev.origin.com gear[29383]: 2014/06/12 12:08:06 204 19.58ms PUT /container/busybox/started
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.stopContainer, uDKMHJGCBO2wmUASfq4new: &{StoppedContainerStateRe...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END   uDKMHJGCBO2wmUASfq4new
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.73ms PUT /container/busybox/stopped
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.startContainer, TUVnGWCLizCWHGcT4qBlFA: &{StartedContainerStateR...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END   TUVnGWCLizCWHGcT4qBlFA
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.66ms PUT /container/busybox/started
Jun 12 12:08:27 dev.origin.com gear[29383]: job START *linux.stopContainer, sxSSmdVKKX_GQ7hdtx40Cw: &{StoppedContainerStateRe...ce5f0}

.... no END here; the stopContainer job will hang ...

mfojtik commented 10 years ago

@smarterclayton fyi, setting Concurrent to 8 does not fix this :-( and you are right, it is the StopUnitJob that is hanging... I'm debugging it now.

UPDATE: It might be a locking problem... the go-systemd startJob method hangs when trying to acquire jobListener.Lock().

I'm on it :-) jobComplete seems to take the Lock and under some circumstances never releases it... It seems like StartContainer causes it.
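
A minimal sketch of the suspected pattern, with hypothetical names (this is not the actual go-systemd code): a completion handler that holds the lock across an unbuffered channel send can leave the lock held forever, so every later caller of Lock() hangs.

package sketch

import "sync"

type jobListener struct {
	sync.Mutex
	jobs map[string]chan string
}

// jobComplete holds the lock while delivering the result. If the job was
// replaced and no goroutine still reads from ch, this unbuffered send
// blocks forever and the lock is never released.
func (l *jobListener) jobComplete(id, result string) {
	l.Lock()
	defer l.Unlock()
	if ch, ok := l.jobs[id]; ok {
		ch <- result // blocks with the lock held if nobody is reading
		delete(l.jobs, id)
	}
}

// Every subsequent job registration then deadlocks on Lock(), which would
// explain why all later requests (stop, delete, ...) appear to hang.
func (l *jobListener) startJob(id string) <-chan string {
	l.Lock()
	defer l.Unlock()
	ch := make(chan string)
	l.jobs[id] = ch
	return ch
}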

smarterclayton commented 10 years ago

Destructive job queuing, maybe (start is using "replace"): perhaps the previous systemd job never completes, so control never returns to the event subscription loop.

mfojtik commented 10 years ago

@smarterclayton yeah, that might be it... maybe putting a short sleep into jobComplete, to give systemd time to finish the job, would fix this.

smarterclayton commented 10 years ago

I don't see how sleeping helps - I just meant that there may be a flawed assumption that jobComplete is ever going to happen.
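
To make that assumption concrete: if the waiting side does a bare receive on the result channel, a job whose completion signal never arrives blocks the caller forever. A hypothetical sketch (the timeout branch is only there to make the failure visible; it is not the actual fix):

package main

import (
	"fmt"
	"time"
)

func main() {
	resultCh := make(chan string, 1)
	// Nothing ever writes to resultCh, standing in for a job that systemd
	// replaced before completion. The flawed version is a bare receive:
	//     result := <-resultCh // hangs forever
	select {
	case result := <-resultCh:
		fmt.Println("job result:", result)
	case <-time.After(2 * time.Second):
		fmt.Println("no completion signal; the job was probably replaced")
	}
}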

mfojtik commented 10 years ago

@smarterclayton actually both start and stop use 'replace'

mfojtik commented 10 years ago

@smarterclayton so adding:

time.Sleep(1 * time.Second)

to both stopContainer and startContainer after calling systemd.Connection().StopUnitJob fixes this.

smarterclayton commented 10 years ago

Sure - we just wouldn't use that to fix it. Guess I'm asking what the underlying race condition in go-systemd is.

mfojtik commented 10 years ago

@smarterclayton sure, but it proves there is a race between stop/start in go-systemd, so I will do more digging :-)

BTW, I just found that the StartUnit action returns 'canceled' instead of 'done' when you execute it right after a StopUnit action.
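
A sketch of how to observe this with the coreos/go-systemd dbus API (the unit name testcontainer.service is made up; the possible result strings are "done", "canceled", "timeout", "failed", "dependency" and "skipped"):

package main

import (
	"fmt"

	"github.com/coreos/go-systemd/dbus"
)

func main() {
	conn, err := dbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	stopCh := make(chan string, 1)
	startCh := make(chan string, 1)

	if _, err := conn.StopUnit("testcontainer.service", "replace", stopCh); err != nil {
		panic(err)
	}
	if _, err := conn.StartUnit("testcontainer.service", "replace", startCh); err != nil {
		panic(err)
	}

	fmt.Println("stop result: ", <-stopCh)
	fmt.Println("start result:", <-startCh) // observed: 'canceled' instead of 'done'
}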

mfojtik commented 10 years ago

@smarterclayton how about:

err := systemd.Connection().StopUnitJob(unitName, "replace-irreversibly")

seems to work perfectly fine :-)

If "replace-irreversibly" is specified, operate like "replace", but also mark the new jobs as irreversible. This prevents future conflicting transactions from replacing these jobs (or even being enqueued while the irreversible jobs are still pending). Irreversible jobs can still be cancelled using the cancel command.

smarterclayton commented 10 years ago

Not what we want, if I'm reading correctly. Subsequent start/stop should cancel a stop/start.

mfojtik commented 10 years ago

@smarterclayton well, we can detect whether the start/stop failed with a 'Transaction is destructive' error. systemd will refuse the operation if another operation is currently being executed. In that case we can just wait and retry.
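
A hypothetical sketch of that wait-and-retry idea (not what was eventually merged; see the next comments), assuming mode "fail" so that systemd refuses a conflicting pending job with a 'Transaction is destructive' error instead of replacing it:

package gearutil

import (
	"fmt"
	"strings"
	"time"

	"github.com/coreos/go-systemd/dbus"
)

func startWithRetry(conn *dbus.Conn, unit string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		ch := make(chan string, 1)
		if _, err = conn.StartUnit(unit, "fail", ch); err == nil {
			return nil // job enqueued; its result will arrive on ch
		}
		if !strings.Contains(err.Error(), "Transaction is destructive") {
			return err // some unrelated failure
		}
		time.Sleep(100 * time.Millisecond) // conflicting job pending; back off and retry
	}
	return fmt.Errorf("start %s: still conflicting after %d attempts: %v", unit, attempts, err)
}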

mfojtik commented 10 years ago

@smarterclayton nvm, I just got the 'cancel' part ;-)

smarterclayton commented 10 years ago

We don't want to wait and retry - we want the systemd integration to handle destructive transactions cleanly. There is a bug in go-systemd and it needs to be fixed.

EDIT: since I was the one who added the start job logic, I'm pretty sure this is just a bug in the event listener logic introduced by the new concept - go-systemd just never had to deal with a cancelled job before (an admin could trigger this today by running systemctl at the same time).

mfojtik commented 10 years ago

@smarterclayton I think I have a fix, PR on the way.

itamarst commented 10 years ago

@mfojtik how's it going? I really want to take the sleep(1) out of my code :grin:

smarterclayton commented 10 years ago

This was merged - let us know if it's fixed

itamarst commented 10 years ago

My functional tests no longer lock up geard, so I think it's good. Thank you!