ic-labs / django-icekit

GLAMkit is a next-generation Python CMS by the Interaction Consortium, designed especially for the cultural sector.
http://glamkit.com
MIT License

(ACMI) Deployments often halt waiting for lock #135

Open cogat opened 7 years ago

cogat commented 7 years ago

Deployments wait indefinitely for the lock even though there are no tasks waiting to be run. To recover, I have to go into redis-cli and delete all the keys.

It would be great to help troubleshoot if:

mrmachine commented 7 years ago

@cogat There is a release timeout. The cronlock default is 1 day; I've lowered it to 1 hour in our waitlock.sh wrapper. However, I think any timeout is going to be either too short (risky) or too long (inconvenient), and so will require the same level of manual intervention to fix.

If a problem has occurred that causes a lock to remain open indefinitely (or for a long time), simply expiring it with a short timeout and trying again might often result in the same outcome -- another lock that is stuck open.

Instead, we could try having a single setup service that only ever runs one container and that starts reporting a health status once it has finished. All the other services could then wait during their startup for the setup service to report an OK health status.

This might be easier once Docker Cloud is updated to Docker Engine 1.12, which supports a new health check feature.
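A rough sketch of the "wait for the setup service" side of this idea, using only the standard library. The function name `wait_for_healthy` and its parameters are hypothetical, and `check` stands in for whatever probe the real setup service would expose (an HTTP endpoint, a Docker health status, and so on); none of this is existing ICEkit code.

```python
import time


def wait_for_healthy(check, timeout=300.0, interval=1.0):
    """Poll `check()` until it returns True or `timeout` seconds elapse.

    `check` is a hypothetical callable that probes the setup service.
    Returns True once the service reports healthy, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            # The setup service is not reachable yet; retry until the deadline.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Each dependent service would call something like this before starting its own work, and fail loudly on a `False` result rather than waiting forever.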

Alternatively, we could try switching away from our cronlock (Bash) based wrapper to a python-redis-lock based wrapper. python-redis-lock lets us set a low expiry (e.g. 60s) and then keep renewing it for as long as the process is still running (presumably in a thread).

This is probably a more straightforward change, and something we can try right away.
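To illustrate the short-expiry-plus-renewal idea, here is a minimal sketch that does not touch Redis at all. `FakeStore` is a hypothetical in-memory stand-in for Redis keys with a TTL, and `RenewingLock` is an illustrative class, not python-redis-lock's implementation. The point is only that a background thread keeps extending the expiry while the holder is alive, so a killed holder loses the lock within one expiry period.

```python
import threading
import time
import uuid


class FakeStore:
    """Hypothetical in-memory stand-in for Redis keys with a TTL."""

    def __init__(self):
        self._data = {}  # name -> (owner token, expiry timestamp)
        self._mutex = threading.Lock()

    def set_if_absent(self, name, token, ttl):
        # Roughly what `SET name token NX PX ttl` does in Redis.
        with self._mutex:
            now = time.monotonic()
            current = self._data.get(name)
            if current is not None and current[1] > now:
                return False
            self._data[name] = (token, now + ttl)
            return True

    def extend(self, name, token, ttl):
        # Only the current, unexpired owner may push the expiry forward.
        with self._mutex:
            now = time.monotonic()
            current = self._data.get(name)
            if current is None or current[0] != token or current[1] <= now:
                return False
            self._data[name] = (token, now + ttl)
            return True

    def delete_if_owner(self, name, token):
        with self._mutex:
            current = self._data.get(name)
            if current is not None and current[0] == token:
                del self._data[name]


class RenewingLock:
    """Lock with a short expiry that a heartbeat thread keeps renewing."""

    def __init__(self, store, name, expire=60.0):
        self.store = store
        self.name = name
        self.expire = expire
        self.token = uuid.uuid4().hex
        self._stop = threading.Event()
        self._thread = None

    def acquire(self):
        if not self.store.set_if_absent(self.name, self.token, self.expire):
            return False
        self._stop.clear()
        self._thread = threading.Thread(target=self._renew, daemon=True)
        self._thread.start()
        return True

    def _renew(self):
        # Renew at a third of the expiry; stop on release or if the lock was lost.
        while not self._stop.wait(self.expire / 3):
            if not self.store.extend(self.name, self.token, self.expire):
                return

    def release(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        self.store.delete_if_owner(self.name, self.token)
```

As far as I know, python-redis-lock exposes the same behaviour through the `expire` and `auto_renewal` arguments of `redis_lock.Lock`, but that should be checked against its documentation before we rely on it.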

mrmachine commented 7 years ago

@cogat In the meantime, you can:

The reason we first stop all services instead of simply redeploying the whole stack is that (I suspect):

cogat commented 7 years ago

@mrmachine is this ticket still needed?

mrmachine commented 7 years ago

@cogat Yes. I'd like to try switching to python-redis-lock, which should be a relatively straightforward swap. It might not solve the underlying problem (where something kills a setup process while a lock is open), but a 60s expiration on the locks should make it easier to recover if the setup process is killed.