(ACMI) Deployments often halt waiting for lock

cogat commented 7 years ago

The deployment waits indefinitely even though there are no tasks waiting to be run. I have to go into redis-cli and delete all the keys.

It would help with troubleshooting if:

the key name indicated a bit more about which command issued the key
keys timed out after, say, 10 minutes
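
As a rough illustration of the first request, a wrapper could build the key from the host and the command it is about to run. This is only a sketch: the `lock_key` helper, the `waitlock` prefix, and the key layout are hypothetical and are not taken from the existing wrapper.

```python
import re
import socket


def lock_key(command, prefix="waitlock", host=None):
    """Build a lock key naming the host and command that took the lock.

    `command` is the argument list about to be run under the lock.
    The prefix and layout are illustrative assumptions only.
    """
    host = host or socket.gethostname()
    # Collapse every run of characters outside [A-Za-z0-9._-] into one dash.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", " ".join(command)).strip("-")
    return f"{prefix}:{host}:{slug}"
```

With keys like this, listing the keys in redis-cli would show at a glance which command on which host is holding a lock.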

mrmachine commented 7 years ago

@cogat there is a release timeout. The cronlock default is 1 day. I've lowered it to 1 hour in our waitlock.sh wrapper. However, I think that any timeout is going to be either too short (risky) or too long (inconvenient), and would thus require the same level of manual intervention to fix.

If a problem has occurred that causes a lock to remain open indefinitely (or for a long time), simply expiring it with a short timeout and trying again might often result in the same outcome -- another lock that is stuck open.

Instead, we could try having a single setup service that only ever runs one container and, once it has finished, starts reporting a health status. All the other services could then wait for the setup service to report an OK health status during their startup.

This might be easier once Docker Cloud is updated to Docker Engine 1.12 which supports a new health check feature.
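
A minimal sketch of the waiting side of that idea, assuming the setup service exposes some way to ask "are you done?". The `wait_until_healthy` and `http_probe` names, the URL, and the timings are hypothetical; the real check would depend on how the health status ends up being reported.

```python
import time
import urllib.request


def http_probe(url):
    """Return a probe that is True when `url` answers with HTTP 200."""
    def probe():
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    return probe


def wait_until_healthy(probe, timeout=600.0, interval=5.0):
    """Poll probe() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, DNS failure, HTTP error: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A service entrypoint might call `wait_until_healthy(http_probe("http://setup:8000/health"))` (hypothetical address) and exit with an error if it returns `False`, so a stuck setup shows up as a failed start rather than a silent hang.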

Alternatively, we could try switching away from our cronlock (Bash) based wrapper to a python-redis-lock based wrapper. python-redis-lock has an option where we can set a low expiry (e.g. 60s), and then keep updating it as long as the process is still running (presumably in a thread).

This is probably a more straightforward change, and something we can try right away.
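
To make the renewal idea concrete, here is a small sketch of the pattern itself: a lock with a short expiry that a background thread keeps refreshing while the holder is alive. It deliberately uses an in-memory stand-in instead of Redis, and `ExpiringStore` and `RenewingLock` are hypothetical names, not python-redis-lock's API (that library offers the behaviour through its own expiry and renewal options; check its documentation for the exact names).

```python
import threading
import time


class ExpiringStore:
    """In-memory stand-in for a Redis key with a TTL (illustration only)."""

    def __init__(self):
        self._expiry = {}
        self._mutex = threading.Lock()

    def acquire(self, key, ttl):
        with self._mutex:
            now = time.monotonic()
            if self._expiry.get(key, float("-inf")) > now:
                return False  # still held and not yet expired
            self._expiry[key] = now + ttl
            return True

    def renew(self, key, ttl):
        with self._mutex:
            self._expiry[key] = time.monotonic() + ttl

    def release(self, key):
        with self._mutex:
            self._expiry.pop(key, None)


class RenewingLock:
    """Hold a short-lived lock and refresh it from a background thread."""

    def __init__(self, store, key, ttl=60.0):
        self.store, self.key, self.ttl = store, key, ttl
        self._stop = threading.Event()
        self._thread = None

    def __enter__(self):
        # Block until the current holder releases the lock or it expires.
        while not self.store.acquire(self.key, self.ttl):
            time.sleep(self.ttl / 10)
        self._thread = threading.Thread(target=self._renew, daemon=True)
        self._thread.start()
        return self

    def _renew(self):
        # Refresh well before expiry. If the process is killed, this thread
        # dies with it and the lock lapses after at most `ttl` seconds.
        while not self._stop.wait(self.ttl / 3):
            self.store.renew(self.key, self.ttl)

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.store.release(self.key)
        return False
```

The point of the design is that the expiry can stay short (e.g. 60s) without risking a long-running setup step losing its lock, because the lock only lapses when the renewing process stops.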

mrmachine commented 7 years ago

@cogat In the meantime, you can:

Stop all services
Redeploy only the redis service and do not reuse existing volumes, to clear all redis databases
Start all remaining services

The reason we first stop all services instead of simply redeploying the whole stack is that (I suspect):

redeploying the whole stack (and not reusing existing volumes) will redeploy redis first, which will free up any locks
one of the other services that are still running (waiting to be redeployed sequentially) will immediately acquire a lock
the other services will then be redeployed while holding an open lock, potentially leaving another dangling open lock behind

cogat commented 7 years ago

@mrmachine is this ticket still needed?

mrmachine commented 7 years ago

@cogat Yes. I'd like to try switching to python-redis-lock, which should be a relatively straightforward swap. It might not solve the problem (where something kills a setup process while a lock is open), but it should make recovery easier: with a 60s expiration on the locks, a lock left behind by a killed setup process would clear on its own.

ic-labs / django-icekit

(ACMI) Deployments often halt waiting for lock #135