cogat opened this issue 7 years ago
@cogat There is a release timeout. The `cronlock` default is 1 day; I've lowered it to 1 hour in our `waitlock.sh` wrapper. However, I think any timeout will be either too short (risky) or too long (inconvenient), and will thus require the same level of manual intervention to fix.
If a problem has caused a lock to remain open indefinitely (or for a long time), simply expiring it with a short timeout and trying again might often produce the same outcome: another stuck lock.
Instead, we could try having a single `setup` service that only ever runs one container and starts reporting a health status once it has finished. All the other services could then wait during startup for the `setup` service to report an OK health status.
This might be easier once Docker Cloud is updated to Docker Engine 1.12 which supports a new health check feature.
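As a sketch of what that could look like once health checks are available (the service names, image, and marker file below are illustrative assumptions, not from this repo), later Compose file formats let dependent services wait on a health status:

```yaml
version: "2.1"
services:
  setup:
    image: myapp            # hypothetical image
    command: ./setup.sh     # hypothetical one-shot setup script
    healthcheck:
      # Healthy only once setup has written its completion marker;
      # the container must keep running for the status to stay visible.
      test: ["CMD", "test", "-f", "/var/run/setup-complete"]
      interval: 10s
      retries: 30
  web:
    image: myapp
    depends_on:
      setup:
        condition: service_healthy  # wait for setup to report healthy
```

Note that `condition: service_healthy` requires Compose file format 2.1 or later, so this depends on the platform upgrade mentioned above.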
Alternatively, we could switch from our `cronlock` (Bash) based wrapper to one based on `python-redis-lock`, which has an option to set a low expiry (e.g. 60s) and then keep extending it (presumably in a thread) for as long as the process is still running.
This is probably a more straightforward change, and something we can try right away.
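The renewal idea can be sketched without Redis at all. The toy class below keeps a short expiry alive from a background thread, which is the same pattern python-redis-lock's auto-renewal option implements against a Redis key (the class and method names here are illustrative, not python-redis-lock's real API):

```python
import threading
import time

class RenewableLock:
    """Toy in-memory lock with a short expiry that a background thread
    keeps extending while the holder is alive. If the holder dies, the
    renewals stop and the lock self-expires shortly afterwards."""

    def __init__(self, expire=0.4):
        self.expire = expire
        self.deadline = None
        self._stop = threading.Event()
        self._renewer = None

    def acquire(self):
        self.deadline = time.monotonic() + self.expire
        self._renewer = threading.Thread(target=self._renew, daemon=True)
        self._renewer.start()

    def _renew(self):
        # Extend the deadline at a fraction of the expiry interval,
        # so a healthy holder never lets the lock lapse.
        while not self._stop.wait(self.expire / 3):
            self.deadline = time.monotonic() + self.expire

    def expired(self):
        return self.deadline is not None and time.monotonic() > self.deadline

    def release(self):
        self._stop.set()
        if self._renewer is not None:
            self._renewer.join()
        self.deadline = None

lock = RenewableLock(expire=0.4)
lock.acquire()
time.sleep(1.0)            # well past the base expiry...
assert not lock.expired()  # ...but renewal has kept it alive
lock._stop.set()           # simulate the setup process being killed
lock._renewer.join()
time.sleep(0.6)
assert lock.expired()      # the stale lock clears itself
```

With a real Redis-backed lock the expiry would be 60s rather than fractions of a second, but the recovery property is the same: a killed process stops renewing, so the lock disappears on its own instead of blocking everything until someone deletes it manually.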
@cogat In the meantime, you can:

- stop all services
- redeploy the `redis` service and do not reuse existing volumes, to clear all Redis databases

The reason we first stop all services instead of simply redeploying the whole stack is that (I suspect):
@mrmachine is this ticket still needed?
@cogat Yes. I'd like to try switching to `python-redis-lock`, which should be a relatively straightforward swap. It might not solve the underlying problem (something killing a setup process while a lock is open), but a 60s expiry on the locks should make it easier to recover if the setup process is killed.
It waits indefinitely even though there are no tasks waiting to be run. I have to go into `redis-cli` and delete all the keys.
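For reference, the manual recovery amounts to deleting only the stuck lock keys rather than flushing everything. The sketch below simulates the keyspace with a dict; against a real server this would be done via `redis-cli` (scanning for the lock keys and deleting them), and the `lock:` prefix is an assumption based on python-redis-lock's key naming:

```python
# Simulated Redis keyspace holding stuck locks alongside unrelated data.
keyspace = {
    "lock:setup": "owner-token",    # stuck lock left by a killed process
    "lock:migrate": "owner-token",  # another stuck lock
    "celery-task-meta-1": "done",   # unrelated data we must not touch
}

# Delete only the keys matching the lock prefix, mirroring a scan for
# 'lock:*' piped into DEL on a real server.
for key in [k for k in keyspace if k.startswith("lock:")]:
    del keyspace[key]

assert not any(k.startswith("lock:") for k in keyspace)
assert "celery-task-meta-1" in keyspace  # unrelated data survives
```

Targeting only the lock keys avoids the collateral damage of wiping whole Redis databases when other data shares the instance.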
It would be great to help troubleshoot if: