jmurty opened this issue 7 years ago
I suspect this was caused by the server being under extremely high load while the lock expiry was set to 5 seconds. Under normal circumstances 5 seconds is plenty of time for the `EXTEND` command to be sent to Redis frequently, and the short expiry lets us move on to the next container, or retry after an actual crash/failure, more quickly. Under extremely high load, however, it's possible that the thread sending `EXTEND` commands was delayed for more than 5 seconds, causing it to fail with an error about the lock not having been acquired or having already expired. I think we should increase this to 60 seconds and see if the problem resurfaces.
https://github.com/ixc/ixc-django-docker/blob/master/ixc_django_docker/bin/waitlock.py#L81
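For illustration, a larger expiry combined with python-redis-lock's auto-renewal might look roughly like this (a minimal sketch only; the lock name, Redis host and surrounding structure are placeholders, not the actual waitlock.py code):

```python
import redis
import redis_lock

conn = redis.StrictRedis(host="redis")

# Raise the lock expiry from 5 to 60 seconds; auto_renewal keeps sending
# EXTEND from a background thread while the locked command is still running.
lock = redis_lock.Lock(conn, "migrate", expire=60, auto_renewal=True)
if lock.acquire(blocking=True):
    try:
        pass  # run the one-off startup command here
    finally:
        lock.release()
```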
It might also be worth making the waitlock.py script more resilient to the `NotAcquired` error case. Perhaps we could retry the `lock.acquire()` call up to n times (3?) to avoid failing in this and other edge-case situations where a lock temporarily cannot be acquired or extended?
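A rough sketch of what that retry could look like (purely illustrative; `acquire_with_retries`, the retry count and the delay are made up, and in python-redis-lock `NotAcquired` is normally raised by the extend/release path rather than by `acquire()` itself):

```python
import time

import redis
import redis_lock


def acquire_with_retries(conn, name, expire=60, retries=3, delay=1):
    """Hypothetical helper: try to acquire the lock a few times before failing."""
    for attempt in range(1, retries + 1):
        lock = redis_lock.Lock(conn, name, expire=expire, auto_renewal=True)
        try:
            # Keep timeout below expire, as python-redis-lock requires.
            if lock.acquire(blocking=True, timeout=30):
                return lock
        except redis_lock.NotAcquired:
            # Caught defensively so a transient failure just triggers a retry.
            pass
        time.sleep(delay)
    raise RuntimeError("Could not acquire lock %r after %d attempts" % (name, retries))
```

waitlock.py could then call such a helper in place of a bare `lock.acquire()`.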
@jmurty I think this error won't happen when acquiring a lock, only when attempting to extend a lock that is already acquired while the command is executing. I think it's probably OK to just expect the caller (Python/Bash script or Docker Cloud) to retry on failure. Docker services should already be configured to restart `on-failure` or `always` anyway, and I think that is what was happening here?
Any retry option in waitlock.py should probably be enabled via a command-line arg rather than hard-coded, and that seems like overkill to work around this edge case, which is probably solved with a larger timeout. The real problem is probably that Django migrations (or something else) were causing such extremely high load that the `EXTEND` command was not able to execute.
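If a retry option were ever added, exposing it might look something like this (a hypothetical `--retries` flag; waitlock.py's real argument handling may differ):

```python
import argparse

# Hypothetical flag; not part of waitlock.py's current arguments.
parser = argparse.ArgumentParser(
    description="Run a command once, guarded by a global Redis lock")
parser.add_argument("--retries", type=int, default=1,
                    help="times to retry acquiring/extending the lock before failing")
parser.add_argument("command", nargs=argparse.REMAINDER,
                    help="command to run while holding the lock")
args = parser.parse_args()
# args.retries would then feed into whatever retry loop wraps the lock.
```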
When deploying environments with multiple Django-based containers, all of which will try to do startup tasks like running DB migrations at the same time, we use Redis to acquire "global" locks before running the commands. This ensures that only one container at a time will run DB migrations, or other jobs that only need to be done once.
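As a rough illustration of the pattern (not the actual waitlock.py code; the lock name, Redis host and command are placeholders):

```python
import subprocess

import redis
import redis_lock

conn = redis.StrictRedis(host="redis")

# Every container runs this at startup; only one gets past the lock at a
# time, so the migration is effectively serialised across containers.
with redis_lock.Lock(conn, "django-migrate", expire=60, auto_renewal=True):
    subprocess.check_call(["python", "manage.py", "migrate", "--noinput"])
```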
On a recent update of a staging environment the Redis global lock mechanism failed, causing 4 containers to try to run DB migrations at the same time. This put a large load on the underlying node (since DB migration processing is expensive) and brought everything to a near standstill.
The specific Redis global lock failures were `NotAcquired` errors from the waitlock.py helper script.

We (IC) worked around the problem by stopping the less important containers in the sfmoma-staging stack that were also running migrations in Docker Cloud (celery, celeryflower, celerybeat), to allow just the one django container to do the work.
The root cause seemed to be failures within the `redis_lock` library when attempting to apply an `EXTEND` operation to extend an existing lock. Here is the `extend()` method in `redis_lock`: https://github.com/ionelmc/python-redis-lock/blob/369e95bb5e26284ef0944e551f93d9f2596e5345/src/redis_lock/__init__.py#L243

Ultimately, for some reason the `EXTEND` operation applied via scripting to the Redis locking mechanism returned an error code value of 1. I do not know why, and have been unable to find any useful details or explanation with preliminary research.

See https://github.com/sfmoma/sfmoma/issues/263