jmurty opened this issue 7 years ago
I suspect this was caused by the server being under extremely high load while the lock expiry was set to 5 seconds. Under normal circumstances 5 seconds is plenty of time for the `EXTEND` command to be sent to Redis frequently, and the short expiry lets us move on to the next container, or retry after an actual crash/failure, more quickly. Under extremely high load, however, it's possible that the thread sending `EXTEND` commands was delayed for more than 5 seconds, causing it to fail with an error about the lock not having been acquired or having already expired. I think we should increase this to 60 seconds and see if the problem resurfaces.
https://github.com/ixc/ixc-django-docker/blob/master/ixc_django_docker/bin/waitlock.py#L81
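For illustration, a larger expiry combined with python-redis-lock's auto-renewal might look roughly like this (a minimal sketch only; the lock name, Redis host and surrounding structure are placeholders, not the actual waitlock.py code):

```python
import redis
import redis_lock

conn = redis.StrictRedis(host="redis")

# Raise the lock expiry from 5 to 60 seconds; auto_renewal keeps sending
# EXTEND from a background thread while the locked command is still running.
lock = redis_lock.Lock(conn, "migrate", expire=60, auto_renewal=True)
if lock.acquire(blocking=True):
    try:
        pass  # run the one-off startup command here
    finally:
        lock.release()
```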
It might also be worth making the waitlock.py script more resilient to the `NotAcquired` error case. Perhaps we could retry the `lock.acquire()` call up to n times (3?) to avoid failing in this and other edge-case situations where a lock temporarily cannot be acquired or extended?
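A rough sketch of what that retry could look like (purely illustrative; `acquire_with_retries`, the retry count and the delay are made up, and in python-redis-lock `NotAcquired` is normally raised by the extend/release path rather than by `acquire()` itself):

```python
import time

import redis
import redis_lock


def acquire_with_retries(conn, name, expire=60, retries=3, delay=1):
    """Hypothetical helper: try to acquire the lock a few times before failing."""
    for attempt in range(1, retries + 1):
        lock = redis_lock.Lock(conn, name, expire=expire, auto_renewal=True)
        try:
            # Keep timeout below expire, as python-redis-lock requires.
            if lock.acquire(blocking=True, timeout=30):
                return lock
        except redis_lock.NotAcquired:
            # Caught defensively so a transient failure just triggers a retry.
            pass
        time.sleep(delay)
    raise RuntimeError("Could not acquire lock %r after %d attempts" % (name, retries))
```

waitlock.py could then call such a helper in place of a bare `lock.acquire()`.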
@jmurty I think this error won't happen when acquiring a lock, only when attempting to extend a lock that is already acquired while the command is executing. I think it's probably OK to just expect the caller (Python/Bash script or Docker Cloud) to retry on failure. Docker services should already be configured to restart `on-failure` or `always` anyway, and I think that is what was happening here?
Any retry option in waitlock.py should probably be enabled via a command-line arg rather than hard-coded, and that seems like overkill to work around this edge case, which is probably solved with a larger timeout. The real problem is probably that Django migrations (or something else) were causing such extremely high load that the `EXTEND` command was not able to execute.
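If a retry option were ever added, exposing it might look something like this (a hypothetical `--retries` flag; waitlock.py's real argument handling may differ):

```python
import argparse

# Hypothetical flag; not part of waitlock.py's current arguments.
parser = argparse.ArgumentParser(
    description="Run a command once, guarded by a global Redis lock")
parser.add_argument("--retries", type=int, default=1,
                    help="times to retry acquiring/extending the lock before failing")
parser.add_argument("command", nargs=argparse.REMAINDER,
                    help="command to run while holding the lock")
args = parser.parse_args()
# args.retries would then feed into whatever retry loop wraps the lock.
```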
When deploying environments with multiple Django-based containers, all of which will try to do startup tasks like running DB migrations at the same time, we use Redis to acquire "global" locks before running the commands. This ensures that only one container at a time will run DB migrations, or other jobs that only need to be done once.
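As a rough illustration of the pattern (not the actual waitlock.py code; the lock name, Redis host and command are placeholders):

```python
import subprocess

import redis
import redis_lock

conn = redis.StrictRedis(host="redis")

# Every container runs this at startup; only one gets past the lock at a
# time, so the migration is effectively serialised across containers.
with redis_lock.Lock(conn, "django-migrate", expire=60, auto_renewal=True):
    subprocess.check_call(["python", "manage.py", "migrate", "--noinput"])
```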
On a recent update of a staging environment the Redis global lock mechanism failed, causing 4 containers to try to run DB migrations at the same time. This put a large load on the underlying node (since DB migration processing is expensive) and brought everything to a near standstill.
The specific Redis global lock failures were `NotAcquired` errors from the waitlock.py helper script.

We (IC) worked around the problem by stopping the less important containers in the sfmoma-staging stack that were also running migrations in Docker Cloud (celery, celeryflower, celerybeat), to allow just the one django container to do the work.
The root cause seemed to be failures within the `redis_lock` library when attempting to apply an `EXTEND` operation to extend an existing lock. Here is the `extend()` method in `redis_lock`: https://github.com/ionelmc/python-redis-lock/blob/369e95bb5e26284ef0944e551f93d9f2596e5345/src/redis_lock/__init__.py#L243

Ultimately, for some reason the `EXTEND` operation applied via scripting to the Redis locking mechanism returned an error code value of 1. I do not know why, and have been unable to find any useful details or explanation with preliminary research.

See https://github.com/sfmoma/sfmoma/issues/263