TauPan opened this issue 3 years ago
There is a case I just saw where the sequence

if lock.locked():
    lock.release()

led to NotAcquired. The lock was created with expire=250 and auto_renewal=True by a modified version of the code mentioned above:
from contextlib import contextmanager

from celery.exceptions import Ignore


@contextmanager
def lock_or_fail(key, expire=60):
    import redis_lock
    # get_redis() is my own helper; it returns the celery result backend client.
    lock = redis_lock.Lock(get_redis(), name=key, expire=expire, auto_renewal=True)
    if lock.acquire(blocking=False):
        try:
            yield
        finally:
            if lock.locked():
                lock.release()
    else:
        raise Ignore(f'Could not acquire lock for {key}')
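One possible variant, sketched here purely for illustration (not the code I'm actually running), would be to catch the exception on release instead of checking locked() first:

```python
from contextlib import contextmanager

from celery.exceptions import Ignore
import redis_lock


@contextmanager
def lock_or_fail(key, expire=60):
    lock = redis_lock.Lock(get_redis(), name=key, expire=expire, auto_renewal=True)
    if lock.acquire(blocking=False):
        try:
            yield
        finally:
            try:
                lock.release()
            except redis_lock.NotAcquired:
                # The lock already expired (despite auto_renewal) or was lost;
                # treat the release as a no-op instead of failing the task.
                pass
    else:
        raise Ignore(f'Could not acquire lock for {key}')
```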
It seems there are cases where auto_renewal does not work at all. Could system load play a role here?
Regarding system load:
Server time of the error is: Wed, 17 Mar 2021 04:37:42 +0100
A vmstat log from that period shows around 4 to 6 processes waiting for IO while a redis persist took unusually long:
vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
r b swpd free buff cache si so bi bo in cs us sy id wa st CET
0 5 5606152 211884 0 473552 178 0 31535 173 3908 4681 33 5 19 43 0 2021-03-17 04:37:02
1 5 5605896 194356 0 487448 110 0 34938 111 3578 4216 27 4 19 49 0 2021-03-17 04:37:07
1 5 5605128 199356 0 478088 118 0 37834 617 3961 4775 35 5 17 43 0 2021-03-17 04:37:12
1 6 5605128 179020 0 494164 86 0 36410 58 4012 4911 29 5 13 53 0 2021-03-17 04:37:17
2 4 5605128 186804 0 480912 77 0 36622 37 3843 4520 30 5 14 51 0 2021-03-17 04:37:22
1 6 5604872 212164 0 439200 102 0 44735 126 3049 3792 15 4 21 60 0 2021-03-17 04:37:27
4 4 5604872 278004 0 546664 98 0 21476 2 3623 4552 24 5 24 47 0 2021-03-17 04:37:32
2 4 5604872 217044 0 603592 93 0 24533 78 4412 4864 41 5 11 42 0 2021-03-17 04:37:37
3 5 5604628 201288 0 615224 161 2 29806 92 3293 3287 31 5 9 55 0 2021-03-17 04:37:42
2 6 5603860 185128 0 630036 237 0 36338 533 3584 3466 37 6 7 50 0 2021-03-17 04:37:47
1 5 5603348 223440 0 648424 186 0 26106 55 3065 3178 29 5 10 56 0 2021-03-17 04:37:52
1 4 5602836 202532 0 663344 211 0 8783 26 2982 3171 28 4 10 57 0 2021-03-17 04:37:57
redis.log:
834:M 17 Mar 04:26:47.149 * 10 changes in 300 seconds. Saving...
834:M 17 Mar 04:26:47.373 * Background saving started by pid 19480
19480:C 17 Mar 04:55:58.369 * DB saved on disk
19480:C 17 Mar 04:55:58.472 * RDB: 12 MB of memory used by copy-on-write
834:M 17 Mar 04:55:58.587 * Background saving terminated with success
While rabbitmq has no entries around that time, the celery master worker reports a missed heartbeat from files@... (which is the worker queue running the task that fails to unlock itself) before and after the error.
So it's safe to say that things took unusually long. I don't know if redis is still responsive while saving; if not, this would definitely explain why the lock could not be renewed in time. (And it could also explain a very long gap between the call to lock.locked() and lock.release().)
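If it happens again, something like this rough snippet (reading redis-py's INFO persistence fields) could tell me whether a background save was in progress at the time:

```python
def bgsave_in_progress(client):
    # True while redis is writing an RDB snapshot in the background.
    info = client.info('persistence')
    return bool(info.get('rdb_bgsave_in_progress'))

# e.g. logged right before releasing the lock:
# logger.warning('bgsave running: %s', bgsave_in_progress(get_redis()))
```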
So my guess would be that this is most likely not a bug in python-redis-lock as things just take too long... (And that wouldn't be a problem in production for me either.)
Feel free to close if you agree with that last assessment. I'm going to set higher expiry times, which should also work in production. I don't want to work around too much of this in production code, so I'm going to live with an occasional occurrence of this on my development VMs.
What sort of work do you do in the tasks? The lock auto-renewal runs in a thread - if the main thread holds the GIL with something, it could fail to renew, yes. But that seems so unlikely...

Maybe there's some connection management issue (eg: multiple things going on on the same connection instance, thus the occasional failure)? Does get_redis reuse connections used for other things?
> What sort of work do you do in the tasks? The lock auto-renewal runs in a thread - if the main thread holds the GIL with something, it could fail to renew, yes. But that seems so unlikely...
The tasks in question do lots of database queries (mostly selects, but some updates as well) inside a transaction (django) and dump the contents as gzipped json.

The database layer is psycopg2. I think it doesn't hold the GIL, but I don't know for sure.
In any case, increasing the expire timeout has helped for those tasks. (I didn't even have to get close to the running time; it was enough to raise it to something short of 5 minutes.)

Also, I'm very sure that multiple processes were blocking because of IO (swap due to tight memory conditions, or network). Since I moved the VM to a different host with more memory, I haven't seen the exceptions any more.
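Roughly what the raised timeout looks like in the code (the key and the exact value here are only examples):

```python
# Default was lock_or_fail(key) with expire=60; something short of
# five minutes was already enough for these tasks.
with lock_or_fail('dump-files', expire=290):
    run_export()  # placeholder for the actual task body
```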
> Maybe there's some connection management issue (eg: multiple things going on on the same connection instance, thus the occasional failure)? Does get_redis reuse connections used for other things?
In the expected case it does use the connection pool from celery/kombu (in the fallback case it won't). If I interpret https://docs.celeryproject.org/projects/kombu/en/stable/userguide/pools.html correctly, there's a preset limit on the number of connections in the pool:
In [1]: from kombu import pools
In [2]: pools.get_limit()
Out[2]: 200
However, we never have as many as 200 workers running and there are no other clients on the redis server, so this should be more than enough. I hope that answers your second question?
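A simplified sketch of what get_redis does (not the literal code; the fallback URL is a placeholder):

```python
import redis
from celery import current_app


def get_redis():
    # Expected case: reuse the client of the celery redis result backend,
    # which manages its own connection pool.
    client = getattr(current_app.backend, 'client', None)
    if isinstance(client, redis.StrictRedis):
        return client
    # Fallback case: a dedicated client with no shared pool (placeholder URL).
    return redis.StrictRedis.from_url('redis://localhost:6379/0')
```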
psycopg2 may in fact have a problem with its default waiter implementation, see: https://www.psycopg.org/docs/extensions.html#psycopg2.extensions.set_wait_callback

I've experienced problems with the default waiter in one project (those time limit signals weren't handled). Pretty sure psycopg2 holds the GIL with the default waiter :-)

psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select) may be enough for you, or you may need to copy off that wait_select code and tweak it a bit around the error paths.
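A minimal sketch of wiring that up (where exactly you register it, e.g. at worker startup, is up to you):

```python
import psycopg2.extensions
import psycopg2.extras

# Swap psycopg2's default blocking waiter for the select()-based one.
psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)
```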
> I've experienced problems with the default waiter in one project (those time limit signals weren't handled). Pretty sure psycopg2 holds the GIL with the default waiter :-)
>
> psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select) may be enough for you, or you may need to copy off that wait_select code and tweak it a bit around the error paths.
Wow, thanks for the hint! I'll see if those errors resurface and consider that workaround if they do.

Edit: In fact the documentation mentions Eventlet, and I was considering switching to that (currently we use the multiprocessing backend), so it definitely looks like that might help if I do. (Currently there's no demand for it though, and the loads look like Eventlet might not help much.)
Hi!
I've (sort of) successfully implemented an alternative locking mechanism for celery tasks (see https://docs.celeryproject.org/en/3.1/tutorials/task-cookbook.html#cookbook-task-serial ) with python-redis-lock and it works like a charm (again: sort of), except that I was getting some NotAcquired exceptions from tasks that run longer (a couple of minutes and in some cases hours).
Note that I'm not sure if I merely need assistance or if this is an actual bug.
- CentOS 7 (core), Linux kernel 3.10.0
- python-redis-lock: 3.7.0
- redis: 3.2.12
- python: 3.6.8
- django: 2.2.14
- celery: 3.1.18
I've implemented a context manager that acquires the locks (the lock_or_fail shown further up in this thread is a modified version of it).

(On my setup, get_redis() always returns the celery result backend, which happens to be an instance of StrictRedis, if that might be an issue.)

A celery task uses this as follows:
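(A trimmed-down sketch of the shape, with placeholder names:)

```python
from celery import shared_task


@shared_task
def export_files():  # placeholder task name
    # lock_or_fail raises celery's Ignore when the lock is already held,
    # so a concurrent invocation gives up instead of running twice.
    with lock_or_fail('export_files'):
        run_export()  # placeholder for the real work
```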
This mostly works; however, I have some tasks that can run for a couple of minutes and (very few) tasks that can run for hours. For those I got a couple of NotAcquired exceptions ("Lock footask_ is not acquired or it already expired.") raised from the release() call above.

I didn't figure out whether the exceptions were raised while the machine was under high load or tight memory conditions. It's running on a VM which has recently been quite laggy. (I also haven't figured out why, yet.) Redis runs on the same host, however.
I think using the Lock as a context manager (as in https://python-redis-lock.readthedocs.io/en/latest/readme.html#troubleshooting ) probably won't help here, since it's basically the same code as in my context manager.
As a workaround I'll probably check the lock with lock.locked(), but I'm wondering if that's the right thing here. Should I raise the expiry time so it's closer to the order of the expected runtime of the task? Would signal_expire help here? Did I make some other mistake you can spot? Is there any information I left out or that I need to check?
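For concreteness, raising the expiry would just mean something like this where the lock is constructed (the value is only an example; expire is in seconds):

```python
lock = redis_lock.Lock(
    get_redis(),
    name=key,
    expire=600,         # example: raised towards the expected task runtime
    auto_renewal=True,  # the renewal thread keeps extending this while the lock is held
)
```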
Best regards