ionelmc / python-redis-lock

Lock context manager implemented via redis SET NX EX and BLPOP.
https://pypi.python.org/pypi/python-redis-lock
BSD 2-Clause "Simplified" License

NotAcquired with longer celery jobs with short expire and auto_renewal=True #85

Open TauPan opened 3 years ago

TauPan commented 3 years ago

Hi!

I've (sort of) successfully implemented an alternative locking mechanism for celery tasks (see https://docs.celeryproject.org/en/3.1/tutorials/task-cookbook.html#cookbook-task-serial ) with python-redis-lock, and it works like a charm (again: sort of), except that I was getting some NotAcquired exceptions from longer-running tasks (a couple of minutes, and in some cases hours).

Note that I'm not sure if I merely need assistance or if this is an actual bug.

CentOS 7 (Core), Linux kernel 3.10.0
python-redis-lock: 3.7.0
redis: 3.2.12
python: 3.6.8
django: 2.2.14
celery: 3.1.18

I've implemented a context manager that acquires the locks:

from contextlib import contextmanager

import redis_lock
from celery.exceptions import Ignore
from redis import Redis, StrictRedis

@contextmanager
def lock_or_fail(key):
    lock = redis_lock.Lock(get_redis(), name=key, expire=60, auto_renewal=True)
    if lock.acquire(blocking=False):
        try:
            yield
        finally:
            lock.release()
    else:
        raise Ignore(f'Could not acquire lock for {key}')

@contextmanager
def lock_task(task, check_kwargs=False):
    # key_for_task() builds the lock key from the task (project-specific helper, not shown here)
    with lock_or_fail(key_for_task(task, check_kwargs=check_kwargs)):
        yield

def get_redis():
    celery_result_backend_client = my_app.celery_app.backend.client
    if isinstance(celery_result_backend_client, Redis):
        return celery_result_backend_client
    else:
        return StrictRedis()

(On my setup, get_redis() always returns the celery result backend's client, which happens to be an instance of StrictRedis, in case that might be an issue.)

A celery task uses this as follows:

@shared_task(bind=True)
def do_something(self):
    with lock_task(self):
        do_some_other_stuff()

This mostly works, however I have some tasks that can run a couple of minutes and (very few) tasks that can run for hours.

For those I got a couple of NotAcquired exceptions ("Lock footask_ is not acquired or it already expired."), raised from the release() call above.

I couldn't figure out whether the exceptions were raised while the machine was under high load or tight memory conditions. It's running on a VM which has recently been quite laggy. (I haven't figured out why yet, either.) Redis runs on the same host, however.

I think using the Lock as a context manager (as in https://python-redis-lock.readthedocs.io/en/latest/readme.html#troubleshooting ) probably won't help here, since it's basically the same code as in my context manager.
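For reference, the context-manager form from that troubleshooting page would look roughly like this (just a sketch reusing get_redis() and key from my code above, with the same keyword arguments I already pass):

with redis_lock.Lock(get_redis(), name=key, expire=60, auto_renewal=True):
    do_some_other_stuff()

which is essentially what my lock_or_fail does, minus the blocking=False / Ignore handling.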

As a workaround I'll probably check the lock with lock.locked() before releasing, but I'm wondering if that's the right thing here.

Should I raise the expiry time so it's closer to the expected runtime of the task? Would signal_expire help here?

Did I make some other mistake you can spot? Any information I left out or that I need to check?

Best regards

TauPan commented 3 years ago

There is a case I just saw where the sequence:

if lock.locked():
    lock.release()

led to NotAcquired. The lock was created with expire=250 and auto_renewal=True by a modified version of the code mentioned above:

@contextmanager
def lock_or_fail(key, expire=60):
    import redis_lock
    lock = redis_lock.Lock(get_redis(), name=key, expire=expire, auto_renewal=True)
    if lock.acquire(blocking=False):
        try:
            yield
        finally:
            if lock.locked():
                lock.release()
    else:
        raise Ignore(f'Could not acquire lock for {key}')

It seems there are cases where auto_renewal does not work at all. Could system load play a role here?

TauPan commented 3 years ago

Regarding system load:

Server time of the error is: Wed, 17 Mar 2021 04:37:42 +0100

A vmstat log from around that time shows 4 to 6 processes waiting for I/O while a redis persist took unusually long:

vmstat:

  procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                 CET
   0  5 5606152 211884      0 473552  178    0 31535   173 3908 4681 33  5 19 43  0 2021-03-17 04:37:02
   1  5 5605896 194356      0 487448  110    0 34938   111 3578 4216 27  4 19 49  0 2021-03-17 04:37:07
   1  5 5605128 199356      0 478088  118    0 37834   617 3961 4775 35  5 17 43  0 2021-03-17 04:37:12
   1  6 5605128 179020      0 494164   86    0 36410    58 4012 4911 29  5 13 53  0 2021-03-17 04:37:17
   2  4 5605128 186804      0 480912   77    0 36622    37 3843 4520 30  5 14 51  0 2021-03-17 04:37:22
   1  6 5604872 212164      0 439200  102    0 44735   126 3049 3792 15  4 21 60  0 2021-03-17 04:37:27
   4  4 5604872 278004      0 546664   98    0 21476     2 3623 4552 24  5 24 47  0 2021-03-17 04:37:32
   2  4 5604872 217044      0 603592   93    0 24533    78 4412 4864 41  5 11 42  0 2021-03-17 04:37:37
   3  5 5604628 201288      0 615224  161    2 29806    92 3293 3287 31  5  9 55  0 2021-03-17 04:37:42
   2  6 5603860 185128      0 630036  237    0 36338   533 3584 3466 37  6  7 50  0 2021-03-17 04:37:47
   1  5 5603348 223440      0 648424  186    0 26106    55 3065 3178 29  5 10 56  0 2021-03-17 04:37:52
   1  4 5602836 202532      0 663344  211    0  8783    26 2982 3171 28  4 10 57  0 2021-03-17 04:37:57

redis.log:

834:M 17 Mar 04:26:47.149 * 10 changes in 300 seconds. Saving...
834:M 17 Mar 04:26:47.373 * Background saving started by pid 19480
19480:C 17 Mar 04:55:58.369 * DB saved on disk
19480:C 17 Mar 04:55:58.472 * RDB: 12 MB of memory used by copy-on-write
834:M 17 Mar 04:55:58.587 * Background saving terminated with success

While rabbitmq has no entries around that time, the celery master worker reports a missed heartbeat from files@... (the worker queue running the task that failed to unlock itself) before and after the error.

So it's safe to say that things took unusually long. I don't know if redis is still responsive while saving; if not, that would definitely explain why the lock could not be renewed in time. (It could also explain a very long gap between the call to lock.locked() and lock.release().)

So my guess would be that this is most likely not a bug in python-redis-lock as things just take too long... (And that wouldn't be a problem in production for me either.)

Feel free to close if you agree with that last assessment. I'm going to set higher expiry times, which should also work in production. I don't want to work around too much of this in production code, so I'm going to live with an occasional occurrence of this on my development VMs.

ionelmc commented 3 years ago

What sort of work do you do in the tasks? The lock auto renewal runs in a thread - if you'd have the main thread hold the GIL with something it could fail to renew, yes. But that seems so unlikely...
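Very roughly, the renewal mechanism amounts to something like this (a conceptual sketch, not the actual library code; it assumes Lock.extend() and a renewal interval of roughly two thirds of expire):

import threading

def start_auto_renewal(lock, expire):
    # Conceptual sketch of auto_renewal: a background thread keeps extending
    # the key's TTL well before it runs out.
    stop = threading.Event()

    def renew():
        # If this thread doesn't get scheduled in time (main thread holding the
        # GIL in a C call, heavy swapping, a slow redis), the TTL runs out and a
        # later release() raises NotAcquired despite auto_renewal=True.
        while not stop.wait(timeout=expire * 2 / 3):  # assumed interval
            lock.extend(expire=expire)

    threading.Thread(target=renew, daemon=True).start()
    return stop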

Maybe there's some connection management issue (eg: multiple things going on on the same connection instance, thus the occasional failure)? Does get_redis reuse connections used for other things?

TauPan commented 3 years ago

What sort of work do you do in the tasks? The lock auto renewal runs in a thread - if you'd have the main thread hold the GIL with something it could fail to renew, yes. But that seems so unlikely...

The tasks in question do lots of database queries (mostly selects, but some updates as well) in a transaction (django) and dump the contents as gzipped json.

The database layer is psycopg2. I think it doesn't hold the GIL, but I don't know for sure.

In any case, increasing the expire timeout has helped for those tasks. (I didn't even have to get close to the running time; it was enough to raise it to something short of 5 minutes.)
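Concretely, the longer tasks now do something like this (280 is just an example value short of five minutes; lock_or_fail and key_for_task are the helpers from my earlier comments):

with lock_or_fail(key_for_task(task, check_kwargs=False), expire=280):
    do_some_other_stuff()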

Also, I'm very sure that multiple processes were blocking because of IO (swap due to tight memory conditions, or network). Since I moved the VM to a different host with more memory, I haven't seen the exceptions any more.

Maybe there's some connection management issue (eg: multiple things going on on the same connection instance, thus the occasional failure)? Does get_redis reuse connections used for other things?

In the expected case it does use the connection pool from celery/kombu (in the fallback case it won't).

If I interpret https://docs.celeryproject.org/projects/kombu/en/stable/userguide/pools.html correctly, there's a preset limit on the number of connections in the pool:

In [1]: from kombu import pools

In [2]: pools.get_limit()
Out[2]: 200

However, we never have as many as 200 workers running and there are no other clients on the redis server, so this should be more than enough. I hope that answers your second question?

ionelmc commented 3 years ago

psycopg2 may in fact have a problem with its default waiter implementation, see: https://www.psycopg.org/docs/extensions.html#psycopg2.extensions.set_wait_callback

ionelmc commented 3 years ago

I've experienced problems with the default waiter in one project (those time limit signals weren't handled). Pretty sure psycopg2 holds the GIL with the default waiter :-)

psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select) may be enough for you, or you may need to copy off that wait_select code and tweak it a bit around the error paths.
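Something along these lines, registered early at process start-up (I think it has to run before the worker creates its database connections), should be enough to try it out:

import psycopg2.extensions
import psycopg2.extras

# Register the select()-based wait callback so queries are waited on from
# Python via select() instead of blocking inside libpq.
psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)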

TauPan commented 3 years ago

I've experienced problems with the default waiter in one project (those time limit signals weren't handled). Pretty sure psycopg2 holds the GIL with the default waiter :-)

psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select) may be enough for you, or you may need to copy off that wait_select code and tweak it a bit around the error paths.

Wow, thanks for the hint! I'll see if those errors resurface and consider that workaround if they do.

Edit: In fact the documentation mentions Eventlet, and I was considering switching to that (currently we use the multiprocessing backend); it definitely looks like that might help if I do. (Currently there's no demand for that though, and the loads look like Eventlet might not help much.)