sibson / redbeat

RedBeat is a Celery Beat Scheduler that stores the scheduled tasks and runtime metadata in Redis.

LockError: Cannot release an unlocked lock #129

Open davidjmemmett opened 5 years ago

davidjmemmett commented 5 years ago

During shutdown of celery beat when using the redbeat scheduler, we have seen the following exceptions:

  File "celery/beat.py", line 592, in start
    time.sleep(interval)
  File "celery/apps/beat.py", line 154, in _sync
    service.sync()
  File "celery/beat.py", line 601, in sync
    self.scheduler.close()
  File "redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "redis/lock.py", line 135, in release
    self.do_release(expected_token)
  File "redis/lock.py", line 263, in do_release
    client=self.redis)):
  File "redis/client.py", line 2944, in __call__
    return client.evalsha(self.sha, len(keys), *args)
  File "redis/client.py", line 2079, in evalsha
    return self.execute_command('EVALSHA', sha, numkeys, *keys_and_args)
  File "redis/client.py", line 668, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "redis/client.py", line 680, in parse_response
    response = connection.read_response()
  File "redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "redis/connection.py", line 174, in _read_from_socket
    data = recv(self._sock, socket_read_size)
  File "redis/_compat.py", line 79, in recv
    return sock.recv(*args, **kwargs)
  File "celery/apps/beat.py", line 154, in _sync
    service.sync()
  File "celery/beat.py", line 601, in sync
    self.scheduler.close()
  File "redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "redis/lock.py", line 133, in release
    raise LockError("Cannot release an unlocked lock")

  File "python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "__main__.py", line 20, in <module>
    main()
  File "__main__.py", line 16, in main
    _main()
  File "celery/bin/celery.py", line 322, in main
    cmd.execute_from_commandline(argv)
  File "celery/bin/celery.py", line 496, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "celery/bin/base.py", line 275, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "celery/bin/celery.py", line 488, in handle_argv
    return self.execute(command, argv)
  File "celery/bin/celery.py", line 420, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "celery/bin/base.py", line 279, in run_from_argv
    sys.argv if argv is None else argv, command)
  File "celery/bin/base.py", line 363, in handle_argv
    return self(*args, **options)
  File "celery/bin/base.py", line 238, in __call__
    ret = self.run(*args, **kwargs)
  File "celery/bin/beat.py", line 109, in run
    return beat().run()
  File "celery/apps/beat.py", line 81, in run
    self.start_scheduler()
  File "celery/apps/beat.py", line 109, in start_scheduler
    service.start()
  File "celery/beat.py", line 598, in start
    self.sync()
  File "celery/beat.py", line 601, in sync
    self.scheduler.close()
  File "redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "redis/lock.py", line 133, in release
    raise LockError("Cannot release an unlocked lock")
```
laixintao commented 5 years ago

It seems the lock has expired; RedBeat runs PEXPIRE on the lock key on every tick.

Can you show us your redbeat_lock_timeout and beat_max_loop_interval?

davidjmemmett commented 5 years ago

They haven't been set, so I assume they're whatever the defaults are.

berlotto commented 5 years ago

+1

tlinhart commented 4 years ago

I'm experiencing probably the same or similar behaviour. The setup is as follows. I have 3 instances of Dockerized Redbeat (version 0.13.0) running on 3 different hosts. Celery (version 4.3.0) logs to a file with INFO level and all the logs are shipped to ELK (via Filebeat). In the configuration (redbeat_redis_url), the Redis instance is defined using a hostname. Also, the redbeat_lock_timeout is set to 60 seconds and beat_max_loop_interval to 10 seconds.
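
For reference, a minimal sketch of the configuration described above (the Redis hostname and app name are placeholders, not the actual ones):

```python
# Sketch of the Celery/RedBeat settings described above; the Redis hostname
# and the app name are placeholders, and only the values mentioned in this
# comment are shown.
from celery import Celery

app = Celery('tasks')
app.conf.update(
    redbeat_redis_url='redis://redis.example.internal:6379/0',  # Redis addressed by hostname
    redbeat_lock_timeout=60,    # seconds before an unrefreshed lock expires
    beat_max_loop_interval=10,  # beat wakes up (and extends the lock) every 10 s
)
```

Beat itself is started with something like `celery -A tasks beat -S redbeat.RedBeatScheduler`.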

Under normal circumstances, everything works fine. Rarely, though, there is a DNS resolution issue, which leaves the Redbeat instance currently holding the lock unable to communicate with Redis. This doesn't, however, stop that instance from sending tasks... In the meantime, one of the other two instances acquires the lock as well, which ultimately leads to two Celery beat schedulers running at the same time and sending the tasks.

Here's what I observe. On the standard output of the Redbeat instance which experiences the DNS issue, I see these messages (using docker logs):

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 601, in start
    time.sleep(interval)
  File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 154, in _sync
    service.sync()
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
    self.scheduler.close()
  File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 226, in release
    self.do_release(expected_token)
  File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 232, in do_release
    raise LockNotOwnedError("Cannot release a lock"
redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/celery", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/celery/__main__.py", line 16, in main
    _main()
  File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 322, in main
    cmd.execute_from_commandline(argv)
  File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 496, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 298, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 488, in handle_argv
    return self.execute(command, argv)
  File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 420, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 302, in run_from_argv
    sys.argv if argv is None else argv, command)
  File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 386, in handle_argv
    return self(*args, **options)
  File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 252, in __call__
    ret = self.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/celery/bin/beat.py", line 109, in run
    return beat().run()
  File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 81, in run
    self.start_scheduler()
  File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 109, in start_scheduler
    service.start()
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 607, in start
    self.sync()
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
    self.scheduler.close()
  File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 224, in release
    raise LockError("Cannot release an unlocked lock")
redis.exceptions.LockError: Cannot release an unlocked lock
```

Then, when restarting this Redbeat instance (which solves the problem), I see this message in the logs (in ELK/Kibana):

```
[2019-10-03 15:28:46,137: CRITICAL/MainProcess] beat raised exception <class 'redis.exceptions.LockError'>: LockError('Cannot release an unlocked lock',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 601, in start
    time.sleep(interval)
  File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 154, in _sync
    service.sync()
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
    self.scheduler.close()
  File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 226, in release
    self.do_release(expected_token)
  File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 232, in do_release
    raise LockNotOwnedError("Cannot release a lock"
redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 109, in start_scheduler
    service.start()
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 607, in start
    self.sync()
  File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
    self.scheduler.close()
  File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
    self.lock.release()
  File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 224, in release
    raise LockError("Cannot release an unlocked lock")
redis.exceptions.LockError: Cannot release an unlocked lock
```

I haven't looked into the code yet, but maybe someone can guide me in the meantime. Where's the problem? Why is the original Redbeat instance still sending the tasks when it's unable to reach the Redis instance?

EDIT: It's quite easy to simulate the incorrect behavior of two instances both thinking they hold the lock:

  1. Run two instances of the same Docker image on the same host (naming the containers e.g. beat1 and beat2). Only one of them (let's say beat1) acquires the lock while the other one (beat2) hangs on
    [2019-10-03 17:24:08,565: DEBUG/MainProcess] beat: Acquiring lock...
  2. Now disconnect beat1 from the network using docker network disconnect bridge beat1. It'll start hanging on
    [2019-10-03 18:00:40,848: DEBUG/MainProcess] beat: Extending lock...
  3. After some time, beat2 takes over the work, starting by logging
    [2019-10-03 17:33:19,090: DEBUG/MainProcess] beat: Extending lock...
    [2019-10-03 17:33:19,093: DEBUG/MainProcess] Selecting tasks
    ...
  4. After some time, reconnect beat1 to the network using docker network connect bridge beat1.
  5. After some time, beat1 starts to load tasks as if it were still holding the lock, while logging
    [2019-10-03 18:04:14,795: DEBUG/MainProcess] Selecting tasks
    [2019-10-03 18:04:14,797: INFO/MainProcess] Loading 0 tasks
    [2019-10-03 18:04:14,798: DEBUG/MainProcess] beat: Waking up in 10.00 seconds.
    [2019-10-03 18:04:24,805: DEBUG/MainProcess] beat: Synchronizing schedule...
    [2019-10-03 18:04:24,806: DEBUG/MainProcess] beat: Extending lock...
    [2019-10-03 18:04:24,808: DEBUG/MainProcess] Selecting tasks
    ...

This is a different kind of problem, I guess, so maybe I should open a new issue?

EDIT 2: Tested from the master branch and it works. More precisely, it doesn't fall into the incorrect state: while extending the lock on the instance that has just become connected to the network again, it raises an exception saying it can't extend a lock which is not owned. This works in my setup since I run it inside Docker with --restart=always. When the Docker container restarts, the instance hangs at Acquiring lock..., which is correct behavior in the end.

rakeeb-hossain commented 2 years ago

Bump. I am facing this issue and it oddly seems to be leading to loss of data on my single-instance Redis server. Has there been any headway on this?

seocam commented 2 years ago

I'm getting the same error but when my redis server restarts. I suppose that's a different problem, right? Should I open another issue?

nigel-gott commented 2 years ago

I recently encountered this error much more frequently when the redis python library was accidentally downgraded from 4.1.4 to 3. Might be related.

nigel-gott commented 2 years ago

I have debugged this further and think it is caused by a race condition between schedulers.RedBeatScheduler.close and the celery beat signal handler. I stopped seeing this error when I killed my Redbeat using SIGQUIT instead of SIGTERM, which avoids calling the beat signal handler. Hopefully this works for others having this issue.

My best guess at what is going on is:

  1. Celery beat registers a signal handler for SIGTERM and SIGINT which calls service.sync(), which eventually calls schedulers.RedBeatScheduler.close. This happens on the main celery beat thread, as that is where signal handlers execute.
  2. On a separate thread, the beat.Service is perhaps running celery.beat.Service.start, which also crashes out at the same time as the above call to sync and ends up calling:
        finally:
            self.sync()
  3. Both threads get past the if self.lock guard in schedulers.RedBeatScheduler.close, as the redis self.lock.release call is possibly slow and blocking (I see two release calls in the logs, presumably from the two racing threads):
```
[BEAT_WORKER][2022-03-02 20:01:01] [2022-03-02 20:01:01,854: DEBUG/MainProcess] beat: Releasing Lock
[BEAT_WORKER][2022-03-02 20:01:01] [2022-03-02 20:01:01,854: DEBUG/MainProcess] beat: Releasing Lock
[BEAT_WORKER][2022-03-02 20:01:01] [2022-03-02 20:01:01,854: CRITICAL/MainProcess] beat raised exception <class 'redis.exceptions.LockError'>: LockError('Cannot release an unlocked lock')
[BEAT_WORKER][2022-03-02 20:01:01] Traceback (most recent call last):
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 649, in start
[BEAT_WORKER][2022-03-02 20:01:01]     time.sleep(interval)
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/celery/apps/beat.py", line 150, in _sync
[BEAT_WORKER][2022-03-02 20:01:01]     service.sync()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 658, in sync
[BEAT_WORKER][2022-03-02 20:01:01]     self.scheduler.close()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/redbeat/schedulers.py", line 482, in close
[BEAT_WORKER][2022-03-02 20:01:01]     self.lock.release()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/redis/lock.py", line 237, in release
[BEAT_WORKER][2022-03-02 20:01:01]     self.do_release(expected_token)
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/redis/lock.py", line 243, in do_release
[BEAT_WORKER][2022-03-02 20:01:01]     raise LockNotOwnedError("Cannot release a lock" " that's no longer owned")
[BEAT_WORKER][2022-03-02 20:01:01] redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned
[BEAT_WORKER][2022-03-02 20:01:01] 
[BEAT_WORKER][2022-03-02 20:01:01] During handling of the above exception, another exception occurred:
[BEAT_WORKER][2022-03-02 20:01:01] 
[BEAT_WORKER][2022-03-02 20:01:01] Traceback (most recent call last):
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/celery/apps/beat.py", line 105, in start_scheduler
[BEAT_WORKER][2022-03-02 20:01:01]     service.start()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 655, in start
[BEAT_WORKER][2022-03-02 20:01:01]     self.sync()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 658, in sync
[BEAT_WORKER][2022-03-02 20:01:01]     self.scheduler.close()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/redbeat/schedulers.py", line 482, in close
[BEAT_WORKER][2022-03-02 20:01:01]     self.lock.release()
[BEAT_WORKER][2022-03-02 20:01:01]   File "/baserow/venv/lib/python3.7/site-packages/redis/lock.py", line 235, in release
[BEAT_WORKER][2022-03-02 20:01:01]     raise LockError("Cannot release an unlocked lock")
[BEAT_WORKER][2022-03-02 20:01:01] redis.exceptions.LockError: Cannot release an unlocked lock
```
  4. One thread wins and releases the lock first. The other thread loses, tries to release the lock a second time, and we get the error.

By instead using SIGQUIT we ensure the SIGINT and SIGTERM celery beat signal handler doesn't get called and only one thread does the cleanup. Feels a bit hacky doing it this way but it works for me!
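
An alternative, untested sketch of a workaround (not something RedBeat itself provides) is to subclass the scheduler and make close() tolerant of the second release, ignoring the lock errors raised by redis-py:

```python
# Untested sketch of a shutdown-tolerant scheduler: close() swallows the
# duplicate-release error. LockNotOwnedError subclasses LockError in recent
# redis-py versions, but both are listed for clarity.
from redbeat import RedBeatScheduler
from redis.exceptions import LockError, LockNotOwnedError


class TolerantRedBeatScheduler(RedBeatScheduler):
    def close(self):
        try:
            super().close()
        except (LockError, LockNotOwnedError):
            # The lock was already released by the racing thread or has
            # expired; during shutdown this is harmless, so ignore it.
            pass
```

Beat would then be pointed at it with `-S yourpackage.schedulers.TolerantRedBeatScheduler` (the module path here is a placeholder).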

nicklyra commented 1 year ago

I have discovered you can also encounter this problem with a running celery beat job if you set celery.conf.beat_max_loop_interval to any value >= celery.conf.redbeat_lock_timeout. This is probably because the main loop wakes up only every beat_max_loop_interval seconds to extend the lock; if redbeat_lock_timeout is shorter than that interval (plus the time it takes to extend the lock), the lock times out before it can be extended.

To avoid this, it may help to have the main loop wake up every min(beat_max_loop_interval, redbeat_lock_timeout) seconds instead, or simply to be careful when choosing those values.
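
For illustration, a hedged configuration sketch that keeps a safe margin between the two settings (the numbers are made up, not values taken from this thread):

```python
# Illustrative values only: keep redbeat_lock_timeout several times larger
# than beat_max_loop_interval so that one slow or missed extension does not
# let the lock expire.
from celery import Celery

app = Celery('tasks')
app.conf.beat_max_loop_interval = 30  # beat wakes up (and extends the lock) every 30 s
app.conf.redbeat_lock_timeout = 150   # lock only expires after 150 s with no extension
```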