davidjmemmett opened this issue 5 years ago (status: Open)
It seems the lock has expired; redbeat calls PEXPIRE on the lock key every tick.
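If it helps to rule the lock itself in or out, you can watch its TTL directly in Redis. A minimal sketch with redis-py, assuming the default lock key redbeat::lock (it is derived from redbeat_key_prefix, so adjust it if you changed the prefix) and a locally reachable Redis:

```python
# Minimal sketch: watch the RedBeat lock TTL to see whether it is still being
# extended on every tick. Assumes the default lock key 'redbeat::lock' and a
# locally reachable Redis; adjust both if your configuration differs.
import time

import redis

r = redis.Redis(host="localhost", port=6379)

while True:
    ttl_ms = r.pttl("redbeat::lock")  # -2 means the key (and thus the lock) is gone
    print(f"redbeat::lock PTTL: {ttl_ms} ms")
    time.sleep(5)
```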
Can you show us your redbeat_lock_timeout and beat_max_loop_interval?
They haven't been set, so I assume they're whatever the defaults are.
+1
I'm experiencing probably the same or similar behaviour. The setup is as follows: I have 3 instances of Dockerized Redbeat (version 0.13.0) running on 3 different hosts. Celery (version 4.3.0) logs to a file at INFO level, and all the logs are shipped to ELK (via Filebeat). In the configuration (redbeat_redis_url), the Redis instance is defined using a hostname. Also, redbeat_lock_timeout is set to 60 seconds and beat_max_loop_interval to 10 seconds.
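For completeness, a minimal sketch of that configuration (the app name and Redis hostnames are placeholders, not the real values):

```python
# Sketch of the setup described above; app name and Redis hostnames are
# placeholders. The RedBeat scheduler itself is selected on the command line,
# e.g. celery -A myapp beat -S redbeat.RedBeatScheduler -l INFO
from celery import Celery

app = Celery("myapp", broker="redis://redis-host:6379/0")

app.conf.redbeat_redis_url = "redis://redis-host:6379/1"  # Redis defined via hostname
app.conf.redbeat_lock_timeout = 60     # seconds
app.conf.beat_max_loop_interval = 10   # seconds
```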
Under normal circumstances, everything works fine. Rarely, though, there is a DNS resolution issue, which leaves the Redbeat instance currently holding the lock unable to communicate with Redis. However, this doesn't stop that instance from sending tasks... In the meantime, one of the other two instances acquires the lock as well, which ultimately leads to two Celery beats running at the same time and both sending the tasks.
Here's what I observe. On the standard output of the Redbeat instance which experiences the DNS issue, I see these messages (using docker logs):
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 601, in start
time.sleep(interval)
File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 154, in _sync
service.sync()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
self.scheduler.close()
File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
self.lock.release()
File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 226, in release
self.do_release(expected_token)
File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 232, in do_release
raise LockNotOwnedError("Cannot release a lock"
redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/celery", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/celery/__main__.py", line 16, in main
_main()
File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 322, in main
cmd.execute_from_commandline(argv)
File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 496, in execute_from_commandline
super(CeleryCommand, self).execute_from_commandline(argv)))
File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 298, in execute_from_commandline
return self.handle_argv(self.prog_name, argv[1:])
File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 488, in handle_argv
return self.execute(command, argv)
File "/usr/local/lib/python3.6/site-packages/celery/bin/celery.py", line 420, in execute
).run_from_argv(self.prog_name, argv[1:], command=argv[0])
File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 302, in run_from_argv
sys.argv if argv is None else argv, command)
File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 386, in handle_argv
return self(*args, **options)
File "/usr/local/lib/python3.6/site-packages/celery/bin/base.py", line 252, in __call__
ret = self.run(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/celery/bin/beat.py", line 109, in run
return beat().run()
File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 81, in run
self.start_scheduler()
File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 109, in start_scheduler
service.start()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 607, in start
self.sync()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
self.scheduler.close()
File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
self.lock.release()
File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 224, in release
raise LockError("Cannot release an unlocked lock")
redis.exceptions.LockError: Cannot release an unlocked lock
Then, when restarting this Redbeat instance (which solves the problem), I see this message in the logs (in ELK/Kibana):
[2019-10-03 15:28:46,137: CRITICAL/MainProcess] beat raised exception <class 'redis.exceptions.LockError'>: LockError('Cannot release an unlocked lock',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 601, in start
time.sleep(interval)
File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 154, in _sync
service.sync()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
self.scheduler.close()
File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
self.lock.release()
File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 226, in release
self.do_release(expected_token)
File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 232, in do_release
raise LockNotOwnedError("Cannot release a lock"
redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/apps/beat.py", line 109, in start_scheduler
service.start()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 607, in start
self.sync()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 610, in sync
self.scheduler.close()
File "/usr/local/lib/python3.6/site-packages/redbeat/schedulers.py", line 453, in close
self.lock.release()
File "/usr/local/lib/python3.6/site-packages/redis/lock.py", line 224, in release
raise LockError("Cannot release an unlocked lock")
redis.exceptions.LockError: Cannot release an unlocked lock
I haven't looked into the code yet, but maybe someone can guide me in the meantime. Where's the problem? Why is the original Redbeat instance still sending the tasks when it's unable to reach the Redis instance?
EDIT: It's quite easy to simulate the incorrect behavior where two instances think they both hold the lock:
1. Run two beat instances (beat1 and beat2). Only one of them (let's say beat1) acquires the lock, while the other one (beat2) hangs on:
[2019-10-03 17:24:08,565: DEBUG/MainProcess] beat: Acquiring lock...
2. Disconnect beat1 from the network using docker network disconnect bridge beat1. It will start hanging on:
[2019-10-03 18:00:40,848: DEBUG/MainProcess] beat: Extending lock...
3. beat2 takes over the work, starting by logging:
[2019-10-03 17:33:19,090: DEBUG/MainProcess] beat: Extending lock...
[2019-10-03 17:33:19,093: DEBUG/MainProcess] Selecting tasks
...
4. Reconnect beat1 to the network using docker network connect bridge beat1. beat1 starts to load tasks as if it were holding the lock, while logging:
[2019-10-03 18:04:14,795: DEBUG/MainProcess] Selecting tasks
[2019-10-03 18:04:14,797: INFO/MainProcess] Loading 0 tasks
[2019-10-03 18:04:14,798: DEBUG/MainProcess] beat: Waking up in 10.00 seconds.
[2019-10-03 18:04:24,805: DEBUG/MainProcess] beat: Synchronizing schedule...
[2019-10-03 18:04:24,806: DEBUG/MainProcess] beat: Extending lock...
[2019-10-03 18:04:24,808: DEBUG/MainProcess] Selecting tasks
...
This is a different kind of problem, I guess, so maybe I should open a new issue?
EDIT 2: Tested from the master branch and it works. More precisely, it doesn't fall into the incorrect state: while extending the lock on the instance that has just been reconnected to the network, it raises an exception saying it can't extend a lock it doesn't own. This works in my setup since I run it inside a Docker container with --restart=always. When the Docker container restarts, the instance hangs at Acquiring lock..., which is the correct behaviour in the end.
Bump. I am facing this issue and it oddly seems to be leading to loss of data on my single-instance Redis server. Has there been any headway on this?
I'm getting the same error but when my redis server restarts. I suppose that's a different problem, right? Should I open another issue?
I recently encountered this error much more frequently when the redis python library was accidentally downgraded from 4.1.4 to 3. Might be related.
I have debugged this further and think it is caused by a race condition between schedulers.RedBeatScheduler.close and the celery beat signal handler. I stopped seeing this error when I killed my Redbeat using SIGQUIT instead of SIGTERM, which avoids calling the beat signal handler. Hopefully this works for others having this issue.
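For reference, this is roughly how I send the signal from a deployment script; the pidfile path and helper are just an assumption about how you run beat (in Docker you can get the same effect by using QUIT as the stop signal):

```python
# Hypothetical helper, not part of celery or redbeat: stop a local celery beat
# process with SIGQUIT instead of SIGTERM, so the SIGTERM/SIGINT beat handler
# (and its extra sync()/close()) never runs. The pidfile path is an assumption;
# use whatever --pidfile you start beat with.
import os
import signal


def stop_beat_with_sigquit(pidfile="/var/run/celerybeat.pid"):
    with open(pidfile) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGQUIT)
```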
My best guess at what is going on is:
1. SIGTERM or SIGINT is received, which calls service.sync(), which eventually calls schedulers.RedBeatScheduler.close. This happens on the main celery beat thread, as that is where signal handlers execute.
2. celery.beat.Service.start also crashes out at the same time as the above call to sync and ends up calling:
   finally:
       self.sync()
3. Both get past the if self.lock guard in schedulers.RedBeatScheduler.close, as the redis self.lock.release call is possibly slow and blocking (I see two release calls in the logs, presumably from the two racing threads):
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] [2022-03-02 20:01:01,854: DEBUG/MainProcess] beat: Releasing Lock
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] [2022-03-02 20:01:01,854: DEBUG/MainProcess] beat: Releasing Lock
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] [2022-03-02 20:01:01,854: CRITICAL/MainProcess] beat raised exception <class 'redis.exceptions.LockError'>: LockError('Cannot release an unlocked lock')
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] Traceback (most recent call last):
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 649, in start
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] time.sleep(interval)
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/celery/apps/beat.py", line 150, in _sync
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] service.sync()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 658, in sync
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] self.scheduler.close()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/redbeat/schedulers.py", line 482, in close
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] self.lock.release()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/redis/lock.py", line 237, in release
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] self.do_release(expected_token)
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/redis/lock.py", line 243, in do_release
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] raise LockNotOwnedError("Cannot release a lock" " that's no longer owned")
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01]
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] During handling of the above exception, another exception occurred:
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01]
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] Traceback (most recent call last):
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/celery/apps/beat.py", line 105, in start_scheduler
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] service.start()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 655, in start
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] self.sync()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/celery/beat.py", line 658, in sync
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] self.scheduler.close()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/redbeat/schedulers.py", line 482, in close
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] self.lock.release()
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] File "/baserow/venv/lib/python3.7/site-packages/redis/lock.py", line 235, in release
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] raise LockError("Cannot release an unlocked lock")
[BEAT_WORKER][2022-03-02 20:01:01] [BEAT_WORKER][2022-03-02 20:01:01] redis.exceptions.LockError: Cannot release an unlocked lock
By instead using SIGQUIT, we ensure the SIGINT/SIGTERM celery beat signal handler doesn't get called and only one thread does the cleanup. It feels a bit hacky doing it this way, but it works for me!
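If patching locally is an option, another way to sidestep the race is to make the release step tolerant of a lock that is already gone. A minimal sketch of that idea (an assumed wrapper, not the actual redbeat code):

```python
# Sketch only, not the actual redbeat implementation: clear the reference first
# so a racing second close() sees nothing to release, then swallow the error
# redis-py raises when the lock was already released or its TTL expired.
from redis.exceptions import LockError  # LockNotOwnedError is a subclass of LockError


def release_lock_safely(scheduler):
    lock = getattr(scheduler, "lock", None)
    if lock is None:
        return
    scheduler.lock = None
    try:
        lock.release()
    except LockError:
        pass  # already released elsewhere, or expired; nothing left to do
```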
I have discovered you can also encounter this problem with a running celery beat job if you set celery.conf.beat_max_loop_interval to any value >= celery.conf.redbeat_lock_timeout. This is probably because the main loop wakes up only every beat_max_loop_interval seconds to extend the lock, but if redbeat_lock_timeout is shorter than that value (plus the time it takes to extend the lock), the lock times out before it can be extended.
To avoid this, it may help to have the main loop wake up every min(beat_max_loop_interval, redbeat_lock_timeout) seconds instead, or simply to be careful when setting those values.
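A cheap guard in the meantime is to sanity-check the two settings against each other at startup; a sketch, assuming both are set explicitly on the Celery app config:

```python
# Sketch: fail fast on the misconfiguration described above, where the main
# loop can sleep longer than the lock lives. Assumes both settings are set
# explicitly on the Celery app config.
def check_beat_lock_settings(conf):
    loop_interval = conf.beat_max_loop_interval
    lock_timeout = conf.redbeat_lock_timeout
    if loop_interval >= lock_timeout:
        raise ValueError(
            f"beat_max_loop_interval ({loop_interval}s) must be well below "
            f"redbeat_lock_timeout ({lock_timeout}s), otherwise the lock can "
            "expire before the loop wakes up to extend it."
        )

# e.g. check_beat_lock_settings(app.conf) right after configuring the app
```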
During shutdown of celery beat when using the redbeat scheduler, we have seen the following exceptions: