Closed charlesvallieres closed 5 years ago
Wow, great catch. This PR is great -- the only thing I'm not super comfortable with is not intentionally dealing with that thread when not leader (rather leaving it to Ruby to do so on exit). I think the locker needs something like an abort() mechanism, so it doesn't yield on with_lock() but allows that thread to be joined(). Do you think that would still manage to address the issue you're seeing? I can take a run at it in just a sec.
@mattyr I agree!
We can add a limit on the join as done in the Scheduler. Thing is, we already know that all runners that are not leaders will simply timeout and stop, so we're just wasting time.
I think ideally the code in Runner#stop shouldn't have a if
, I think those components should know if they run and stop properly whatever is their state, but this requires more refactoring.
Something like :
def stop(timeout = nil)
logger.debug('Stopping runner')
@scheduler.stop(timeout)
@locker.stop(timeout)
@thread.join if @thread
logger.debug('Stopped runner')
end
So when @locker.stop
is called it will release the mutex but not yield since it never acquired the lock. This way we don't need a public abort method.
We could also simply call terminate on the thread :
def stop(timeout = nil)
logger.debug('Stopping runner')
if @locker.locked?
@scheduler.stop(timeout)
@locker.stop(timeout)
elsif @thread
@thread.terminate
end
@thread.join if @thread
logger.debug('Stopped runner')
end
But I still think you should be able to call .stop
on any component whatever is its state :)
I like these ideas a lot, it was where my thoughts going with this as well. I think the refactoring is worth it, let me take a stab at it. (I prefer using the stops() rather than terminate() as well).
@charlesvallieres left a note on my PR but adding one here as well -- take a peek at #28 and let me know what you think.
Will close in favor of #28
We're using Sideckiq/Sidecloq on multiple servers and we discovered that the Sidekiq servers that were not leaders in Sidecloq would always fail to restart (stop) properly, systemd always have to send a SIGKILL after the initial SIGTERM.
Turns out I think there is a deadlock because of this line : https://github.com/mattyr/sidecloq/blob/e25c366dac3dc2bb097983ac63507ad62f2bb4ab/lib/sidecloq/runner.rb#L32 when using multiple runners at the same time.
The thread will wait forever for the lock (see : https://github.com/mattyr/sidecloq/blob/e25c366dac3dc2bb097983ac63507ad62f2bb4ab/lib/sidecloq/locker.rb#L22) but it might never get it.
So when a runner is not
locked?
(a.k.a. not leader) I think it shouldn't join on the thread and simply exit. I don't think it needs to wait for anything since it is not the leader anyways.I also created a test showing off the deadlock, I had to make some changes to DummyLocker so it works more like the real version.