Deadlock when stopping with multiple runners

mattyr / sidecloq

Recurring / Periodic / Scheduled / Cron job extension for Sidekiq

MIT License

88 stars 12 forks source link

Deadlock when stopping with multiple runners #27

Closed charlesvallieres closed 5 years ago

charlesvallieres commented 5 years ago

We're using Sideckiq/Sidecloq on multiple servers and we discovered that the Sidekiq servers that were not leaders in Sidecloq would always fail to restart (stop) properly, systemd always have to send a SIGKILL after the initial SIGTERM.

Turns out I think there is a deadlock because of this line : https://github.com/mattyr/sidecloq/blob/e25c366dac3dc2bb097983ac63507ad62f2bb4ab/lib/sidecloq/runner.rb#L32 when using multiple runners at the same time.

The thread will wait forever for the lock (see : https://github.com/mattyr/sidecloq/blob/e25c366dac3dc2bb097983ac63507ad62f2bb4ab/lib/sidecloq/locker.rb#L22) but it might never get it.

So when a runner is not locked? (a.k.a. not leader) I think it shouldn't join on the thread and simply exit. I don't think it needs to wait for anything since it is not the leader anyways.

I also created a test showing off the deadlock, I had to make some changes to DummyLocker so it works more like the real version.

mattyr commented 5 years ago

Wow, great catch. This PR is great -- the only thing I'm not super comfortable with is not intentionally dealing with that thread when not leader (rather leaving it to Ruby to do so on exit). I think the locker needs something like an abort() mechanism, so it doesn't yield on with_lock() but allows that thread to be joined(). Do you think that would still manage to address the issue you're seeing? I can take a run at it in just a sec.

charlesvallieres commented 5 years ago

@mattyr I agree!

We can add a limit on the join as done in the Scheduler. Thing is, we already know that all runners that are not leaders will simply timeout and stop, so we're just wasting time.

I think ideally the code in Runner#stop shouldn't have a if, I think those components should know if they run and stop properly whatever is their state, but this requires more refactoring.

Something like :

    def stop(timeout = nil)
      logger.debug('Stopping runner')
      @scheduler.stop(timeout)
      @locker.stop(timeout)
      @thread.join if @thread
      logger.debug('Stopped runner')
    end

So when @locker.stop is called it will release the mutex but not yield since it never acquired the lock. This way we don't need a public abort method.

charlesvallieres commented 5 years ago

We could also simply call terminate on the thread :

    def stop(timeout = nil)
      logger.debug('Stopping runner')
      if @locker.locked?
        @scheduler.stop(timeout)
        @locker.stop(timeout)
      elsif @thread
        @thread.terminate
      end
      @thread.join if @thread
      logger.debug('Stopped runner')
    end

But I still think you should be able to call .stop on any component whatever is its state :)

mattyr commented 5 years ago

I like these ideas a lot, it was where my thoughts going with this as well. I think the refactoring is worth it, let me take a stab at it. (I prefer using the stops() rather than terminate() as well).

mattyr commented 5 years ago

@charlesvallieres left a note on my PR but adding one here as well -- take a peek at #28 and let me know what you think.

charlesvallieres commented 5 years ago

Will close in favor of #28