Timeout cannot be called in a trap handler as of 0.3.0

gaffneyc commented 2 years ago

As of Timeout 0.3.0 you can no longer use Timeout in a signal trap handler due to Mutex#synchronize not being callable inside a trap context. I haven't done a git bisect but I believe this was broken in 5e0d8e1637afed1a2e912ad588c89127512b1c94 due to the addition of a mutex on @done.

This works fine with 0.2.0.

require "timeout"

rd, wr = IO.pipe

trap("SIGUSR1") do
  Timeout.timeout(1) do
  end

  # Close the pipe writer to unblock the main thread
  wr.close
end

# Send USR1 to the current process
Process.kill("USR1", Process.pid)

# Wait for the timeout in the signal handler
rd.read
rd.close

Exception

./ruby-3.1.2/gems/timeout-0.3.0/lib/timeout.rb:128:in `synchronize': can't be called from trap context (ThreadError)
    from ./ruby-3.1.2/gems/timeout-0.3.0/lib/timeout.rb:128:in `ensure_timeout_thread_created'
    from ./ruby-3.1.2/gems/timeout-0.3.0/lib/timeout.rb:171:in `timeout'
    from test.rb:6:in `block in <main>'
    from test.rb:12:in `kill'
    from test.rb:12:in `<main>'

hsbt commented 2 years ago

@eregon @headius ?

eregon commented 2 years ago

I don't think there is a way to fix this in this gem. The Mutex is needed in the code below, to ensure the task is either completed sucessfully or interrupted, but never both: https://github.com/ruby/timeout/blob/01554f1127786d85e1e435e08c4aeafe17b8f2e5/lib/timeout.rb#L83-L97 An example race without it is we risk in interrupt to read a stale @done (or read it just before it's set by finished) and could @thread.raise after the thread has completed the timeout block, which of course would be a very bad bug.

5e0d8e1637afed1a2e912ad588c89127512b1c94 just uses it early but wouldn't change anything, it's used anyway in finished.

It could be a simple AtomicBoolean instead of a Mutex (with CAS + read), but there is no such thing in Ruby core or stdlib (only in concurrent-ruby, but somehow AtomicBoolean doesn't seem to have a CAS, only AtomicReference).

There are also 2 other Mutex instances used in Timeout (TIMEOUT_THREAD_MUTEX, QUEUE_MUTEX).

So one thing is you could report a bug/issue to CRuby at https://bugs.ruby-lang.org/, because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev). Maybe there is an existing ticket about that.
The second is I think a common workaround on CRuby is to create a Thread inside of the trap handler to actually be able to run Ruby code without extra restrictions.
And third maybe it's possible to avoid using Timeout.timeout in trap in your case?

FWIW on CRuby, both Mutex#synchronize and Monitor#synchronize can't be called from trap context (ThreadError), but Queue#push/pop is allowed (doesn't help us though):

$ ruby -e 'm=Mutex.new; trap(:INT) { m.synchronize{}; p :OK; exit }; p 1; sleep'
1
^C-e:1:in `synchronize': can't be called from trap context (ThreadError)
$ ruby -rmonitor -e 'm=Monitor.new; trap(:INT) { m.synchronize {}; p :OK; exit }; sleep'
^C-e:1:in `synchronize': can't be called from trap context (ThreadError)
$ ruby -e 'q=Queue.new; trap(:INT) { q.push 2; p q.pop; exit }; sleep'
^C2

gaffneyc commented 2 years ago

We're using Timeout.timeout in our trap handler to manage graceful shutdown of child processes. I'm sure we could rewrite our code to avoid the issue but it still feels like a regression.

eregon commented 2 years ago

It's only a regression on CRuby, so please file an issue to https://bugs.ruby-lang.org/. The change itself (#15) avoids creating one thread per call and has huge performance advantages so we will not revert it just for this.

eregon commented 2 years ago

Already said above, but to be clear there is no need to rewrite the code. The second way to create a new Thread in trap should work fine and require very little changes. Arguably maybe Kernel#trap should create a Thread on its own, much like the JVM does, but that's again something to discuss on the CRuby tracker, not here.

gaffneyc commented 2 years ago

Just to summarize the discussion as I understand it.

The new implementation introduced in #15 uses fewer resources and performs better than the previous version (which is awesome) but requires synchronization to work. There is no way to address the regression because Ruby doesn't have an API for atomic operations (read, write, cas, etc...). A more general solution is to ask the CRuby team to make CRuby act the same way as JRuby.

We'll plan to rework our code to pull the Timeout out of the trap handler and into the main thread instead.

eregon commented 2 years ago

We'll plan to rework our code to pull the Timeout out of the trap handler and into the main thread instead.

That makes sense, such cleanup should probably be part of at_exit or an explicit ensure.

headius commented 2 years ago

If you can't synchronize in the trap, perhaps it should just use the old logic that spins up an interrupt thread. Obviously that worked fine before. The new logic is much more efficient, but has introduced this and other issues (#21) so I think we need to provide a multi-faceted solution.

eregon commented 1 year ago

So one thing is you could report a bug/issue to CRuby at https://bugs.ruby-lang.org/, because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev). Maybe there is an existing ticket about that.

CRuby issue: https://bugs.ruby-lang.org/issues/19473

mame commented 1 year ago

because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev).

Because there is no such restriction, there is a low-reproducibility race condition bug in timeout 0.3.2 with JRuby/TruffleRuby. I confirmed it by the following procedure.

1) Add sleep 10 to timeout as follows (to make it easier to reproduce)

130   def self.ensure_timeout_thread_created
131     unless @timeout_thread and @timeout_thread.alive?
132       TIMEOUT_THREAD_MUTEX.synchronize do
133         sleep 10
134         unless @timeout_thread and @timeout_thread.alive?
135           @timeout_thread = create_timeout_thread
136         end
137       end
138     end
139   end

2) Run the next code with TruffleRuby

require "timeout"

trap(:INT) do
  Timeout.timeout(1) do
    p :ok
  end
end

Timeout.timeout(1) do
  sleep
end

3) Press Ctrl+C within 10 seconds

$ ruby test.rb
^C/home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `synchronize': deadlock; recursive locking (ThreadError)
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `ensure_timeout_thread_created'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:182:in `timeout'
        from test.rb:4:in `block in <main>'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:133:in `sleep'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:133:in `block in ensure_timeout_thread_created'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `synchronize'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `ensure_timeout_thread_created'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:182:in `timeout'
        from test.rb:8:in `<main>'

CRuby prohibits mutex lock in traps to prevent such a race condition issue.

@eregon Do you think who should address this issue? timeout? Or end users? And how? Cc/ @headius

eregon commented 1 year ago

@mame I think that can be addressed by return if TIMEOUT_THREAD_MUTEX.owned? before TIMEOUT_THREAD_MUTEX.synchronize do (detect reentrant). Or also Thread.handle_interrupt(Object => :never) do around the TIMEOUT_THREAD_MUTEX.synchronize do (prevent reentrant). And then AFAIK it works reliably on JRuby and TruffleRuby, but not on CRuby due to that too strict limitation (https://bugs.ruby-lang.org/issues/19473).

A Monitor instead of Mutex here wouldn't fully work because then there could be two threads created (if the trap handlers runs between unless @timeout_thread and @timeout_thread.alive? and @timeout_thread = create_timeout_thread).

We would need a AtomicReference or AtomicBoolean here ideally, but Ruby core provides no such thing. That could be enough on its own in this very specific case, because if two threads get into ensure_timeout_thread_created the second one doesn't need to wait until the timeout thread is created, it just needs to know the thread will be created. It feels brittle though compared to waiting (e.g. if the Thread fails to be created, only one of the two threads will get an exception, that's suboptimal). We could also maybe use Mutex#try_lock as a workaround for the CRuby too strict limitation, but that means busy-waiting if we want to wait the thread is created. That doesn't seem better, just working around this arbitrary CRuby limitation.

eregon commented 1 year ago

@mame I think that can be addressed by return if TIMEOUT_THREAD_MUTEX.owned? before TIMEOUT_THREAD_MUTEX.synchronize do (detect reentrant).

This doesn't work on CRuby for the example in the description, due to the too strict limitation. So I believe that proves the CRuby restriction is wrong, because AFAIK the behavior would be fully correct with that:

If the main thread holds the Mutex when the trap handler is called, it will be seen as owned? and let the main thread create the timeout thread (which might happen after the trap handler has returned, that's fine).
If the main thread doesn't hold the Mutex when the trap handler is called, then the trap handler takes the Mutex, creates the Timeout thread, returns, and then the main thread takes the Mutex, sees the timeout thread is already created and we are fine.

eregon commented 1 year ago

try_lock could maybe be used for TIMEOUT_THREAD_MUTEX although it makes it much harder to reason about and it's incorrect e.g. it will not raise on all threads calling ensure_timeout_thread_created if Thread.new{} fails (e.g., no memory or not allowed to create threads, etc).
But then we run in the same limitation on CRuby with the QUEUE_MUTEX. This could be removed on CRuby 3.2+ which has Queue#pop(timeout:), but not on other versions.
And there is one more Mutex, on the Request. That used to just use try_lock and leave the Mutex locked when the Request is done, as a sort of AtomicBoolean, but that doesn't work because CRuby/Ruby releases all Mutex held by a Thread and so that leaks many Mutex until the Thread finishes, and then it takes a long time to release all of them. Maybe we could track the reversed state, i.e. locked initially, unlocked when done, that should avoid this leak/release all problem.

ruby / timeout

Timeout cannot be called in a trap handler as of 0.3.0 #17