ruby / timeout

Timeout provides a way to auto-terminate a potentially long-running operation if it hasn't finished in a fixed amount of time.
Other
146 stars 23 forks source link

Timeout cannot be called in a trap handler as of 0.3.0 #17

Open gaffneyc opened 2 years ago

gaffneyc commented 2 years ago

As of Timeout 0.3.0 you can no longer use Timeout in a signal trap handler due to Mutex#synchronize not being callable inside a trap context. I haven't done a git bisect but I believe this was broken in 5e0d8e1637afed1a2e912ad588c89127512b1c94 due to the addition of a mutex on @done.

This works fine with 0.2.0.

require "timeout"

rd, wr = IO.pipe

trap("SIGUSR1") do
  Timeout.timeout(1) do
  end

  # Close the pipe writer to unblock the main thread
  wr.close
end

# Send USR1 to the current process
Process.kill("USR1", Process.pid)

# Wait for the timeout in the signal handler
rd.read
rd.close

Exception

./ruby-3.1.2/gems/timeout-0.3.0/lib/timeout.rb:128:in `synchronize': can't be called from trap context (ThreadError)
    from ./ruby-3.1.2/gems/timeout-0.3.0/lib/timeout.rb:128:in `ensure_timeout_thread_created'
    from ./ruby-3.1.2/gems/timeout-0.3.0/lib/timeout.rb:171:in `timeout'
    from test.rb:6:in `block in <main>'
    from test.rb:12:in `kill'
    from test.rb:12:in `<main>'
hsbt commented 2 years ago

@eregon @headius ?

eregon commented 2 years ago

I don't think there is a way to fix this in this gem. The Mutex is needed in the code below, to ensure the task is either completed sucessfully or interrupted, but never both: https://github.com/ruby/timeout/blob/01554f1127786d85e1e435e08c4aeafe17b8f2e5/lib/timeout.rb#L83-L97 An example race without it is we risk in interrupt to read a stale @done (or read it just before it's set by finished) and could @thread.raise after the thread has completed the timeout block, which of course would be a very bad bug.

5e0d8e1637afed1a2e912ad588c89127512b1c94 just uses it early but wouldn't change anything, it's used anyway in finished.

It could be a simple AtomicBoolean instead of a Mutex (with CAS + read), but there is no such thing in Ruby core or stdlib (only in concurrent-ruby, but somehow AtomicBoolean doesn't seem to have a CAS, only AtomicReference).

There are also 2 other Mutex instances used in Timeout (TIMEOUT_THREAD_MUTEX, QUEUE_MUTEX).


FWIW on CRuby, both Mutex#synchronize and Monitor#synchronize can't be called from trap context (ThreadError), but Queue#push/pop is allowed (doesn't help us though):

$ ruby -e 'm=Mutex.new; trap(:INT) { m.synchronize{}; p :OK; exit }; p 1; sleep'
1
^C-e:1:in `synchronize': can't be called from trap context (ThreadError)
$ ruby -rmonitor -e 'm=Monitor.new; trap(:INT) { m.synchronize {}; p :OK; exit }; sleep'
^C-e:1:in `synchronize': can't be called from trap context (ThreadError)
$ ruby -e 'q=Queue.new; trap(:INT) { q.push 2; p q.pop; exit }; sleep'
^C2
gaffneyc commented 2 years ago

We're using Timeout.timeout in our trap handler to manage graceful shutdown of child processes. I'm sure we could rewrite our code to avoid the issue but it still feels like a regression.

eregon commented 2 years ago

It's only a regression on CRuby, so please file an issue to https://bugs.ruby-lang.org/. The change itself (#15) avoids creating one thread per call and has huge performance advantages so we will not revert it just for this.

eregon commented 2 years ago

Already said above, but to be clear there is no need to rewrite the code. The second way to create a new Thread in trap should work fine and require very little changes. Arguably maybe Kernel#trap should create a Thread on its own, much like the JVM does, but that's again something to discuss on the CRuby tracker, not here.

gaffneyc commented 2 years ago

Just to summarize the discussion as I understand it.

The new implementation introduced in #15 uses fewer resources and performs better than the previous version (which is awesome) but requires synchronization to work. There is no way to address the regression because Ruby doesn't have an API for atomic operations (read, write, cas, etc...). A more general solution is to ask the CRuby team to make CRuby act the same way as JRuby.

We'll plan to rework our code to pull the Timeout out of the trap handler and into the main thread instead.

eregon commented 2 years ago

We'll plan to rework our code to pull the Timeout out of the trap handler and into the main thread instead.

That makes sense, such cleanup should probably be part of at_exit or an explicit ensure.

headius commented 2 years ago

If you can't synchronize in the trap, perhaps it should just use the old logic that spins up an interrupt thread. Obviously that worked fine before. The new logic is much more efficient, but has introduced this and other issues (#21) so I think we need to provide a multi-faceted solution.

eregon commented 1 year ago

So one thing is you could report a bug/issue to CRuby at https://bugs.ruby-lang.org/, because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev). Maybe there is an existing ticket about that.

CRuby issue: https://bugs.ruby-lang.org/issues/19473

mame commented 1 year ago

because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev).

Because there is no such restriction, there is a low-reproducibility race condition bug in timeout 0.3.2 with JRuby/TruffleRuby. I confirmed it by the following procedure.

1) Add sleep 10 to timeout as follows (to make it easier to reproduce)

130   def self.ensure_timeout_thread_created
131     unless @timeout_thread and @timeout_thread.alive?
132       TIMEOUT_THREAD_MUTEX.synchronize do
133         sleep 10
134         unless @timeout_thread and @timeout_thread.alive?
135           @timeout_thread = create_timeout_thread
136         end
137       end
138     end
139   end

2) Run the next code with TruffleRuby

require "timeout"

trap(:INT) do
  Timeout.timeout(1) do
    p :ok
  end
end

Timeout.timeout(1) do
  sleep
end

3) Press Ctrl+C within 10 seconds

$ ruby test.rb
^C/home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `synchronize': deadlock; recursive locking (ThreadError)
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `ensure_timeout_thread_created'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:182:in `timeout'
        from test.rb:4:in `block in <main>'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:133:in `sleep'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:133:in `block in ensure_timeout_thread_created'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `synchronize'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `ensure_timeout_thread_created'
        from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:182:in `timeout'
        from test.rb:8:in `<main>'

CRuby prohibits mutex lock in traps to prevent such a race condition issue.

@eregon Do you think who should address this issue? timeout? Or end users? And how? Cc/ @headius

eregon commented 1 year ago

@mame I think that can be addressed by return if TIMEOUT_THREAD_MUTEX.owned? before TIMEOUT_THREAD_MUTEX.synchronize do (detect reentrant). Or also Thread.handle_interrupt(Object => :never) do around the TIMEOUT_THREAD_MUTEX.synchronize do (prevent reentrant). And then AFAIK it works reliably on JRuby and TruffleRuby, but not on CRuby due to that too strict limitation (https://bugs.ruby-lang.org/issues/19473).

A Monitor instead of Mutex here wouldn't fully work because then there could be two threads created (if the trap handlers runs between unless @timeout_thread and @timeout_thread.alive? and @timeout_thread = create_timeout_thread).

We would need a AtomicReference or AtomicBoolean here ideally, but Ruby core provides no such thing. That could be enough on its own in this very specific case, because if two threads get into ensure_timeout_thread_created the second one doesn't need to wait until the timeout thread is created, it just needs to know the thread will be created. It feels brittle though compared to waiting (e.g. if the Thread fails to be created, only one of the two threads will get an exception, that's suboptimal). We could also maybe use Mutex#try_lock as a workaround for the CRuby too strict limitation, but that means busy-waiting if we want to wait the thread is created. That doesn't seem better, just working around this arbitrary CRuby limitation.

eregon commented 1 year ago

@mame I think that can be addressed by return if TIMEOUT_THREAD_MUTEX.owned? before TIMEOUT_THREAD_MUTEX.synchronize do (detect reentrant).

This doesn't work on CRuby for the example in the description, due to the too strict limitation. So I believe that proves the CRuby restriction is wrong, because AFAIK the behavior would be fully correct with that:

eregon commented 1 year ago