Open gaffneyc opened 2 years ago
@eregon @headius ?
I don't think there is a way to fix this in this gem.
The Mutex is needed in the code below, to ensure the task is either completed sucessfully or interrupted, but never both:
https://github.com/ruby/timeout/blob/01554f1127786d85e1e435e08c4aeafe17b8f2e5/lib/timeout.rb#L83-L97
An example race without it is we risk in interrupt
to read a stale @done
(or read it just before it's set by finished
) and could @thread.raise
after the thread has completed the timeout
block, which of course would be a very bad bug.
5e0d8e1637afed1a2e912ad588c89127512b1c94 just uses it early but wouldn't change anything, it's used anyway in finished
.
It could be a simple AtomicBoolean instead of a Mutex (with CAS + read), but there is no such thing in Ruby core or stdlib (only in concurrent-ruby, but somehow AtomicBoolean doesn't seem to have a CAS, only AtomicReference).
There are also 2 other Mutex instances used in Timeout (TIMEOUT_THREAD_MUTEX, QUEUE_MUTEX).
trap
in your case?FWIW on CRuby, both Mutex#synchronize and Monitor#synchronize can't be called from trap context (ThreadError)
, but Queue#push/pop is allowed (doesn't help us though):
$ ruby -e 'm=Mutex.new; trap(:INT) { m.synchronize{}; p :OK; exit }; p 1; sleep'
1
^C-e:1:in `synchronize': can't be called from trap context (ThreadError)
$ ruby -rmonitor -e 'm=Monitor.new; trap(:INT) { m.synchronize {}; p :OK; exit }; sleep'
^C-e:1:in `synchronize': can't be called from trap context (ThreadError)
$ ruby -e 'q=Queue.new; trap(:INT) { q.push 2; p q.pop; exit }; sleep'
^C2
We're using Timeout.timeout
in our trap
handler to manage graceful shutdown of child processes. I'm sure we could rewrite our code to avoid the issue but it still feels like a regression.
It's only a regression on CRuby, so please file an issue to https://bugs.ruby-lang.org/. The change itself (#15) avoids creating one thread per call and has huge performance advantages so we will not revert it just for this.
Already said above, but to be clear there is no need to rewrite the code.
The second way to create a new Thread in trap
should work fine and require very little changes.
Arguably maybe Kernel#trap should create a Thread on its own, much like the JVM does, but that's again something to discuss on the CRuby tracker, not here.
Just to summarize the discussion as I understand it.
The new implementation introduced in #15 uses fewer resources and performs better than the previous version (which is awesome) but requires synchronization to work. There is no way to address the regression because Ruby doesn't have an API for atomic operations (read, write, cas, etc...). A more general solution is to ask the CRuby team to make CRuby act the same way as JRuby.
We'll plan to rework our code to pull the Timeout out of the trap handler and into the main thread instead.
We'll plan to rework our code to pull the Timeout out of the trap handler and into the main thread instead.
That makes sense, such cleanup should probably be part of at_exit
or an explicit ensure
.
If you can't synchronize in the trap, perhaps it should just use the old logic that spins up an interrupt thread. Obviously that worked fine before. The new logic is much more efficient, but has introduced this and other issues (#21) so I think we need to provide a multi-faceted solution.
So one thing is you could report a bug/issue to CRuby at https://bugs.ruby-lang.org/, because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev). Maybe there is an existing ticket about that.
CRuby issue: https://bugs.ruby-lang.org/issues/19473
because there is no such restriction to use Mutex in trap for JRuby/TruffleRuby (checked on truffleruby-dev and jruby-dev).
Because there is no such restriction, there is a low-reproducibility race condition bug in timeout 0.3.2 with JRuby/TruffleRuby. I confirmed it by the following procedure.
1) Add sleep 10
to timeout as follows (to make it easier to reproduce)
130 def self.ensure_timeout_thread_created
131 unless @timeout_thread and @timeout_thread.alive?
132 TIMEOUT_THREAD_MUTEX.synchronize do
133 sleep 10
134 unless @timeout_thread and @timeout_thread.alive?
135 @timeout_thread = create_timeout_thread
136 end
137 end
138 end
139 end
2) Run the next code with TruffleRuby
require "timeout"
trap(:INT) do
Timeout.timeout(1) do
p :ok
end
end
Timeout.timeout(1) do
sleep
end
3) Press Ctrl+C within 10 seconds
$ ruby test.rb
^C/home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `synchronize': deadlock; recursive locking (ThreadError)
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `ensure_timeout_thread_created'
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:182:in `timeout'
from test.rb:4:in `block in <main>'
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:133:in `sleep'
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:133:in `block in ensure_timeout_thread_created'
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `synchronize'
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:132:in `ensure_timeout_thread_created'
from /home/mame/.rbenv/versions/truffleruby+graalvm-22.3.1/graalvm/languages/ruby/lib/gems/gems/timeout-0.3.2/lib/timeout.rb:182:in `timeout'
from test.rb:8:in `<main>'
CRuby prohibits mutex lock in traps to prevent such a race condition issue.
@eregon Do you think who should address this issue? timeout? Or end users? And how? Cc/ @headius
@mame I think that can be addressed by return if TIMEOUT_THREAD_MUTEX.owned?
before TIMEOUT_THREAD_MUTEX.synchronize do
(detect reentrant).
Or also Thread.handle_interrupt(Object => :never) do
around the TIMEOUT_THREAD_MUTEX.synchronize do
(prevent reentrant).
And then AFAIK it works reliably on JRuby and TruffleRuby, but not on CRuby due to that too strict limitation (https://bugs.ruby-lang.org/issues/19473).
A Monitor instead of Mutex here wouldn't fully work because then there could be two threads created (if the trap handlers runs between unless @timeout_thread and @timeout_thread.alive?
and @timeout_thread = create_timeout_thread
).
We would need a AtomicReference or AtomicBoolean here ideally, but Ruby core provides no such thing. That could be enough on its own in this very specific case, because if two threads get into ensure_timeout_thread_created
the second one doesn't need to wait until the timeout thread is created, it just needs to know the thread will be created. It feels brittle though compared to waiting (e.g. if the Thread fails to be created, only one of the two threads will get an exception, that's suboptimal).
We could also maybe use Mutex#try_lock
as a workaround for the CRuby too strict limitation, but that means busy-waiting if we want to wait the thread is created. That doesn't seem better, just working around this arbitrary CRuby limitation.
@mame I think that can be addressed by
return if TIMEOUT_THREAD_MUTEX.owned?
beforeTIMEOUT_THREAD_MUTEX.synchronize do
(detect reentrant).
This doesn't work on CRuby for the example in the description, due to the too strict limitation. So I believe that proves the CRuby restriction is wrong, because AFAIK the behavior would be fully correct with that:
owned?
and let the main thread create the timeout thread (which might happen after the trap handler has returned, that's fine).try_lock
could maybe be used for TIMEOUT_THREAD_MUTEX
although it makes it much harder to reason about and it's incorrect e.g. it will not raise on all threads calling ensure_timeout_thread_created
if Thread.new{}
fails (e.g., no memory or not allowed to create threads, etc).Queue#pop(timeout:)
, but not on other versions.Request
. That used to just use try_lock
and leave the Mutex locked when the Request is done, as a sort of AtomicBoolean, but that doesn't work because CRuby/Ruby releases all Mutex held by a Thread and so that leaks many Mutex until the Thread finishes, and then it takes a long time to release all of them.
Maybe we could track the reversed state, i.e. locked initially, unlocked when done, that should avoid this leak/release all problem.
As of Timeout 0.3.0 you can no longer use
Timeout
in a signal trap handler due toMutex#synchronize
not being callable inside a trap context. I haven't done a git bisect but I believe this was broken in 5e0d8e1637afed1a2e912ad588c89127512b1c94 due to the addition of a mutex on@done
.This works fine with
0.2.0
.Exception