Closed itarato closed 9 months ago
Thanks for the report.
So this message comes from https://github.com/oracle/truffleruby/blob/a3d954350c96117cce3de561f2ce8fd0ebc33b39/src/main/java/org/truffleruby/core/thread/ThreadNodes.java#L694-L723
We need to execute the unblock function given to rb_thread_call_without_gvl()
, so the thread inside the blocking function of rb_thread_call_without_gvl()
can get out of it (e.g. out of a blocking syscall) and reach a Truffle safepoint, and then start its operation again.
What changed here is the CPUSampler now uses a system thread vs a embedder thread before, and system threads don't allow entering the context.
I don't know how to fix this, we need to execute the blocking function on some thread and that requires entering the context:
rb_thread_call_without_gvl()
. And we could have recursion problems if we use a ThreadLocalAction inside an Interrupter#interrupt.rb_thread_call_without_gvl()
, that's the one stuck until we run the unblock function.rb_thread_call_without_gvl()
.rb_thread_call_without_gvl()
does require at least 2 parallel threads anyway (the thread blocked in it, the thread to run the unblock function), so maybe that thread could be created lazily the first time it is needed and not actually break polyglot. Also that thread will appear e.g. in Ruby's Thread.list which might be unexpected for user code.We already have the Timer nativeInterruptTimer
to send SIGVTALRM to interrupt NFI blocking syscalls. That's an embedder thread because there is no ThreadFactory for java.util.Timer
. We don't control what it does so we can't reuse it for this. We also can't reuse the ReferenceProcessingThread because that's optional. My concern is we start to need a lot of threads to workaround Truffle limitations.
@chumer @jchalou What's is the envisioned solution here? Maybe tools should use embedder threads and not system threads to avoid this problem? It is a breaking change in CPUSampler from Ruby's POV.
@eregon Tools must use system threads to correctly work with Truffle isolates. The ordinary Java thread is not able to do a downcall to the host VM.
@tzezula Why not? By embedder thread
I mean new java.lang.Thread()
.
The problem with java.lang.Thread
in Truffle isolate is that it's not attached to host (HotSpot) JavaVM and has no host JNIEnv. When such a thread performs an operation that needs call to host, for example logging, it will fail.
The system threads created in the Truffle isolate are always attached to host in the enterpise implementation of AbstractPolyglotImpl#createThreadScope().
@tzezula, we handle embedder threads in polyglot isolates. The problem is with threads created by the language that are not system or polyglot threads. They are unsupported now but will only fail if you use polyglot isolates.
No embedder threads are not the solution. Embedder threads are only intended for embedders, as the name suggests.
What changed here is the CPUSampler now uses a system thread vs a embedder thread before, and system threads don't allow entering the context.
Why would you do anything on a CPU sampler thread? Sorry, I cannot yet see why system threads are problematic here. Can you elaborate?
Why would you do anything on a CPU sampler thread? Sorry, I cannot yet see why system threads are problematic here. > Can you elaborate?
So CPUSampler creates a samplerExecutionService system thread. And it submitThreadLocal
. That then calls the various Interrupter
s on the current thread, including the Interrupter we care about it here to call the unblocking function of a thread in rb_thread_call_without_gvl()
.
@chumer Right, we handle embedder threads in Truffle isolate. Embedder threads originated in the host VM and are always attached to it. But @eregon uses the term "embedder thread" for ordinary Java thread (neither system nor polyglot) created in the isolate.
This is still an issue, @chumer @tzezula could you recommend a solution?
One idea is if unblock functions do not need Sulong and do not need a Truffle context, we could execute the native function (via JNI, not NFI which needs the context) directly on the "system thread of CPUSampler", without needing to enter the Truffle context.
But https://github.com/oracle/truffleruby/blob/7c5bef1358b704a5cfa67ff17c4c763d567ff9e0/lib/cext/include/ruby/thread.h#L94-L131 and https://github.com/oracle/truffleruby/blob/master/lib/cext/include/ruby/internal/intern/thread.h#L330-L343
do not specify whether the unblock functions of rb_thread_call_without_gvl*
can call rb_*
or so, which would be a problem as those need to be entered in the context as almost all of them end up running some Sulong or Ruby code.
It might hold though.
self-note: some cleanup at bd/system-thread-cext-interrupter
. I can try if that'd work after https://github.com/oracle/truffleruby/pull/3356 is merged
I chatted with @chumer. There is no perfect solution right now, so we have to use an extra (polyglot) thread or use JNI and then expect the unblock function will not call back to the Ruby C API (if so we need to test how it fails when that does not hold, it would need to be a clear error and e.g. not a segfault). Longer-term there might be some kind of async-profiler for Truffle or the functionality to read the guest stacktrace of another thread without needing to run code on that thread, i.e., also working when that thread is in a blocking native call.
if so we need to test how it fails when that does not hold, it would need to be a clear error and e.g. not a segfault.
With asserts on it's rather clear:
$ jt test -e rb_thread_call_without_gvl spec/ruby/optional/capi/thread_spec.rb -- --cpusampler
Exception in thread "SIGHUP handler" java.lang.AssertionError: No polyglot context is entered. A language or context reference must not be used if there is no polyglot context entered.
at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotFastThreadLocals.assertValidGet(PolyglotFastThreadLocals.java:481)
at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotFastThreadLocals$ContextReferenceImpl.get(PolyglotFastThreadLocals.java:513)
at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.LibFFIContext.get(LibFFIContext.java:359)
but without it's not:
Exception in thread "SIGHUP handler" java.lang.NullPointerException: Cannot invoke "com.oracle.truffle.nfi.backend.libffi.LibFFIContext.executeNative(long, long, byte[], int, int[], Object[], byte[])" because "ctx" is null
at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.LibFFISignature$CachedSignatureInfo.execute(LibFFISignature.java:322)
at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.FunctionExecuteNode$SignatureExecuteNode.execute(FunctionExecuteNode.java:151)
at org.graalvm.truffle/com.oracle.truffle.api.impl.DefaultCallTarget.callDirectOrIndirect(DefaultCallTarget.java:85)
at org.graalvm.truffle/com.oracle.truffle.api.impl.DefaultDirectCallNode.call(DefaultDirectCallNode.java:59)
I tried creating a new thread in CExtInterrupter#interrupt
but that does not work, because the context lock is held while the sampler ThreadLocalAction is running:
"Sampling thread" #34 [90231] daemon prio=5 os_prio=0 cpu=144.69ms elapsed=7.08s tid=0x00005627f2a4ff00 nid=90231 in Object.wait() [0x00007f8304c2d000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait0(java.base@21.0.1/Native Method)
- waiting on <0x00000007ffc1b470> (a com.oracle.truffle.polyglot.PolyglotThread)
at java.lang.Object.wait(java.base@21.0.1/Object.java:366)
at java.lang.Thread.join(java.base@21.0.1/Thread.java:2078)
- locked <0x00000007ffc1b470> (a com.oracle.truffle.polyglot.PolyglotThread)
at java.lang.Thread.join(java.base@21.0.1/Thread.java:2154)
at org.truffleruby.core.thread.ThreadNodes$CallWithUnblockingFunctionNode$CExtInterrupter.interrupt(org.graalvm.ruby/ThreadNodes.java:722)
at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.setFastPendingAndInterrupt(org.graalvm.truffle/ThreadLocalHandshake.java:487)
at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.addHandshakeImpl(org.graalvm.truffle/ThreadLocalHandshake.java:474)
at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.addHandshake(org.graalvm.truffle/ThreadLocalHandshake.java:465)
at com.oracle.truffle.api.impl.ThreadLocalHandshake.addHandshakes(org.graalvm.truffle/ThreadLocalHandshake.java:136)
at com.oracle.truffle.api.impl.ThreadLocalHandshake.runThreadLocal(org.graalvm.truffle/ThreadLocalHandshake.java:125)
at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:309)
- locked <0x0000000418811aa0> (a com.oracle.truffle.polyglot.PolyglotContextImpl)
at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:236)
at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:232)
at com.oracle.truffle.polyglot.EngineAccessor$EngineImpl.submitThreadLocal(org.graalvm.truffle/EngineAccessor.java:1812)
at com.oracle.truffle.api.instrumentation.TruffleInstrument$Env.submitThreadLocal(org.graalvm.truffle/TruffleInstrument.java:1336)
at com.oracle.truffle.tools.profiler.SafepointStackSampler.sample(com.oracle.truffle.tools.profiler/SafepointStackSampler.java:101)
at com.oracle.truffle.tools.profiler.CPUSampler$SamplingTask.run(com.oracle.truffle.tools.profiler/CPUSampler.java:753)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21.0.1/Executors.java:572)
at java.util.concurrent.FutureTask.runAndReset(java.base@21.0.1/FutureTask.java:358)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@21.0.1/ScheduledThreadPoolExecutor.java:305)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.1/ThreadPoolExecutor.java:1144)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.1/ThreadPoolExecutor.java:642)
at java.lang.Thread.runWith(java.base@21.0.1/Thread.java:1596)
at java.lang.Thread.run(java.base@21.0.1/Thread.java:1583)
at com.oracle.truffle.polyglot.SystemThread.run(org.graalvm.truffle/SystemThread.java:68)
And the main thread ends up stuck, waiting for that lock:
"main" #1 [90109] prio=5 os_prio=0 cpu=6872.82ms elapsed=7.58s tid=0x00005627f242c340 nid=90109 waiting for monitor entry [0x00007ffd2eb8e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:248)
- waiting to lock <0x0000000418811aa0> (a com.oracle.truffle.polyglot.PolyglotContextImpl)
at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:236)
at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:232)
at com.oracle.truffle.polyglot.EngineAccessor$EngineImpl.submitThreadLocal(org.graalvm.truffle/EngineAccessor.java:1812)
at com.oracle.truffle.api.TruffleLanguage$Env.submitThreadLocalInternal(org.graalvm.truffle/TruffleLanguage.java:3729)
at com.oracle.truffle.api.TruffleLanguage$Env.submitThreadLocal(org.graalvm.truffle/TruffleLanguage.java:3686)
at org.truffleruby.language.SafepointManager.pauseAllThreadsAndExecute(org.graalvm.ruby/SafepointManager.java:52)
at org.truffleruby.language.SafepointManager.pauseRubyThreadAndExecute(org.graalvm.ruby/SafepointManager.java:44)
at org.truffleruby.core.thread.ThreadNodes$WakeupNode.wakeup(org.graalvm.ruby/ThreadNodes.java:640)
EDIT: it fixes the deadlock to not join() the thread there, so that's fine. Then it's the same failure as with JNI (wakeup raising ThreadError: killed thread)
I also tried just calling the unblock function via JNI (which would avoid the overhead of an extra thread and communicating with it for each sampling ThreadLocalAction).
That seems to work for spec/ruby/optional/capi/thread_spec.rb
, but the spec also fails as a side-effect with wakeup
raising ThreadError: killed thread
since the unblock already ran and so the thread already terminated.
Which means CPUSampler has that side effect, unless the blocking function automatically retries after checking interrupts.
Obviously, not great but not sure there is a solution.
Also not clear to me if it's safe to call the unblock function multiple times, but it probably needs to be e.g. if the blocking function retries after checking interrupts (= rb_thread_check_ints()
).
It'd be good to look at real-world usages of rb_thread_call_without_gvl()
and see how it behaves after being interrupted.
Ideally they would just retry after checking interrupts, but not sure that's what they do.
Some more thoughts here:
rb_thread_call_with_gvl()
and leaveAndEnter() does not support nested enter (IIRC) it's not an option. It would also of course not account the blocking functions at all which is not great.Ideally, I don't think we'd have to deal with the unblock function at all. I suppose that's what you mean with an analog to async_profiler. I don't want collecting a profile to affect the behavior of running code beyond performance overhead. Interrupting a blocking call that would naturally unblock is not ideal.
With that said, I'd settle for whatever we can do. I'm finding the sampling profiler doesn't work with a real world Rails application. There's a lot of thread calls going on without the GVL in the database driver, OpenSSL, ZLib, and others.
Ideally, I don't think we'd have to deal with the unblock function at all. I suppose that's what you mean with an analog to async_profiler. I don't want collecting a profile to affect the behavior of running code beyond performance overhead. Interrupting a blocking call that would naturally unblock is not ideal.
Yes, exactly, ideally we wouldn't need to, but currently the only way to get a guest stacktrace of another thread is to run TruffleRuntime#iterateFrames
on that thread, which means using TruffleSafepoint and reaching a guest safepoint reasonably quickly.
With that said, I'd settle for whatever we can do. I'm finding the sampling profiler doesn't work with a real world Rails application. There's a lot of thread calls going on without the GVL in the database driver, OpenSSL, ZLib, and others.
Note that many calls to rb_thread_call_without_gvl()
use RUBY_UBF_IO
as the unblock "function", which is sending SIGVTALRM to interrupt it and we already do this with JNI so that should work fine, except of course it also has the side effect of interrupting that blocking call (for every sample while in it) while it would not be interrupted in CRuby.
Ideally when such a blocking call is interrupted it would check interrupts (rb_thread_check_ints()
) and then retry (e.g. until it doesn't get EINTR when using RUBY_UBF_IO
).
So for example it should not abort and not retry if the unblock function is called.
But I'm not sure all calls to rb_thread_call_without_gvl()
do this correctly.
They should though because CRuby uses the unblock function e.g. to run a Ruby-defined signal handler on the main Ruby thread and that can happen multiple times and should have side effects on the program besides what the signal handler does.
Fix in https://github.com/oracle/truffleruby/pull/3385.
I verified that with the fix truffleruby succeeds in capturing samples while a thread is inside rb_thread_call_without_gvl()
with a custom unblock function.
The profile for that thread is mostly attributed to rb_thread_check_ints()/rb_thread_call_with_gvl() because that's where the guest safepoint poll happens, which makes sense and is expected although the actual time is mostly spent in the blocking function but that itself can't be polled so that's expected safepoint bias. We still see those are above the rb_thread_call_without_gvl()
in the flamegraph so it's rather clear.
Here is the flamegraph from the new spec for rb_thread_call_without_gvl()
with a custom unblock function that I'm talking about above:
And before the fix (on truffleruby-dev native) it looked like this: i.e. it could not get any sample while a thread was inside rb_thread_call_without_gvl()` with a custom unblock function.
cpusampler
is yielding threading errors a various runs. Example when running the thread spec tests:The output contains a few tens of:
Another example is when running rails tests (not necessarily related to multithreading).