[Bug / Concurrency] CPU sampler during multi threaded operation yields error: Context cannot be entered on system threads

itarato commented 1 year ago

cpusampler is yielding threading errors a various runs. Example when running the thread spec tests:

jt test spec/ruby/optional/capi/thread_spec.rb -- --cpusampler

The output contains a few tens of:

[ruby] SEVERE: could not unblock thread inside blocking call in C extension because the context does not allow multithreading (Context cannot be entered on system threads.)

Another example is when running rails tests (not necessarily related to multithreading).

eregon commented 1 year ago

Thanks for the report.

So this message comes from https://github.com/oracle/truffleruby/blob/a3d954350c96117cce3de561f2ce8fd0ebc33b39/src/main/java/org/truffleruby/core/thread/ThreadNodes.java#L694-L723

We need to execute the unblock function given to rb_thread_call_without_gvl(), so the thread inside the blocking function of rb_thread_call_without_gvl() can get out of it (e.g. out of a blocking syscall) and reach a Truffle safepoint, and then start its operation again.

What changed here is the CPUSampler now uses a system thread vs a embedder thread before, and system threads don't allow entering the context.

I don't know how to fix this, we need to execute the blocking function on some thread and that requires entering the context:

We can't execute it on the current system thread due to system threads don't allow entering the context (this issue)
We could use a Truffle safepoint/ThreadLocalAction, but to which thread? We can't use the Ruby main thread, maybe that's the one in rb_thread_call_without_gvl(). And we could have recursion problems if we use a ThreadLocalAction inside an Interrupter#interrupt.
Obviously we can't use the thread in rb_thread_call_without_gvl(), that's the one stuck until we run the unblock function.
I suppose we could create a new thread with Env#createThread, but that seems prohibitively expensive for every Interrupter#interrupt, i.e., every ThreadLocalAction affecting that thread while it's in rb_thread_call_without_gvl().
I suppose we could have an extra thread that we keep around for running that, but that is complicated and not working if there is any single-threaded language like JS/Python/etc. OTOH unblocking rb_thread_call_without_gvl() does require at least 2 parallel threads anyway (the thread blocked in it, the thread to run the unblock function), so maybe that thread could be created lazily the first time it is needed and not actually break polyglot. Also that thread will appear e.g. in Ruby's Thread.list which might be unexpected for user code.

We already have the Timer nativeInterruptTimer to send SIGVTALRM to interrupt NFI blocking syscalls. That's an embedder thread because there is no ThreadFactory for java.util.Timer. We don't control what it does so we can't reuse it for this. We also can't reuse the ReferenceProcessingThread because that's optional. My concern is we start to need a lot of threads to workaround Truffle limitations.

@chumer @jchalou What's is the envisioned solution here? Maybe tools should use embedder threads and not system threads to avoid this problem? It is a breaking change in CPUSampler from Ruby's POV.

tzezula commented 1 year ago

@eregon Tools must use system threads to correctly work with Truffle isolates. The ordinary Java thread is not able to do a downcall to the host VM.

eregon commented 1 year ago

@tzezula Why not? By embedder thread I mean new java.lang.Thread().

tzezula commented 1 year ago

The problem with java.lang.Thread in Truffle isolate is that it's not attached to host (HotSpot) JavaVM and has no host JNIEnv. When such a thread performs an operation that needs call to host, for example logging, it will fail. The system threads created in the Truffle isolate are always attached to host in the enterpise implementation of AbstractPolyglotImpl#createThreadScope().

chumer commented 1 year ago

@tzezula, we handle embedder threads in polyglot isolates. The problem is with threads created by the language that are not system or polyglot threads. They are unsupported now but will only fail if you use polyglot isolates.

No embedder threads are not the solution. Embedder threads are only intended for embedders, as the name suggests.

What changed here is the CPUSampler now uses a system thread vs a embedder thread before, and system threads don't allow entering the context.

Why would you do anything on a CPU sampler thread? Sorry, I cannot yet see why system threads are problematic here. Can you elaborate?

eregon commented 1 year ago

Why would you do anything on a CPU sampler thread? Sorry, I cannot yet see why system threads are problematic here. > Can you elaborate?

So CPUSampler creates a samplerExecutionService system thread. And it submitThreadLocal. That then calls the various Interrupters on the current thread, including the Interrupter we care about it here to call the unblocking function of a thread in rb_thread_call_without_gvl().

tzezula commented 1 year ago

@chumer Right, we handle embedder threads in Truffle isolate. Embedder threads originated in the host VM and are always attached to it. But @eregon uses the term "embedder thread" for ordinary Java thread (neither system nor polyglot) created in the isolate.

eregon commented 10 months ago

This is still an issue, @chumer @tzezula could you recommend a solution?

eregon commented 10 months ago

One idea is if unblock functions do not need Sulong and do not need a Truffle context, we could execute the native function (via JNI, not NFI which needs the context) directly on the "system thread of CPUSampler", without needing to enter the Truffle context. But https://github.com/oracle/truffleruby/blob/7c5bef1358b704a5cfa67ff17c4c763d567ff9e0/lib/cext/include/ruby/thread.h#L94-L131 and https://github.com/oracle/truffleruby/blob/master/lib/cext/include/ruby/internal/intern/thread.h#L330-L343 do not specify whether the unblock functions of rb_thread_call_without_gvl* can call rb_* or so, which would be a problem as those need to be entered in the context as almost all of them end up running some Sulong or Ruby code. It might hold though.

self-note: some cleanup at bd/system-thread-cext-interrupter. I can try if that'd work after https://github.com/oracle/truffleruby/pull/3356 is merged

eregon commented 10 months ago

I chatted with @chumer. There is no perfect solution right now, so we have to use an extra (polyglot) thread or use JNI and then expect the unblock function will not call back to the Ruby C API (if so we need to test how it fails when that does not hold, it would need to be a clear error and e.g. not a segfault). Longer-term there might be some kind of async-profiler for Truffle or the functionality to read the guest stacktrace of another thread without needing to run code on that thread, i.e., also working when that thread is in a blocking native call.

eregon commented 10 months ago

if so we need to test how it fails when that does not hold, it would need to be a clear error and e.g. not a segfault.

With asserts on it's rather clear:

$ jt test -e rb_thread_call_without_gvl spec/ruby/optional/capi/thread_spec.rb -- --cpusampler
Exception in thread "SIGHUP handler" java.lang.AssertionError: No polyglot context is entered. A language or context reference must not be used if there is no polyglot context entered.
    at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotFastThreadLocals.assertValidGet(PolyglotFastThreadLocals.java:481)
    at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotFastThreadLocals$ContextReferenceImpl.get(PolyglotFastThreadLocals.java:513)
    at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.LibFFIContext.get(LibFFIContext.java:359)

but without it's not:

Exception in thread "SIGHUP handler" java.lang.NullPointerException: Cannot invoke "com.oracle.truffle.nfi.backend.libffi.LibFFIContext.executeNative(long, long, byte[], int, int[], Object[], byte[])" because "ctx" is null
    at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.LibFFISignature$CachedSignatureInfo.execute(LibFFISignature.java:322)
    at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.FunctionExecuteNode$SignatureExecuteNode.execute(FunctionExecuteNode.java:151)
    at org.graalvm.truffle/com.oracle.truffle.api.impl.DefaultCallTarget.callDirectOrIndirect(DefaultCallTarget.java:85)
    at org.graalvm.truffle/com.oracle.truffle.api.impl.DefaultDirectCallNode.call(DefaultDirectCallNode.java:59)

eregon commented 10 months ago

I tried creating a new thread in CExtInterrupter#interrupt but that does not work, because the context lock is held while the sampler ThreadLocalAction is running:

"Sampling thread" #34 [90231] daemon prio=5 os_prio=0 cpu=144.69ms elapsed=7.08s tid=0x00005627f2a4ff00 nid=90231 in Object.wait()  [0x00007f8304c2d000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait0(java.base@21.0.1/Native Method)
    - waiting on <0x00000007ffc1b470> (a com.oracle.truffle.polyglot.PolyglotThread)
    at java.lang.Object.wait(java.base@21.0.1/Object.java:366)
    at java.lang.Thread.join(java.base@21.0.1/Thread.java:2078)
    - locked <0x00000007ffc1b470> (a com.oracle.truffle.polyglot.PolyglotThread)
    at java.lang.Thread.join(java.base@21.0.1/Thread.java:2154)
    at org.truffleruby.core.thread.ThreadNodes$CallWithUnblockingFunctionNode$CExtInterrupter.interrupt(org.graalvm.ruby/ThreadNodes.java:722)
    at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.setFastPendingAndInterrupt(org.graalvm.truffle/ThreadLocalHandshake.java:487)
    at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.addHandshakeImpl(org.graalvm.truffle/ThreadLocalHandshake.java:474)
    at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.addHandshake(org.graalvm.truffle/ThreadLocalHandshake.java:465)
    at com.oracle.truffle.api.impl.ThreadLocalHandshake.addHandshakes(org.graalvm.truffle/ThreadLocalHandshake.java:136)
    at com.oracle.truffle.api.impl.ThreadLocalHandshake.runThreadLocal(org.graalvm.truffle/ThreadLocalHandshake.java:125)
    at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:309)
    - locked <0x0000000418811aa0> (a com.oracle.truffle.polyglot.PolyglotContextImpl)
    at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:236)
    at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:232)
    at com.oracle.truffle.polyglot.EngineAccessor$EngineImpl.submitThreadLocal(org.graalvm.truffle/EngineAccessor.java:1812)
    at com.oracle.truffle.api.instrumentation.TruffleInstrument$Env.submitThreadLocal(org.graalvm.truffle/TruffleInstrument.java:1336)
    at com.oracle.truffle.tools.profiler.SafepointStackSampler.sample(com.oracle.truffle.tools.profiler/SafepointStackSampler.java:101)
    at com.oracle.truffle.tools.profiler.CPUSampler$SamplingTask.run(com.oracle.truffle.tools.profiler/CPUSampler.java:753)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21.0.1/Executors.java:572)
    at java.util.concurrent.FutureTask.runAndReset(java.base@21.0.1/FutureTask.java:358)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@21.0.1/ScheduledThreadPoolExecutor.java:305)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.1/ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.1/ThreadPoolExecutor.java:642)
    at java.lang.Thread.runWith(java.base@21.0.1/Thread.java:1596)
    at java.lang.Thread.run(java.base@21.0.1/Thread.java:1583)
    at com.oracle.truffle.polyglot.SystemThread.run(org.graalvm.truffle/SystemThread.java:68)

And the main thread ends up stuck, waiting for that lock:

"main" #1 [90109] prio=5 os_prio=0 cpu=6872.82ms elapsed=7.58s tid=0x00005627f242c340 nid=90109 waiting for monitor entry  [0x00007ffd2eb8e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:248)
    - waiting to lock <0x0000000418811aa0> (a com.oracle.truffle.polyglot.PolyglotContextImpl)
    at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:236)
    at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:232)
    at com.oracle.truffle.polyglot.EngineAccessor$EngineImpl.submitThreadLocal(org.graalvm.truffle/EngineAccessor.java:1812)
    at com.oracle.truffle.api.TruffleLanguage$Env.submitThreadLocalInternal(org.graalvm.truffle/TruffleLanguage.java:3729)
    at com.oracle.truffle.api.TruffleLanguage$Env.submitThreadLocal(org.graalvm.truffle/TruffleLanguage.java:3686)
    at org.truffleruby.language.SafepointManager.pauseAllThreadsAndExecute(org.graalvm.ruby/SafepointManager.java:52)
    at org.truffleruby.language.SafepointManager.pauseRubyThreadAndExecute(org.graalvm.ruby/SafepointManager.java:44)
    at org.truffleruby.core.thread.ThreadNodes$WakeupNode.wakeup(org.graalvm.ruby/ThreadNodes.java:640)

EDIT: it fixes the deadlock to not join() the thread there, so that's fine. Then it's the same failure as with JNI (wakeup raising ThreadError: killed thread)

eregon commented 10 months ago

I also tried just calling the unblock function via JNI (which would avoid the overhead of an extra thread and communicating with it for each sampling ThreadLocalAction). That seems to work for spec/ruby/optional/capi/thread_spec.rb, but the spec also fails as a side-effect with wakeup raising ThreadError: killed thread since the unblock already ran and so the thread already terminated. Which means CPUSampler has that side effect, unless the blocking function automatically retries after checking interrupts. Obviously, not great but not sure there is a solution. Also not clear to me if it's safe to call the unblock function multiple times, but it probably needs to be e.g. if the blocking function retries after checking interrupts (= rb_thread_check_ints()).

It'd be good to look at real-world usages of rb_thread_call_without_gvl() and see how it behaves after being interrupted. Ideally they would just retry after checking interrupts, but not sure that's what they do.

eregon commented 10 months ago

Some more thoughts here:

Maybe it would make sense to leaveAndEnter() for calling the blocking function. But since that may use rb_thread_call_with_gvl() and leaveAndEnter() does not support nested enter (IIRC) it's not an option. It would also of course not account the blocking functions at all which is not great.
We could disable safepoints while in the blocking function. This would again not account it, but then also there would be no need to interrupt it. But this wouldn't work for e.g. Ctrl+C/signal traps which need to run on that thread, so that's not an option either.
Maybe we could check if the CPUSampler is active and if so eagerly capture the guest stacktrace for that thread before executing the blocking function. It would have more overhead if there is 0 samples during that blocking function though (i.e., if it runs for less than 10ms).
The old CPUSampler with ShadowStack would work well for that case and effectively provide the ability to query another thread stack at any time without any synchronization with it or running any code on it. OTOH it added an overhead for every CallTarget call (in call mode). IMO it would be valuable to add it again as an option, we still have the ShadowStack impl anyway for other tools.
Obviously if there was a JVMCI API to get the guest stacktrace of another thread without needing to run code on it that'd be ideal, but it might take a while.

nirvdrum commented 10 months ago

Ideally, I don't think we'd have to deal with the unblock function at all. I suppose that's what you mean with an analog to async_profiler. I don't want collecting a profile to affect the behavior of running code beyond performance overhead. Interrupting a blocking call that would naturally unblock is not ideal.

With that said, I'd settle for whatever we can do. I'm finding the sampling profiler doesn't work with a real world Rails application. There's a lot of thread calls going on without the GVL in the database driver, OpenSSL, ZLib, and others.

eregon commented 10 months ago

Ideally, I don't think we'd have to deal with the unblock function at all. I suppose that's what you mean with an analog to async_profiler. I don't want collecting a profile to affect the behavior of running code beyond performance overhead. Interrupting a blocking call that would naturally unblock is not ideal.

Yes, exactly, ideally we wouldn't need to, but currently the only way to get a guest stacktrace of another thread is to run TruffleRuntime#iterateFrames on that thread, which means using TruffleSafepoint and reaching a guest safepoint reasonably quickly.

With that said, I'd settle for whatever we can do. I'm finding the sampling profiler doesn't work with a real world Rails application. There's a lot of thread calls going on without the GVL in the database driver, OpenSSL, ZLib, and others.

Note that many calls to rb_thread_call_without_gvl() use RUBY_UBF_IO as the unblock "function", which is sending SIGVTALRM to interrupt it and we already do this with JNI so that should work fine, except of course it also has the side effect of interrupting that blocking call (for every sample while in it) while it would not be interrupted in CRuby.

Ideally when such a blocking call is interrupted it would check interrupts (rb_thread_check_ints()) and then retry (e.g. until it doesn't get EINTR when using RUBY_UBF_IO). So for example it should not abort and not retry if the unblock function is called. But I'm not sure all calls to rb_thread_call_without_gvl() do this correctly. They should though because CRuby uses the unblock function e.g. to run a Ruby-defined signal handler on the main Ruby thread and that can happen multiple times and should have side effects on the program besides what the signal handler does.

eregon commented 9 months ago

Fix in https://github.com/oracle/truffleruby/pull/3385. I verified that with the fix truffleruby succeeds in capturing samples while a thread is inside rb_thread_call_without_gvl() with a custom unblock function. The profile for that thread is mostly attributed to rb_thread_check_ints()/rb_thread_call_with_gvl() because that's where the guest safepoint poll happens, which makes sense and is expected although the actual time is mostly spent in the blocking function but that itself can't be polled so that's expected safepoint bias. We still see those are above the rb_thread_call_without_gvl() in the flamegraph so it's rather clear.

eregon commented 9 months ago

Here is the flamegraph from the new spec for rb_thread_call_without_gvl() with a custom unblock function that I'm talking about above:

And before the fix (on truffleruby-dev native) it looked like this: flamegraph i.e. it could not get any sample while a thread was inside rb_thread_call_without_gvl()` with a custom unblock function.

oracle / truffleruby

[Bug / Concurrency] CPU sampler during multi threaded operation yields error: Context cannot be entered on system threads #3013