Closed nirvdrum closed 1 week ago
Stack: [0x00007b9840af4000,0x00007b9840bf4000], sp=0x00007b9840bf27a0, free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libffi.so.8+0x383a]
C [libtrufflenfi.so+0x723c] Java_com_oracle_truffle_nfi_backend_libffi_ClosureNativePointer_freeClosure+0x6c
j com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer.freeClosure(J)V+0 com.oracle.truffle.truffle_nfi_libffi
j com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer$NativeDestructor.destroy()V+4 com.oracle.truffle.truffle_nfi_libffi
j com.oracle.truffle.nfi.backend.libffi.NativeAllocation$1.run()V+22 com.oracle.truffle.truffle_nfi_libffi
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23
j java.lang.Thread.run()V+19 java.base@23
v ~StubRoutines::call_stub 0x00007b9866d03ca6
V [libjvm.so+0x8d8ebb] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x2db
V [libjvm.so+0x8da822] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1c2
V [libjvm.so+0x9b22ac] thread_entry(JavaThread*, JavaThread*)+0x8c
V [libjvm.so+0x8ef3a8] JavaThread::thread_main_inner() [clone .part.0]+0xb8
V [libjvm.so+0xeab1df] Thread::call_run()+0x9f
V [libjvm.so+0xcc8095] thread_native_entry(Thread*)+0xd5
C [libc.so.6+0x9ca94]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer.freeClosure(J)V+0 com.oracle.truffle.truffle_nfi_libffi
j com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer$NativeDestructor.destroy()V+4 com.oracle.truffle.truffle_nfi_libffi
j com.oracle.truffle.nfi.backend.libffi.NativeAllocation$1.run()V+22 com.oracle.truffle.truffle_nfi_libffi
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23
j java.lang.Thread.run()V+19 java.base@23
v ~StubRoutines::call_stub 0x00007b9866d03ca6
So that sounds like an issue in TruffleNFI.
Could you try with 24.0.1 (JVM) too?
I'm sorry. I had tested with 24.0.1 but forgot to note it. I'm only seeing the problem with the 24.1.0-dev GFTC JVM build. I don't see it with native builds and I don't see it with a CE JVM build. I also tried with the cext lock enabled and disabled -- that has no impact. The stack does look NFI related, but I wonder if it's something about the pg driver. I tried the sqlite3 benchmark and that didn't crash.
I can reproduce this using your docker containers. Strangely enough I can't reproduce it on my host system.
I'm pretty sure the issue is that there is a second libffi coming from somewhere. The first one is statically linked into libtrufflenfi.so. Not sure where the second one comes from; this might just be a transitive library dependency, either of hotspot or the postgres driver.
What's happening here is that the dynamic loader is confusing those two libraries, and it seems to be mixing symbols from them. E.g. it uses ffi_closure_alloc from our libffi, but ffi_closure_free from the other one. And that leads to the segfault.
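To see which copies of libffi end up in the process, one option is to filter a memory map listing for the two library names. This is only a sketch: `loaded_ffi_libraries` is a hypothetical helper, and it assumes input in the Linux `/proc/<pid>/maps` layout, where the sixth field is the mapped file.

```shell
# Print the distinct libffi / libtrufflenfi files found in a memory map listing.
# Expects /proc/<pid>/maps style lines on stdin:
#   <range> <perms> <offset> <dev> <inode> <path>
loaded_ffi_libraries() {
    awk '$6 ~ /libffi|libtrufflenfi/ { print $6 }' | sort -u
}

# Example (the pid is a placeholder):
#   loaded_ffi_libraries < /proc/12345/maps
```

If both libtrufflenfi.so and a separate libffi.so show up, there are two libffi copies in the process, since the first one is statically linked into libtrufflenfi.so.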
I tried to rename all the libffi symbols in libtrufflenfi.so manually, and that seems to fix the issue. I'm not 100% sure how to actually do this without manually messing with the libtrufflenfi.so, but there has to be some way. objcopy --redefine-syms unfortunately doesn't work: it renames only the static symbols, and we need to rename the dynamic symbols.
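A quick way to look for overlapping names, as a minimal sketch assuming GNU binutils `nm` is installed: list the `ffi_*` symbols each library exports in its dynamic symbol table and compare. The helper names `filter_ffi_symbols` and `exported_ffi_symbols` are made up for this example, and the paths in the comments are placeholders.

```shell
# Keep only ffi_* names from `nm -D --defined-only` output read on stdin.
# Each input line looks like "<address> <type> <name>"; a trailing
# "@VERSION" or "@@VERSION" suffix on the name is stripped.
filter_ffi_symbols() {
    awk 'NF == 3 { name = $3; sub(/@.*/, "", name); if (name ~ /^ffi_/) print name }' | sort -u
}

# Print the ffi_* symbols that one shared library exports dynamically.
exported_ffi_symbols() {
    nm -D --defined-only "$1" | filter_ffi_symbols
}

# Example (paths are placeholders):
#   exported_ffi_symbols path/to/libtrufflenfi.so
#   exported_ffi_symbols path/to/libffi.so.8
```

Names that appear in both lists are the ones the dynamic loader can bind to either copy.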
@rschatz Interesting. If it helps any, I'm seeing the crash when running on my Ubuntu 24.04 host. Is there something in particular I can search for that would help you see if it's a naming conflict?
This was actually easier than I thought. Just adding -fvisibility=hidden to the libffi build fixes the problem, no need to actually rename any symbols.
I made a PR: https://github.com/oracle/graal/pull/9146
For convenience I made the PR based on the commit of the 24.1.0-ea10 build. If you want to try it out, you can just cd truffle; mx build, and swap out the libtrufflenfi.so in the GFTC build.
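The swap itself could look roughly like this. It is a sketch only: `swap_library` is a hypothetical helper, and both paths are placeholders, since the thread does not say where the rebuilt library lands or where it sits inside the GFTC build.

```shell
# Replace an installed library with a rebuilt one, keeping a backup.
# $1 = rebuilt library, $2 = library inside the GFTC build (both placeholders).
swap_library() {
    new_lib=$1
    installed_lib=$2
    [ -f "$new_lib" ] || { echo "missing: $new_lib" >&2; return 1; }
    [ -f "$installed_lib" ] || { echo "missing: $installed_lib" >&2; return 1; }
    # Back up only once, so repeated swaps never overwrite the original copy.
    [ -e "$installed_lib.orig" ] || cp -p "$installed_lib" "$installed_lib.orig"
    cp -p "$new_lib" "$installed_lib"
}

# Example (paths are placeholders):
#   swap_library path/to/rebuilt/libtrufflenfi.so path/to/gftc/libtrufflenfi.so
```

Keeping the `.orig` copy makes it easy to switch back and confirm that the crash returns with the original library.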
This fixes the problem for me on your containers.
Thanks. I can confirm the process no longer segfaults.
While running the benchmarks from the ORM benchmarks discussion, I ran into a segfault using the latest 24.1.0-dev GFTC JVM builds. I haven't seen the issue with the GFTC native builds. The crash occurs 100% reliably on my Ubuntu 24.04 x86_64 system.
Steps:
cd activerecord_truffleruby
bundle install
Set the DATABASE_URL environment variable to connect into the container (e.g., the value is postgres://postgres:postgres@localhost:36319/TestAR on my machine because local port 36319 forwards to 5432 in the container)
ruby benchmark.rb
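Put together, the steps could be scripted roughly as follows. This is a sketch: `database_url` and `run_benchmark` are hypothetical helpers, the credentials and database name are copied from the example URL above, and the port has to be whichever local port forwards to 5432 in the container.

```shell
# Build the connection URL for a given local port.
database_url() {
    printf 'postgres://postgres:postgres@localhost:%s/TestAR\n' "$1"
}

# Run the reported steps in a subshell so the caller's working directory is unchanged.
# $1 = local port forwarded to the container's 5432.
run_benchmark() {
    (
        cd activerecord_truffleruby &&
        bundle install &&
        DATABASE_URL=$(database_url "$1") ruby benchmark.rb
    )
}

# Example, using the port from the report:
#   run_benchmark 36319
```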
hs_err_pid700062.log
internal issue: [GR-54771]