oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/

Segfault in 24.1.0-dev GFTC builds with pg driver #3590

Closed nirvdrum closed 1 week ago

nirvdrum commented 2 weeks ago

While running the benchmarks from the ORM benchmarks discussion, I ran into a segfault using the latest 24.1.0-dev GFTC JVM builds. I haven't seen the issue with the GFTC native builds. The crash occurs 100% reliably on my Ubuntu 24.04 x86_64 system.

> ruby -v
truffleruby 24.1.0-dev-51b497f9, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]

Steps:

  1. Install the latest 24.1.0-dev GFTC build (24.1.0-ea10 at the moment)
  2. Clone the ORM benchmark repo
  3. cd activerecord_truffleruby
  4. bundle install
  5. Start the PostgreSQL container (either Docker or Podman)
  6. Set the DATABASE_URL environment variable to connect to the container (e.g., the value is postgres://postgres:postgres@localhost:36319/TestAR on my machine because local port 36319 forwards to 5432 in the container)
  7. Run ruby benchmark.rb
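
Step 6 can be sketched as a small shell helper. The function name database_url is hypothetical; the user, password, host, and default database name are simply the ones from the example URL above, and the port is whatever host port forwards to 5432 in your container.

```shell
# Hypothetical helper (not part of the benchmark repo): builds the
# connection URL used in step 6.
#   $1: host port forwarded to 5432 in the container
#   $2: database name (defaults to TestAR, as in the example above)
database_url() {
    printf 'postgres://postgres:postgres@localhost:%s/%s\n' "$1" "${2:-TestAR}"
}

# 36319 is the forwarded port on the reporter's machine; yours will differ.
DATABASE_URL="$(database_url 36319)"
export DATABASE_URL
echo "$DATABASE_URL"
# Then, from activerecord_truffleruby: ruby benchmark.rb
```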

hs_err_pid700062.log

internal issue: [GR-54771]

eregon commented 2 weeks ago

Stack: [0x00007b9840af4000,0x00007b9840bf4000],  sp=0x00007b9840bf27a0,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libffi.so.8+0x383a]
C  [libtrufflenfi.so+0x723c]  Java_com_oracle_truffle_nfi_backend_libffi_ClosureNativePointer_freeClosure+0x6c
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer.freeClosure(J)V+0 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer$NativeDestructor.destroy()V+4 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.NativeAllocation$1.run()V+22 com.oracle.truffle.truffle_nfi_libffi
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23
j  java.lang.Thread.run()V+19 java.base@23
v  ~StubRoutines::call_stub 0x00007b9866d03ca6
V  [libjvm.so+0x8d8ebb]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x2db
V  [libjvm.so+0x8da822]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1c2
V  [libjvm.so+0x9b22ac]  thread_entry(JavaThread*, JavaThread*)+0x8c
V  [libjvm.so+0x8ef3a8]  JavaThread::thread_main_inner() [clone .part.0]+0xb8
V  [libjvm.so+0xeab1df]  Thread::call_run()+0x9f
V  [libjvm.so+0xcc8095]  thread_native_entry(Thread*)+0xd5
C  [libc.so.6+0x9ca94]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer.freeClosure(J)V+0 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer$NativeDestructor.destroy()V+4 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.NativeAllocation$1.run()V+22 com.oracle.truffle.truffle_nfi_libffi
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23
j  java.lang.Thread.run()V+19 java.base@23
v  ~StubRoutines::call_stub 0x00007b9866d03ca6

So that sounds like an issue in TruffleNFI.

Could you try with 24.0.1 (JVM) too?

nirvdrum commented 2 weeks ago

I'm sorry. I had tested with 24.0.1 but forgot to note it. I'm only seeing the problem with the 24.1.0-dev GFTC JVM build. I don't see it with native builds and I don't see it with a CE JVM build. I also tried with the cext lock enabled and disabled -- that had no impact. The stack does look NFI-related, but I wonder if it's something about the pg driver. I tried the sqlite3 benchmark and that didn't crash.

rschatz commented 2 weeks ago

I can reproduce this using your docker containers. Strangely enough I can't reproduce it on my host system.

I'm pretty sure the issue is that there is a second libffi coming from somewhere. The first one is statically linked into libtrufflenfi.so. I'm not sure where the second one comes from; it might just be a transitive library dependency, either of HotSpot or of the postgres driver.

What's happening here is that the dynamic loader is confusing those two libraries and seems to be mixing symbols from them: e.g., it uses ffi_closure_alloc from our libffi but ffi_closure_free from the other one, and that leads to the segfault.
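
One way to look for a second libffi is to inspect the memory maps of the running JVM process. The following is a minimal sketch, assuming a Linux system with /proc; the function name ffi_libraries is made up for illustration. It lists every distinct mapped file whose path mentions libffi or libtrufflenfi.

```shell
# Hypothetical helper: reads /proc/<pid>/maps-style lines on stdin and
# prints each distinct mapped file whose path mentions libffi or
# libtrufflenfi. Field 6 of a maps line is the backing file, if any.
ffi_libraries() {
    awk '$6 ~ /libffi|libtrufflenfi/ { print $6 }' | sort -u
}

# Usage, with $pid set to the process ID of the JVM running benchmark.rb:
#   ffi_libraries < "/proc/$pid/maps"
```

If a system libffi.so shows up next to libtrufflenfi.so (which carries its own statically linked copy), two copies of libffi are present in the process. That matches the stack trace above, where the frame called from freeClosure is in libffi.so.8.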

I tried renaming all the libffi symbols in libtrufflenfi.so manually, and that seems to fix the issue. I'm not 100% sure how to do this without manually patching libtrufflenfi.so, but there has to be some way. Unfortunately, objcopy --redefine-sym doesn't work: it renames only the static symbols, and we need to rename the dynamic symbols.
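
The static/dynamic distinction matters because the dynamic loader resolves calls against the dynamic symbol table only, which is what nm -D prints. As a rough sketch (the function name filter_ffi_exports is hypothetical, and the library path is a placeholder), the libffi symbols that a shared object exports can be listed like this:

```shell
# Hypothetical helper: reads `nm -D --defined-only` output on stdin
# (address, type, name) and prints the globally (T) or weakly (W)
# defined symbols whose names start with ffi_.
filter_ffi_exports() {
    awk '$2 ~ /^[TW]$/ && $3 ~ /^ffi_/ { print $3 }'
}

# Usage (substitute the real path inside your GFTC build):
#   nm -D --defined-only /path/to/libtrufflenfi.so | filter_ffi_exports
```

Any ffi_* name printed here is visible to the dynamic loader and can therefore clash with the same name exported by another libffi in the process.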

nirvdrum commented 2 weeks ago

@rschatz Interesting. If it helps any, I'm seeing the crash when running on my Ubuntu 24.04 host. Is there something in particular I can search for that would help you see if it's a naming conflict?

rschatz commented 2 weeks ago

This was actually easier than I thought. Just adding -fvisibility=hidden to the libffi build fixes the problem, no need to actually rename any symbols.
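
To see the effect of -fvisibility=hidden in isolation, here is a small self-contained sketch. It is not the actual libffi or TruffleNFI build: the function check_export and the symbol demo_closure_free are made up for illustration, and a C compiler (cc) and nm are assumed to be installed. It builds a tiny shared object and reports whether the function ends up in the dynamic symbol table.

```shell
# Hypothetical demo: builds a one-function shared object with the given
# extra compiler flags and prints "exported" if the function appears in
# the dynamic symbol table, "hidden" otherwise.
check_export() {
    dir=$(mktemp -d) || return 1
    # Stand-in for a function from a statically linked library.
    printf 'int demo_closure_free(int x) { return x + 1; }\n' > "$dir/demo.c"
    if ! cc -shared -fPIC "$@" -o "$dir/libdemo.so" "$dir/demo.c"; then
        rm -rf "$dir"
        return 1
    fi
    if nm -D --defined-only "$dir/libdemo.so" | grep -q ' demo_closure_free$'; then
        result=exported
    else
        result=hidden
    fi
    rm -rf "$dir"
    echo "$result"
}

check_export                       # default visibility
check_export -fvisibility=hidden   # symbol stays internal to the library
```

With hidden visibility the function is still usable inside the library, but the dynamic loader no longer sees it, so it can neither be overridden by nor override a same-named symbol from another library.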

I made a PR: https://github.com/oracle/graal/pull/9146

For convenience, I based the PR on the commit of the 24.1.0-ea10 build. If you want to try it out, you can just run cd truffle; mx build, and then swap out the libtrufflenfi.so in the GFTC build.

This fixes the problem for me on your containers.

nirvdrum commented 1 week ago

Thanks. I can confirm the process no longer segfaults.