jnr / jnr-ffi

Java Abstracted Foreign Function Layer

JVM crashes on setting callback for GTK3 signals #281

Closed praj-foss closed 2 years ago

praj-foss commented 2 years ago

Hello there!

I'm currently learning JNR by trying out various Linux libraries, most recently GTK3. I used this example as a reference and wrote the new demo that can be found here. It crashes badly when I try to run it (using ./gradlew gtk3:run). Here's the crash log: hs_err_pid7667.log. I use GraalVM 21.1.0 as my JDK 11 on an x86_64 Linux machine (openSUSE Tumbleweed). My installed GTK version is 3.24.30-2.3.

I can see that it crashes on line 31 of Gtk3App.java, where I make this call:

lib.g_signal_connect_data(application, "activate", onActivate, null, null, 0);

The onActivate is a lambda looking like this:

LibGtk3.GCallback onActivate = (app, data) -> {
    var window = lib.gtk_application_window_new(app);
    var button = lib.gtk_button_new_with_label("Click me");
    lib.gtk_container_add(window, button);
    lib.gtk_widget_show_all(window);
};

which is supposed to act like a function pointer similar to on_app_activate from my C reference:

// callback function which is called when application is first started
static void on_app_activate(GApplication *app, gpointer data) {
    // create a new application window for the application
    // GtkApplication is sub-class of GApplication
    // downcast GApplication* to GtkApplication* with GTK_APPLICATION() macro
    GtkWidget *window = gtk_application_window_new(GTK_APPLICATION(app));
    // a simple push button
    GtkWidget *btn = gtk_button_new_with_label("Click Me!");
    // connect the event-handler for "clicked" signal of button
    g_signal_connect(btn, "clicked", G_CALLBACK(on_button_clicked), NULL);
    // add the button to the window
    gtk_container_add(GTK_CONTAINER(window), btn);
    // display the window
    gtk_widget_show_all(GTK_WIDGET(window));
}

I also had a look at #231 and followed the suggestion there to define onActivate as a public static final variable, but that didn't stop the crash. I don't know why it's crashing; my previous example worked fine with callbacks. It might be an issue specific to GTK3 and its thread management, or to using GraalVM as the JDK, but I have no real leads. Please try running the example if you're on a Linux machine and let me know where the problem is.

headius commented 2 years ago

I suspect there's an alignment or width issue with the arguments, but we need to dig deeper to know for sure.

Can you provide an example, perhaps as a small repository, that I can build and use to reproduce this?

praj-foss commented 2 years ago

Sure, you can see the repository at https://github.com/praj-foss/jnr-demo. The target code is present under the gtk3 directory, and you can try running it using ./gradlew gtk3:run.

Also, there are some changes: I used jnr.ffi.ObjectReferenceManager to store a pointer to my original callback (suggested by the discussion in #231), and used that to pass the function pointer to native methods. It still crashes but the stack trace is different now. Please have a look at the updated crash log: hs_err_pid21253.log

On a side note, I'm actually writing JNR examples for my blog and I'd be happy to contribute to the official docs/examples. Please let me know if I can be of any help.

headius commented 2 years ago

I have managed to reproduce on MacOS and @enebo is confirming that it reproduces on Linux.

If you are good with C libraries, getting a debug build of GTK3 and seeing where it segfaults would clearly be a great help.

On a side note, I'm actually writing JNR examples for my blog and I'd be happy to contribute to the official docs/examples. Please let me know if I can be of any help.

That would be fantastic! We do not get a lot of time to document the library, and our uses of JNR are pretty stable and do not require much maintenance so we rarely run into the edge cases users like you will see.

headius commented 2 years ago

Interestingly, setting the callback to null, so it would be passed in as a null pointer, produces a different result: gtk catches the null handler and asserts:

(process:76291): GLib-GObject-CRITICAL **: 15:31:39.822: g_signal_connect_data: assertion 'c_handler != NULL' failed

Seems to indicate that it is not necessarily the callback getting nulled out, since it should catch that. Bad memory location? Already collected and not honoring our attempts to keep the handler referenced?

headius commented 2 years ago

This investigation is hampered by the fact that it seems the g_closure_marshal_VOID__VOID function is generated code. Might need to loop in someone more familiar with GTK internals to get a good picture of what is happening here.

We have not had other reports of callbacks leading to SEGV so I am left speculating why this function seems to be getting a bad pointer.

headius commented 2 years ago

Disabling jnr-ffi's x86_64 ASM generation does not appear to improve the situation, assuming it is being passed through.

However... I looked closer at the error dumps and I'm seeing RAX set to this implausible value:

RAX=0xcafebabe778d1062 is an unknown value

Unknown indeed. The hex cafebabe is used as the first four bytes of the Java .class format, but as far as I know it should not appear in any pointer references in memory. So this seems to be passing along some bogus data.
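
To make that observation concrete, here is a minimal sketch that splits the reported RAX value into its two 32-bit halves and compares the upper half with the class-file magic number. The class and method names are made up for illustration and are not part of jnr-ffi; only the RAX value itself comes from the crash log.

```java
// Hypothetical helper for illustration only; not part of jnr-ffi or the JDK.
final class RaxInspector {
    // Magic number stored in the first four bytes of every Java .class file.
    static final int CLASS_FILE_MAGIC = 0xCAFEBABE;

    // Upper 32 bits of a 64-bit register value.
    static int highWord(long value) {
        return (int) (value >>> 32);
    }

    // Lower 32 bits of a 64-bit register value.
    static int lowWord(long value) {
        return (int) value;
    }

    // True when the upper half of the value equals the class-file magic.
    static boolean startsWithClassMagic(long value) {
        return highWord(value) == CLASS_FILE_MAGIC;
    }

    public static void main(String[] args) {
        // RAX as reported in the hs_err log quoted above.
        long rax = 0xcafebabe778d1062L;
        System.out.printf("high=0x%08x low=0x%08x magic=%b%n",
                highWord(rax), lowWord(rax), startsWithClassMagic(rax));
    }
}
```

A value whose upper half matches the magic exactly is unlikely to be a coincidence, which suggests it was constructed deliberately somewhere rather than being a corrupted address.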

headius commented 2 years ago

This seems to be the source of the bogus pointer value:

https://github.com/jnr/jnr-ffi/blob/24c0aed3fa55117cef9311ab18881f135961bc2b/src/main/java/jnr/ffi/provider/DefaultObjectReferenceManager.java#L65-L67

I believe this would indicate that either the DefaultObjectReferenceManager is not working properly, or this code is not using it properly.

headius commented 2 years ago

@praj-foss Ok, this may be a flaw in how you are using the API, but I do not know enough about GTK to be certain.

I modified your final code to not use the pointer value returned, and it seems to get much further... far enough to trigger a different, probably MacOS-specific error:

diff --git a/gtk3/src/main/java/in/praj/demo/Gtk3App.java b/gtk3/src/main/java/in/praj/demo/Gtk3App.java
index 8195466..eaaf284 100644
--- a/gtk3/src/main/java/in/praj/demo/Gtk3App.java
+++ b/gtk3/src/main/java/in/praj/demo/Gtk3App.java
@@ -23,17 +23,18 @@ public class Gtk3App {
                 lib.gtk_get_major_version(), lib.gtk_get_minor_version(), lib.gtk_get_micro_version());

         var application = lib.gtk_application_new("in.praj.demo.Gtk3App", 0);
-        var onActivate = refs.add((LibGtk3.GCallback) (gobject, data) -> {
+        LibGtk3.GCallback callback = (gobject, data) -> {
             var window = lib.gtk_application_window_new(gobject);
             var button = lib.gtk_button_new_with_label("Click me");
             lib.gtk_container_add(window, button);
             lib.gtk_widget_show_all(window);
-        });
+        };
+        var callbackKey = refs.add(callback);

-        lib.g_signal_connect_data(application, "activate", onActivate, null, null, 0);
+        lib.g_signal_connect_data(application, "activate", callback, null, null, 0);
         lib.g_application_run(application, 0, null);

-        refs.remove(onActivate);
+        refs.remove(callbackKey);
         lib.g_object_unref(application);
     }
 }
diff --git a/gtk3/src/main/java/in/praj/demo/LibGtk3.java b/gtk3/src/main/java/in/praj/demo/LibGtk3.java
index 72e3e3a..1c5f7ab 100644
--- a/gtk3/src/main/java/in/praj/demo/LibGtk3.java
+++ b/gtk3/src/main/java/in/praj/demo/LibGtk3.java
@@ -13,7 +13,7 @@ public interface LibGtk3 {
     @u_int64_t long g_signal_connect_data(
             Pointer instance,
             String detailed_signal,
-            Pointer c_handler,
+            GCallback c_handler,
             Pointer data,
             Pointer destroy_data,
             int connect_flags);

> Task :gtk3:run FAILED
GTK version: 3.24.30
2021-11-22 19:46:32.589 java[81207:10304347] WARNING: NSWindow drag regions should only be invalidated on the Main Thread! This will throw an exception in the future. Called from (
        0   AppKit                              0x00007fff22d96ed1 -[NSWindow(NSWindow_Theme) _postWindowNeedsToResetDragMarginsUnlessPostingDisabled] + 352
        1   AppKit                              0x00007fff22d81aa2 -[NSWindow _initContent:styleMask:backing:defer:contentView:] + 1296
        2   AppKit                              0x00007fff22d8158b -[NSWindow initWithContentRect:styleMask:backing:defer:] + 42
        3   AppKit                              0x00007fff2308b83c -[NSWindow initWithContentRect:styleMask:backing:defer:screen:] + 52
        4   libgdk-3.0.dylib                    0x00000001026da4bb -[GdkQuartzNSWindow initWithContentRect:styleMask:backing:defer:screen:] + 59
        5   libgdk-3.0.dylib                    0x00000001026e7479 _gdk_quartz_display_create_window_impl + 1225
        6   libgdk-3.0.dylib                    0x00000001026c52ef gdk_window_new + 959
        7   libgtk-3.0.dylib                    0x000000012d202052 gtk_window_realize + 1010
        8   libgtk-3.0.dylib                    0x000000012cf38ec0 gtk_application_window_real_realize + 96
        9   libgobject-2.0.0.dylib              0x000000010276a325 _g_closure_invoke_va + 309
        10  libgobject-2.0.0.dylib              0x0000000102781202 g_signal_emit_valist + 1266
        11  libgobject-2.0.0.dylib              0x0000000102781d22 g_signal_emit + 130
        12  libgtk-3.0.dylib                    0x000000012d1de603 gtk_widget_realize + 291
        13  libgtk-3.0.dylib                    0x000000012d201641 gtk_window_show + 81
        14  libgobject-2.0.0.dylib              0x000000010276a096 g_closure_invoke + 278
        15  libgobject-2.0.0.dylib              0x0000000102780346 signal_emit_unlocked_R + 1110
        16  libgobject-2.0.0.dylib              0x000000010278181e g_signal_emit_valist + 2830
        17  libgobject-2.0.0.dylib              0x0000000102781d22 g_signal_emit + 130
        18  libgtk-3.0.dylib                    0x000000012d1ddd64 gtk_widget_show + 212
        19  ???                                 0x00000001027fd1e3 0x0 + 4336898531
)
2021-11-22 19:46:32.601 java[81207:10304347] *** Assertion failure in BOOL NSScreenConfigurationInvalidateIfNeededForReason(_NSScreenConfigurationUpdateReason)(), NSScreenConfiguration.m:464
2021-11-22 19:46:32.632 java[81207:10304347] *** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: 'NSScreen reconfig must only happen on the main thread.'
*** First throw call stack:
(
        0   CoreFoundation                      0x00007fff205df1db __exceptionPreprocess + 242
        1   libobjc.A.dylib                     0x00007fff20318d92 objc_exception_throw + 48
        2   CoreFoundation                      0x00007fff20608352 +[NSException raise:format:arguments:] + 88
        3   Foundation                          0x00007fff214042ec -[NSAssertionHandler handleFailureInFunction:file:lineNumber:description:] + 166
        4   AppKit                              0x00007fff22efaae5 +[_NSScreenConfiguration invalidateConfigurationIfNeededForReason:] + 309
        5   AppKit                              0x00007fff22efa8e9 _NSApplicationInvalidateScreenConfigurationIfNeeded + 173
        6   AppKit                              0x00007fff22efa7f6 -[NSApplication(ScreenHandling) _reactToDockChanged] + 130
        7   AppKit                              0x00007fff22efa05b _NSCGSDockMessageReceive + 268
        8   HIToolbox                           0x00007fff287d1bb6 _ZL12DockCallbackjjPvS_ + 1987
        9   HIServices                          0x00007fff257fa1ee dockClientNotificationProc + 217
        10  SkyLight                            0x00007fff24d14e15 _ZN12_GLOBAL__N_123notify_datagram_handlerEj15CGSDatagramTypePvmS1_ + 1071
        11  SkyLight                            0x00007fff24d13018 CGSSnarfAndDispatchDatagrams + 716
        12  SkyLight                            0x00007fff24fb2e46 SLSGetNextEventRecordInternal + 278
        13  SkyLight                            0x00007fff24e08cf5 SLEventCreateNextEvent + 9
        14  HIToolbox                           0x00007fff287b7a4f _ZL38PullEventsFromWindowServerOnConnectionjhP17__CFMachPortBoost + 45
        15  HIToolbox                           0x00007fff287c3faf FlushSpecificEventsFromQueue + 52
        16  AppKit                              0x00007fff22d6b6e4 +[NSEvent _discardTrackingAndCursorEventsIfNeeded] + 459
        17  AppKit                              0x00007fff22d6a442 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 81
        18  libgdk-3.0.dylib                    0x00000001026e23ea poll_func + 186
        19  libglib-2.0.0.dylib                 0x000000012d6ca361 g_main_context_iterate + 433
        20  libglib-2.0.0.dylib                 0x000000012d6ca466 g_main_context_iteration + 102
        21  libgio-2.0.0.dylib                  0x000000012d85ef5d g_application_run + 541
        22  ???                                 0x00000001027fd0d9 0x0 + 4336898265
        23  ???                                 0x000000011576c6c0 0x0 + 4655072960
        24  ???                                 0x000000011576c705 0x0 + 4655073029
        25  ???                                 0x0000000115763849 0x0 + 4655036489
        26  libjvm.dylib                        0x0000000106bb22fb _ZN9JavaCalls11call_helperEP9JavaValueRK12methodHandleP17JavaCallArgumentsP6Thread + 637
        27  libjvm.dylib                        0x0000000106bf4335 _ZL17jni_invoke_staticP7JNIEnv_P9JavaValueP8_jobject11JNICallTypeP10_jmethodIDP18JNI_ArgumentPusherP6Thread + 290
        28  libjvm.dylib                        0x0000000106bf710e jni_CallStaticVoidMethod + 383
        29  java                                0x00000001022a5bac JavaMain + 2732
        30  libsystem_pthread.dylib             0x00007fff2046d8fc _pthread_start + 224
        31  libsystem_pthread.dylib             0x00007fff20469443 thread_start + 15
)
libc++abi: terminating with uncaught exception of type NSException

From the very little I know about GUI development on MacOS, this appears to be a problem further down the pipeline when it attempts to actually display something.

Perhaps you can try my diff on Linux and see if it works better?

I believe the value returned by the DefaultObjectReferenceManager is intended to just be an opaque reference to the object value, not a new or better pointer to the object in question. In this case, the resulting value is a bogus pointer starting with "0xCAFEBABE" bytes, leading to the peculiar RAX I mentioned above.
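
The distinction can be sketched with a toy registry. This is not the jnr-ffi implementation: ToyReferenceManager is a hypothetical class, and the key layout (a 0xCAFEBABE tag in the upper 32 bits plus the identity hash in the lower 32) is an assumption based on the value seen in the crash log. The point is only that the key identifies a map entry and is not an address native code can jump to.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for an object reference manager; NOT the jnr-ffi implementation.
final class ToyReferenceManager<T> {
    private final Map<Long, T> references = new ConcurrentHashMap<>();

    // Assumption: the key is a fixed tag in the upper 32 bits plus the identity hash.
    private static long id(Object obj) {
        return (0xCAFEBABEL << 32) | (System.identityHashCode(obj) & 0xFFFFFFFFL);
    }

    // Stores a strong reference and returns an opaque key for later lookup.
    long add(T obj) {
        long key = id(obj);
        // Bump the key if another entry already uses it.
        while (references.putIfAbsent(key, obj) != null) {
            key++;
        }
        return key;
    }

    // Looks up the object registered under the key, or null if absent.
    T get(long key) {
        return references.get(key);
    }

    // Drops the strong reference; returns true if the key was registered.
    boolean remove(long key) {
        return references.remove(key) != null;
    }

    public static void main(String[] args) {
        ToyReferenceManager<Runnable> refs = new ToyReferenceManager<>();
        Runnable callback = () -> System.out.println("activate");

        long key = refs.add(callback);
        // The key only identifies the entry; it is not a callable address.
        System.out.printf("key=0x%016x%n", key);

        // The callback is recovered through the manager, never by calling the key.
        refs.get(key).run();
        refs.remove(key);
    }
}
```

In the patched demo the same split applies: the callback object itself goes to g_signal_connect_data, while the key returned by refs.add is kept only so that refs.remove can release the reference afterwards.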

praj-foss commented 2 years ago

So I tried the diff here on Linux and it does crash differently now: hs_err_pid5098.log. Unfortunately, I'm still pretty inexperienced in both GTK and C/C++, so I couldn't figure out much from the logs. I do believe it has something to do with how GTK and GObject-system work internally since the normal way of creating JNR callbacks works fine in simpler use-cases.

I went through the official hello-world example of gtk3 and found that I missed implementing the G_APPLICATION macro, which possibly affects some runtime behaviour and might cause the issue:

app = gtk_application_new ("org.gtk.example", G_APPLICATION_FLAGS_NONE);
g_signal_connect (app, "activate", G_CALLBACK (activate), NULL);
status = g_application_run (G_APPLICATION (app), argc, argv);
g_object_unref (app);

From the docs:

PREFIX_OBJECT (obj), which returns a pointer of type PrefixObject. This macro is used to enforce static type safety by doing explicit casts wherever needed. It also enforces dynamic type safety by doing runtime checks.

I'll look into that soon and post an update.

praj-foss commented 2 years ago

I used the preprocessor output from gcc and added the necessary functions in LibGtk3. This still changes nothing apparently, and the program crashes just like before. I've pushed the latest changes in the demo repo.

// Before preprocessing
int status = g_application_run(G_APPLICATION(app), argc, argv);

// After preprocessing
int status = g_application_run(((((GApplication*) g_type_check_instance_cast ((GTypeInstance*) ((app)), ((g_application_get_type ())) )))), argc, argv);

public interface LibGtk3 {
    // ...
    @u_int64_t long g_application_get_type();
    Pointer g_type_check_instance_cast(Pointer inst, @u_int64_t long type);
}

// Inside main method
lib.g_application_run(
        lib.g_type_check_instance_cast(application, lib.g_application_get_type()), 0, null);

Now I'm pretty much clueless. The only thing I've not implemented is the pointer type-casting done by the macros, as I'm using the normal Pointer class as the input/return type in my interface. But I'm not sure that's supposed to make any difference, since these are mostly opaque pointers.

enebo commented 2 years ago

I hate to chime in with this, but WFM (works for me). With @headius's diff applied, gtk3:run works for me on:

openjdk version "16.0.2" 2021-07-20
OpenJDK Runtime Environment Temurin-16.0.2+7 (build 16.0.2+7)
OpenJDK 64-Bit Server VM Temurin-16.0.2+7 (build 16.0.2+7, mixed mode, sharing)

I get a Click me button in a frame popping up on my screen.

I also got this to work with GraalVM CE 21.2 (openjdk version "11.0.12" 2021-07-20). I am on Fedora Core 34.

@praj-foss Can you do two things: 1) update to latest version of graalvm. Let's just hope there is a bug in graal that was fixed. 2) Install openjdk and verify it fails on that VM.

enebo commented 2 years ago

@praj-foss I had not noticed that 21.3 is out, so I will get that and see if it also works.

headius commented 2 years ago

I have pushed a branch with my change, which has been confirmed on @enebo's Fedora system and my MacOS system (the latter works after passing -XstartOnFirstThread).

https://github.com/headius/jnr-demo/tree/patched

At this point I don't see any bug on the jnr-ffi side. @praj-foss let us know if you are still unable to run this and we'll have a look at your latest error.

enebo commented 2 years ago

I also downloaded graal ce 21.1.0 and it works with @headius patch.

praj-foss commented 2 years ago

@headius @enebo I downloaded GraalVM 21.3 (JDK 11) and Temurin JDK 16.0.2 and tried to run the patched repo, but it still crashes the same way: hs_err_pid6764.log. I even tried the -XstartOnFirstThread arg. It's pretty clear now that something's wrong with my setup, but I don't have a clue where the fault lies. So I guess it's okay to close the issue now. I'll try a system upgrade, and maybe run it on gtk4, and let you know how that goes. Can you suggest what else might fix this?

praj-foss commented 2 years ago

I tried running the app on two different machines: one with Ubuntu 21.10 and OpenJDK 17, where it crashed similarly, and another with openSUSE Leap 15.2, OpenJDK 11 and a slightly older gtk3 release, where it ran perfectly. I'm assuming something breaks on the new gtk3 release. So I'll close this issue for now. Thanks, everyone!

headius commented 2 years ago

@praj-foss Thanks for following up and figuring this out! Please let us know if you file an issue with the GTK folks because I'd like to know that we're not doing anything wrong. I assume they will have better luck investigating why it crashes at that particular point.

praj-foss commented 2 years ago

@headius Sure! I'd like to do some more research on it though I'm not a C/C++ dev at all. Can you tell me how to debug the JNR/native calls? I came across this article which described how to use gdb to debug JNI calls. But when I try to use it with my demo I only get warnings like this:

warning: Could not load shared library symbols for /tmp/jffi8423976058172872553.so

So what's the proper way to debug JNR here?

headius commented 2 years ago

For that we would need to build a jffi binary with debug symbols. I'm not sure if the build is set up for that but can look into it this week.

I will say that your crasher that fails inside jffi should probably still be treated as a bug. There may be something about your platforms that jffi is not handling correctly.

headius commented 2 years ago

I believe this diff followed by running ant should get you a jffi binary that has debug symbols:

diff --git a/jni/GNUmakefile b/jni/GNUmakefile
index cfe570a..4a8a061 100755
--- a/jni/GNUmakefile
+++ b/jni/GNUmakefile
@@ -61,7 +61,7 @@ LIBNAME = jffi
 # Compiler/linker flags from:
 #   http://weblogs.java.net/blog/kellyohair/archive/2006/01/compilation_of_1.html
 JFLAGS = -fno-omit-frame-pointer -fno-strict-aliasing -DNDEBUG
-OFLAGS = -O2 $(JFLAGS)
+OFLAGS = -Og -g $(JFLAGS)

 # MacOS headers aren't completely warning free, so turn them off
 WERROR = -Werror

Could you open a new issue for the crash within JFFI itself? I believe this issue has been resolved by fixing the client code, but this other crasher is a new mystery.

praj-foss commented 2 years ago

@headius I've reopened this issue in JFFI. Check out: https://github.com/jnr/jffi/issues/118