jzy3d / jogl

Mirror of https://jogamp.org/cgit/jogl.git/

Hang on macOS #28

Open jzy3d opened 1 year ago

jzy3d commented 1 year ago

Initially discussed here by Manu

Problem

The Java3D feature now works only about half the time on Apple Silicon. Sometimes it works, sometimes it does not, and there is no error message: it simply hangs, roughly 50% of the time. The bug occurs only with the native Retina display, not with an external 5K display.
A few users reported that Sweet Home 3D could hang too with JOGL v2.4.0-rc-20210111, but I haven't found yet where this deadlock could come from. It's probably bound to the modifications for macOS made last year (see this commit and this one). I was wondering if synchronizing on another existing lock (like the one returned by [Component#getTreeLock](https://docs.oracle.com/javase/1.5.0/docs/api/java/awt/Component.html#getTreeLock())) would help, but I haven't tried it yet (and my suggestion might be just stupid or nonsense).

Solution

Now that the hanging issue can be reproduced, it's much easier to find where it occurs in Sweet Home 3D and other JOGL programs. First, note that this issue also happens when the free version of the Rectangle application is running.

After a few tests, I found that it happened when a Canvas3D instance is removed from the hierarchy of its container. With this in mind, I let the GC handle the container hierarchy cleanup automatically, but this only delayed the bug, which eventually happened later. It looked like a patch in Sweet Home 3D wouldn't be enough, and that some changes were required in JOGL itself.

Then, after a few refinements in the part of the jogamp.opengl.macosx.cgl.MacOSXCGLContext class which handles GL layer detachment and destruction, I found that calling OSXUtil.RunOnMainThread in NSOpenGLImpl#release without waiting for its completion (first parameter set to false instead of true) would fix the issue. Eureka!

But when testing under older macOS 10.9 and 10.13, I experienced a similar hanging bug when setting the first parameter of OSXUtil.RunOnMainThread to false, even when BetterSnapTool or Rectangle were not running! As the recent changes in the MacOSXCGLContext class were mainly programmed for macOS 10.15 (see bug #1398), I propose for the moment to keep the wait parameter set to true only for macOS versions < 10.15. (Note that I also tried to simply skip OSXUtil.RunOnMainThread for old macOS versions, because the call to CGL.setContextView(ctx, 0) that it makes didn't exist before the fix of bug #1398, but this didn't work.) This won't solve the hanging issue under older macOS versions, but at least we can ask users to quit BetterSnapTool, Rectangle and the like under these macOS versions (or upgrade their system if they can), until we find a better solution.

Therefore, the current proposed change is to replace the statement:

OSXUtil.RunOnMainThread(true /* wait */, true /* kickNSApp */, new Runnable() {
    @Override
    public void run() {
        CGL.setContextView(ctx, 0);
    } } );

by (for your information, 10.16 version number is returned by Java 8):

boolean wait = System.getProperty("os.version").startsWith("10.")
    && !System.getProperty("os.version").startsWith("10.15")
    && !System.getProperty("os.version").startsWith("10.16");
OSXUtil.RunOnMainThread(wait /* wait */, true /* kickNSApp */, new Runnable() {
    @Override
    public void run() {
        CGL.setContextView(ctx, 0);
    } } );

You can test this solution with the modifications made to jogl-all.jar file available in the ZIP file jogl-all-2.4.0-rc-20221117.zip and also in SweetHome3D 7.0.2c where you can now import furniture without the hanging issue. I tested it under macOS 10.9, 10.13.6, 12.6.1 Intel, 13.0 ARM and will test it under other macOS versions in the coming days.

Next

I ran more tests this morning, and I confirm that setting the first parameter of OSXUtil.RunOnMainThread to false also worked in Sweet Home 3D for macOS 10.15, macOS 13 Intel and even macOS 10.14, which is very good news because it's the last macOS version that supported 32-bit applications, which some people may be obliged to keep. Therefore, this solution works for macOS versions from 10.14 to 13, but the test I proposed to add in NSOpenGLImpl#release must include 10.14 too. The final proposed change is to replace the following statement in the jogamp.opengl.macosx.cgl.MacOSXCGLContext class:

OSXUtil.RunOnMainThread(true /* wait */, true /* kickNSApp */, new Runnable() {
    @Override
    public void run() {
        CGL.setContextView(ctx, 0);
    } } );

by:

String osVersion = System.getProperty("os.version");
boolean wait = osVersion.startsWith("10.")
    && !osVersion.startsWith("10.14")
    && !osVersion.startsWith("10.15")
    && !osVersion.startsWith("10.16");
OSXUtil.RunOnMainThread(wait /* wait */, true /* kickNSApp */, new Runnable() {
    @Override
    public void run() {
        CGL.setContextView(ctx, 0);
    } } );

This solution is programmed in the jogl-all.jar file available in the ZIP file jogl-all-2.4.0-rc-20221118.zip and also in SweetHome3D 7.0.2d where you can now import furniture without the hanging issue. I also changed the Implementation-Version value to 2.4.0-rc-20221118 in the MANIFEST.MF file of jogl-all.jar to avoid any confusion.
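
For anyone who wants to check the version test without a Mac at hand, the condition can be pulled out into a small helper. This is only a sketch: the class name MacOsWaitPolicy and the method shouldWait are hypothetical (they do not exist in JOGL), and the null check is an addition for illustration. The startsWith logic is the same as in the proposed change.

```java
// Hypothetical helper, not part of JOGL: mirrors the proposed os.version test.
final class MacOsWaitPolicy {
    private MacOsWaitPolicy() {}

    /**
     * Returns true when the "wait" flag should stay true, i.e. for
     * macOS 10.x versions older than 10.14. Returns false for 10.14,
     * 10.15, 10.16 (reported by Java 8 on macOS 11+) and for 11 and later.
     */
    static boolean shouldWait(String osVersion) {
        if (osVersion == null || !osVersion.startsWith("10.")) {
            return false; // assumption: unknown or newer versions do not wait
        }
        return !osVersion.startsWith("10.14")
            && !osVersion.startsWith("10.15")
            && !osVersion.startsWith("10.16");
    }

    public static void main(String[] args) {
        String osVersion = System.getProperty("os.version");
        System.out.println(osVersion + " -> wait = " + shouldWait(osVersion));
    }
}
```

The result of shouldWait(System.getProperty("os.version")) would then be passed as the first argument of OSXUtil.RunOnMainThread.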

mbastian commented 1 year ago

Hi folks, I have persistent issues in Gephi with macOS hangs, and I'm wondering why that might be the case for us but not for Sweet Home 3D or other JOGL applications. I'm using the same JOGL versions (rc-4 + this 20221118 hotfix). In any event, as the root cause is unknown, this is, and probably will continue to be, a big issue.

The hang always happens in a similar stack trace that involves RunOnMainThread. The originating call might not always be the same, but the most common one seems to be CreateNSWindow at initialization.

"AWT-EventQueue-0" #23 prio=6 os_prio=31 cpu=957.36ms elapsed=53.53s tid=0x0000000130cc7000 nid=0x11503 in Object.wait()  [0x000000029c068000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@11.0.17/Native Method)
    - waiting on <no object reference available>
    at java.lang.Object.wait(java.base@11.0.17/Unknown Source)
    at jogamp.nativewindow.macosx.OSXUtil.RunOnMainThread(OSXUtil.java:318)
    - waiting to re-lock in wait() <0x00000003fe38d638> (a java.lang.Object)
    at jogamp.nativewindow.macosx.OSXUtil.CreateNSWindow(OSXUtil.java:161)
    at jogamp.nativewindow.jawt.macosx.MacOSXJAWTWindow.lockSurfaceImpl(MacOSXJAWTWindow.java:319)
    at com.jogamp.nativewindow.awt.JAWTWindow.lockSurface(JAWTWindow.java:677)
    at com.jogamp.opengl.awt.GLCanvas.createJAWTDrawableAndContext(GLCanvas.java:718)
    at com.jogamp.opengl.awt.GLCanvas.addNotify(GLCanvas.java:621)
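
To collect this kind of evidence from users without asking them for a full thread dump, a small probe could list the threads that currently have a given method on their stack. This is a sketch only: HangProbe and threadsInside are hypothetical names, and the code uses nothing but the standard Thread.getAllStackTraces() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper, not part of JOGL or Gephi.
final class HangProbe {
    private HangProbe() {}

    /** Returns the names of live threads whose stack contains className.methodName. */
    static List<String> threadsInside(String className, String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().equals(className)
                        && frame.getMethodName().equals(methodName)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame per thread is enough
                }
            }
        }
        return names;
    }
}
```

Called from a watchdog thread as HangProbe.threadsInside("jogamp.nativewindow.macosx.OSXUtil", "RunOnMainThread"), it would report "AWT-EventQueue-0" for a trace like the one above.
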
jzy3d commented 1 year ago

Hi @mbastian

By reading the Gephi discussion I understand that

the MacOSXCGLContext patch suggested by Emmanuel does not consider version 12 as a target for changing the blocking condition of OSXUtil, nor does it check whether the CPU is M1 or Intel. It is not clear to me, but maybe we are in an uncovered case.

Do not forget that Emmanuel encountered such a hang when a Java application is run concurrently with other native applications (see discussion above).

The RunOnMainThread method discussed above has an additional flag named kickNSApp that I presume would try to kick a native application already using the macOS main thread. It would be worth investigating the effect of changing it.

I use this OSXUtil method in PanamaGL as well, so I'll try to dig into the reasons for the deadlock (which also occurs when I keep wait set to true on 10.15 + Intel, without using any other JOGL class).

It would be helpful to discuss this issue with Apple developers to verify what may be wrong with the RunOnMainThread approach.

mbastian commented 1 year ago

Thanks @jzy3d for the additional context. As you may have read in the thread, I sent users a questionnaire about this hang issue :) Although it is based on the 0.9.7 version, which is prior to my fix attempt, it showed me that the hang is encountered in the wild on various macOS versions. See the results below (N=16):

Screenshot 2023-01-01 at 18 10 21

I also know that most users experience it at startup. I've never been able to reproduce this situation myself. I only got it to hang when switching perspective in Gephi, which is equivalent to hiding and then re-showing the GLCanvas.

Correct me if I'm wrong, but I thought Emmanuel's patch was purely focused on the release call. Here it looks like the same hang, but happening on other calls like CreateNSWindow. It's still possible that these are completely separate issues, but it's suspicious enough to me. In any event, the patch from Emmanuel wouldn't help at this point.

> Do not forget that Emmanuel encountered such a hang when a Java application is run concurrently with other native applications (see discussion above).

> The RunOnMainThread method discussed above has an additional flag named kickNSApp that I presume would try to kick a native application already using the macOS main thread. It would be worth investigating the effect of changing it.

Ok I think I've somehow missed that. I'll look into that more closely in my debugging.

jzy3d commented 1 year ago

That's great to have such statistics!

> Correct me if I'm wrong, but I thought Emmanuel's patch was purely focused on the release call. Here it looks like the same hang, but happening on other calls like CreateNSWindow. It's still possible that these are completely separate issues, but it's suspicious enough to me. In any event, the patch from Emmanuel wouldn't help at this point.

Yes, you are right.

It is worth getting to know OSXmisc, the macOS native source file backing OSXUtil.java. Maybe from there we can identify the AppKit functions involved and read their documentation to verify whether they are properly used.

I notified Emmanuel about this discussion; he will probably have good advice.

jzy3d commented 1 year ago

Hi @mbastian ,

Here is an idea: the common point between the initialization and the deletion of the context is the use of RunOnMainThread. Both cases are initially triggered with the wait flag set to true, meaning both methods will block until the macOS main thread completes the task.
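
To make the effect of the wait flag concrete, here is a toy model in plain Java. It is an analogy under loud assumptions, not JOGL code and not a confirmed diagnosis: a single-thread executor stands in for the macOS main thread, a ReentrantLock stands in for whatever resource both sides might need, and all names (MainThreadWaitDemo, callerHangs) are hypothetical. The real root cause is still unknown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: shows how a blocking dispatch can hang while a
// fire-and-forget dispatch returns immediately.
final class MainThreadWaitDemo {
    private MainThreadWaitDemo() {}

    /** Returns true if the caller would still be stuck after 500 ms. */
    static boolean callerHangs(boolean wait) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor(); // stand-in for the macOS main thread
        ReentrantLock sharedLock = new ReentrantLock();                   // stand-in for a shared resource
        CountDownLatch mainBusy = new CountDownLatch(1);
        sharedLock.lock(); // the caller (think: the AWT thread) holds the resource
        try {
            // The "main thread" is already busy with work that needs the same resource.
            mainThread.submit(() -> {
                mainBusy.countDown();
                sharedLock.lock();
                sharedLock.unlock();
            });
            mainBusy.await();
            // The dispatched task (think: CGL.setContextView(ctx, 0)) queues behind that work.
            Future<?> task = mainThread.submit(() -> { });
            if (!wait) {
                return false; // wait == false: return at once, the task runs later
            }
            task.get(500, TimeUnit.MILLISECONDS); // wait == true: block until the task completes
            return false;
        } catch (TimeoutException e) {
            return true; // without the timeout, this would be a permanent hang
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            sharedLock.unlock();   // releasing the resource lets the queued work finish
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("wait=true  hangs: " + callerHangs(true));
        System.out.println("wait=false hangs: " + callerHangs(false));
    }
}
```

In this model the non-blocking variant cannot hang the caller, which matches the observation that passing false avoided the hang in the release call. It says nothing about why the main thread is busy in the first place.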

What you could try is to apply here the patch that Emmanuel applied here. The patch is as follows (NB: I am pointing to the location of Emmanuel's patch, but none of us has actually committed the change here):

String osVersion = System.getProperty("os.version");
boolean wait = osVersion.startsWith("10.") 
    && !osVersion.startsWith("10.14") 
    && !osVersion.startsWith("10.15") 
    && !osVersion.startsWith("10.16");
OSXUtil.RunOnMainThread(wait /* wait */, true /* kickNSApp */, new Runnable() {
    @Override
    public void run() {
        [...]
    } } );

The hard point then becomes to customize JOGL without rebuilding everything. Let me know if you can achieve this!

jzy3d commented 1 year ago

Hi @mbastian ,

Another thought about this problem: in JOGL, OSXUtil allows dispatching OpenGL queries to the macOS main thread. This spares users from having to run their program with the -XStartOnMainThread option.

Since the CreateNSWindow method also makes use of OSXUtil.RunOnMainThread, we can assume that the thread dispatch may be a culprit, and try to avoid the dispatch that causes trouble for Emmanuel.

For Gephi users experiencing a hang, I would try to run with the VM arg -XStartOnMainThread on macOS. In this case, the app will be initialized on the macOS main thread, and the OSXUtil.RunOnMainThread method will simply execute tasks without invoking the suspicious Objective-C API.

mbastian commented 1 year ago

Thanks @jzy3d, for now I haven't been looking at changing JOGL code but rather see if I'm doing something special/wrong or find a workaround. It's still an option but higher effort as I would need to get familiar with that process, which doesn't look easy. Obviously if everything else fails I would need to get started on that :)

> For Gephi users experiencing a hang, I would try to run with the VM arg -XStartOnMainThread on macOS. In this case, the app will be initialized on the macOS main thread, and the OSXUtil.RunOnMainThread method will simply execute tasks without invoking the suspicious Objective-C API.

I did find this as well and tried it, but in that case my application didn't even boot, so I abandoned it. But I can try again. I thought this was reserved for SWT apps, but I'm on Swing / NetBeans Platform.

jzy3d commented 1 year ago

@mbastian I am sorry, the exact VM arg is -XstartOnFirstThread, but you probably found it?

mbastian commented 1 year ago

> @mbastian I am sorry, the exact VM arg is -XstartOnFirstThread, but you probably found it?

Yes, that's what I'm using. I do find reports on the web that Swing + this VM arg isn't working but I'm digging deeper.

jzy3d commented 1 year ago

After giving it a try on a Panama stub, I also noticed that this arg makes Swing hang. Removing SwingUtilities.invokeLater helps to get further when this arg is present, but it did not completely solve the problem of the window not being displayed.

mbastian commented 1 year ago

> After giving it a try on a Panama stub, I also noticed that this arg makes Swing hang. Removing SwingUtilities.invokeLater helps to get further when this arg is present, but it did not completely solve the problem of the window not being displayed.

Yes, I gave up after reading a bit and seeing that this isn't gonna work with my full Swing application. I searched Netbeans repo for this parameter and found zero hits, which probably means it's a stupid thing to do...

EDIT: Found this: