visit-dav / visit

VisIt - Visualization and Data Analysis for Mesh-based Scientific Data
https://visit.llnl.gov
BSD 3-Clause "New" or "Revised" License

llnl lassen x11 forwarding to macOS fails to launch due to glew init #19418

Open cyrush opened 6 months ago

cyrush commented 6 months ago

Describe the bug

Lassen install is built with mesagl, here is the crash:

[cyrush@lassen709:~]$ visit -v 3.4.1
Running: gui3.4.1 -forceversion 3.4.1
Running: viewer3.4.1 -forceversion 3.4.1 -geometry 2208x1416+352+0 -borders 22,0,0,0 -shift 0,0 -preshift 0,22 -defer -host 127.0.0.1 -port 5600
Running: mdserver3.4.1 -forceversion 3.4.1 -host 127.0.0.1 -port 5601
2024-03-25 13:37:44.159 (   6.109s) [        121D3A00]vtkOpenGLRenderWindow.c:511    ERR| vtkGenericOpenGLRenderWindow (0x105cb7c0): GLEW could not be initialized: Missing GL version
2024-03-25 13:37:44.160 (   6.109s) [        121D3A00]vtkOpenGLRenderWindow.c:511    ERR| vtkGenericOpenGLRenderWindow (0x105cb7c0): GLEW could not be initialized: Missing GL version
VisIt's viewer exited abnormally! Aborting the Graphical User Interface. VisIt's developers may be reached via our GitHub discussions, https://github.com/visit-dav/visit/discussions

VisIt 3.3.3 works.

Helpful additional information

To Reproduce

Steps to reproduce the behavior. For example:

  1. Log into lassen with ssh -XY on macOS
  2. /usr/gapps/visit/bin/visit -v 3.4.1


cyrush commented 6 months ago

A new glew init issue specific to VTK-9?

cyrush commented 6 months ago

I tried the following:

defaults write org.xquartz.X11 enable_iglx -bool true

And restarted xquartz, but this did not help.

markcmiller86 commented 6 months ago

Can you confirm its operation with...

[scratlantis:bssw.io/Articles/Blog] miller86% xdpyinfo | grep -i glx
    GLX
    SGI-GLX

cyrush commented 6 months ago

regardless of the enable_iglx setting -- GLX is enabled in XQuartz.

[harrison37@]$ cat   ~/Library/Logs/X11/org.xquartz.log
X11.app: main(): argc=2
    argv[0] = /Applications/Utilities/XQuartz.app/Contents/MacOS/X11.bin
    argv[1] = --listenonly
Waiting for startup parameters via Mach IPC.
X11.app: Listening on socket for fd handoff:  (5) /var/tmp/tmp.0.ivRhiy
X11.app: Thread created for handoff.  Returning success to tell caller to connect and push the fd.
X11.app Handing off fd to server thread via DarwinListenOnOpenFD(7)
DarwinListenOnOpenFD: 7
X11.app: do_start_x11_server(): argc=7
    argv[0] = /opt/X11/bin/Xquartz
    argv[1] = :0
    argv[2] = -nolisten
    argv[3] = tcp
    argv[4] = -iglx
    argv[5] = -auth
    argv[6] = /Users/harrison37/.serverauth.9517
[1483439.181] Xquartz starting:
[1483439.181] X.Org X Server 21.1.6
[1483439.183] x: 0, y: 0, w: 2560, h: 1416
[1483439.183] (II) Initializing extension Generic Event Extension
[1483439.183] (II) Initializing extension SHAPE
[1483439.183] (II) Initializing extension MIT-SHM
[1483439.183] (II) Initializing extension XInputExtension
[1483439.184] (II) Initializing extension BIG-REQUESTS
[1483439.184] (II) Initializing extension SYNC
[1483439.184] (II) Initializing extension XKEYBOARD
[1483439.184] (II) Initializing extension XC-MISC
[1483439.184] (II) Initializing extension SECURITY
[1483439.184] (II) Initializing extension XFIXES
[1483439.184] (II) Initializing extension RENDER
[1483439.184] (II) Initializing extension RANDR
[1483439.185] (II) Initializing extension DAMAGE
[1483439.185] (II) Initializing extension MIT-SCREEN-SAVER
[1483439.185] (II) Initializing extension DOUBLE-BUFFER
[1483439.185] (II) Initializing extension Present
[1483439.185] (II) Initializing extension X-Resource
[1483439.185] (II) Initializing extension XVideo
[1483439.185] (II) Initializing extension XVideo-MotionCompensation
[1483439.185] (II) Initializing extension GLX
[1483439.197] (II) GLX: Initialized Core OpenGL GL provider for screen 0
[1483439.247] [mi] mieq: warning: overriding existing handler 0x0 with 0x10040e07c for event 28
[1483439.247] X11.app: DarwinProcessFDAdditionQueue_thread: Sleeping to allow xinitrc to catchup.
[1483439.266] (EE) Error loading keymap /tmp/server-0.xkm
[1483439.266] (EE) XKB: Failed to load keymap. Loading default keymap instead.
[1483440.936] noPseudoramiXExtension=0, pseudoramiXNumScreens=1
[1483442.250] Calling ListenOnOpenFD() for new fd: 7
[1483442.251] getsockopt failed to determine pid of socket 16: Socket is not connected
[1483442.251] getsockopt failed to determine pid of socket 17: Socket is not connected
[1483448.764] noPseudoramiXExtension=0, pseudoramiXNumScreens=1

And I can successfully forward from poodle (llnl cz toss4) to my mac running XQuartz. This issue seems specific to lassen + blueos.

markcmiller86 commented 6 months ago

fyi, -cli -nowin works fine on lassen. Otherwise, I get same issue you do.

markcmiller86 commented 6 months ago

glxgears works fine for me from lassen

markcmiller86 commented 6 months ago

Viewer log is showing a SEGV but a lot of perhaps funky gl and opengl version stuff too.

A.viewer.5.vlog.gz

markcmiller86 commented 6 months ago

regardless of the enable_iglx setting -- GLX is enabled in XQuartz.

I think that is because we requested long ago that they don't deploy XQuartz here at LLNL without that...at least I hope it is.

markcmiller86 commented 6 months ago

Things I've tried with no luck...

env LIBGL_ALWAYS_SOFTWARE=1 /usr/gapps/visit/bin/visit -v 3.4.1 -noconfig

edit ~/.visit/config adding a line to force scalable rendering to always

env MESA_GL_VERSION_OVERRIDE=3.3 MESA_GLSL_VERSION_OVERRIDE=330 /usr/gapps/visit/bin/visit -v 3.4.1

markcmiller86 commented 6 months ago

Also tried

lassen708{miller86}1254: env LIBGL_DEBUG=verbose /usr/gapps/visit/bin/visit -v 3.4.1
Running: gui3.4.1 -forceversion 3.4.1
Running: viewer3.4.1 -forceversion 3.4.1 -geometry 1507x1176+413+0 -borders 22,0,0,0 -shift 0,0 -preshift 0,22 -defer -host 127.0.0.1 -port 5600
Running: mdserver3.4.1 -forceversion 3.4.1 -host 127.0.0.1 -port 5601
function is no-op
function is no-op
function is no-op
function is no-op
.
.
.

cyrush commented 6 months ago

We have rpath issues with our linux execs as well:

readelf  -d  /usr/gapps/visit/3.4.1/linux-intel/bin/viewer 

(Note: linux-intel == ibm p9 + blueos )

0x000000000000000f (RPATH)              Library rpath: [/usr/local/lib:/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/vtk/9.2.6/linux-ppc64le_gcc-8.3/lib64:/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/qt/5.14.2/linux-ppc64le_gcc-8.3/lib:/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/zlib/1.2.13/linux-ppc64le_gcc-8.3/lib:/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/mesagl/17.3.9/linux-ppc64le_gcc-8.3/lib:/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/llvm/6.0.1/linux-ppc64le_gcc-8.3/lib]

It doesn't really hurt to have these in there, but not all users may be able to read these files.
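To see which of those RPATH entries a given user can actually read, something like the following could be run on lassen. This is only a rough sketch: the helper names (`parse_rpath`, `unreadable_entries`, `check_binary`) are made up for illustration, and it assumes the `readelf -d` output uses the `Library rpath: [...]` form shown above.

```python
import os
import re
import subprocess


def parse_rpath(readelf_output):
    """Return the RPATH/RUNPATH directories found in `readelf -d` output."""
    dirs = []
    pattern = r"Library (?:rpath|runpath): \[([^\]]*)\]"
    for match in re.finditer(pattern, readelf_output):
        # Entries are colon-separated; skip empty ones.
        dirs.extend(d for d in match.group(1).split(":") if d)
    return dirs


def unreadable_entries(dirs):
    """Return the directories the current user cannot read and traverse.

    Directories that do not exist are reported as well.
    """
    return [d for d in dirs if not os.access(d, os.R_OK | os.X_OK)]


def check_binary(path):
    """Run `readelf -d` on a binary and list its unreadable RPATH entries."""
    out = subprocess.run(
        ["readelf", "-d", path], capture_output=True, text=True, check=True
    ).stdout
    return unreadable_entries(parse_rpath(out))
```

Calling `check_binary("/usr/gapps/visit/3.4.1/linux-intel/bin/viewer")` as an ordinary user would then list any build-tree directories that user cannot access.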

cyrush commented 5 months ago

3.4.1 client/server will solve this once it is available for macOS; VNC is another option.

To explore further, we could look at strace logs to verify that the correct version of GL is loaded.

cyrush commented 5 months ago

@cyrush see if this happens on a toss4 system

markcmiller86 commented 5 months ago

This is the VTK code that is failing

void vtkOpenGLRenderWindow::OpenGLInitContext()
{
  this->ContextCreationTime.Modified();

  // When a new OpenGL context is created, force an update
  if (!this->Initialized)
  {
#ifdef GLEW_OK
    GLenum result = glewInit();
    this->GlewInitValid = (result == GLEW_OK);
    if (!this->GlewInitValid)
    {
      const char* errorMsg = reinterpret_cast<const char*>(glewGetErrorString(result));
      vtkErrorMacro("GLEW could not be initialized: " << errorMsg);
      return;
    }

    if (!GLEW_VERSION_3_2 && !GLEW_VERSION_3_1)
    {
      vtkErrorMacro("Unable to find a valid OpenGL 3.2 or later implementation. "
                    "Please update your video card driver to the latest version. "
                    "If you are using Mesa please make sure you have version 11.2 or "
                    "later and make sure your driver in Mesa supports OpenGL 3.2 such "
                    "as llvmpipe or openswr. If you are on windows and using Microsoft "
                    "remote desktop note that it only supports OpenGL 3.2 with nvidia "
                    "quadro cards. You can use other remoting software such as nomachine "
                    "to avoid this issue.");
      return;
    }
#else
    // GLEW is not being used, so avoid false failure on GL checks later.
    this->GlewInitValid = true;
#endif

I am a little confused by the GLEW_VERSION_ symbols because glew library versions currently go up to only 2.1. And, the subsequent error message is actually referring to GL versions which are indeed at 3.2.

I just looked at strace logs from VisIt and glxgears. Nothing stands out.

strace_visit.txt strace_glxgears.txt

lassen708{miller86}1366: grep libglapi *.out | grep -v NOENT
strace_glxgears.out:open("/usr/lib64/libglapi.so.0", O_RDONLY|O_CLOEXEC) = 4
strace_visit.out:1714656482.809508 open("/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/mesagl/17.3.9/linux-ppc64le_gcc-8.3/lib/libglapi.so.0", O_RDONLY|O_CLOEXEC) = 0 <0.000274>

I am a little surprised we're loading libs from the build point (e.g. /usr/workspace/wsa/visit) instead of the install point (e.g. /usr/gapps/visit). In fact, I see that is happening with a ton of the VTK libs.
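To put a number on how many libraries come from the build point versus the install point, the strace output could be tallied roughly as below. It is only a sketch: it assumes the `open(...)` / `openat(...)` line format seen in the grep output above, and the function name and the prefixes passed to it are illustrative choices, not anything VisIt provides.

```python
import re
from collections import Counter

# Matches successful open()/openat() calls on shared libraries, e.g.
#   open("/usr/lib64/libglapi.so.0", O_RDONLY|O_CLOEXEC) = 4
# Failed calls return -1 (for example ENOENT) and do not match (\d+).
OPEN_RE = re.compile(
    r'open(?:at)?\((?:[^,"]+,\s*)?"([^"]+\.so[^"]*)",[^)]*\)\s*=\s*(\d+)'
)


def classify_loaded_libs(strace_lines, build_prefix, install_prefix):
    """Count successfully opened shared libraries by where they live."""
    counts = Counter()
    for line in strace_lines:
        match = OPEN_RE.search(line)
        if not match:
            continue
        path = match.group(1)
        if path.startswith(build_prefix):
            counts["build"] += 1
        elif path.startswith(install_prefix):
            counts["install"] += 1
        else:
            counts["other"] += 1
    return dict(counts)
```

For example, `classify_loaded_libs(open("strace_visit.out"), "/usr/workspace/wsa/visit", "/usr/gapps/visit")` would summarize the VisIt trace using the two prefixes mentioned above.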

I tried to LD_PRELOAD a different GL library. The problem is definitely in VTK.

So, I looked at VTK build...

/usr/workspace/wsa/visit/visit/thirdparty_shared/3.4.1/blueos/vtk/9.2.6/linux-ppc64le_gcc-8.3/bin
lassen708{miller86}1332: ./vtkProbeOpenGLVersion-9.2
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
2024-05-02 06:33:51.619 (   0.944s) [         45C7A00]vtkOpenGLRenderWindow.c:524    ERR| vtkXOpenGLRenderWindow (0x100749b0):
Unable to find a valid OpenGL 3.2 or later implementation. Please update your video card driver to the latest version.
If you are using Mesa please make sure you have version 11.2 or later and make sure your driver in Mesa supports OpenGL 3.2 such as llvmpipe or openswr.
If you are on windows and using Microsoft remote desktop note that it only supports OpenGL 3.2 with nvidia quadro cards.
You can use other remoting software such as nomachine to avoid this issue.
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast

cyrush commented 5 months ago

blueOS is unique in that I think it may have a system OpenGL. I suspect VTK is picking up traces of this, and that is at odds with Mesa GL. Still surprised that it is only triggered on macOS (the same connection via VNC seems fine).