libretro / mupen64plus-libretro-nx

Improved mupen64plus libretro core reimplementation
GNU General Public License v2.0

libmali requires GL_USE_DLSYM hack #288

Open asdf288 opened 3 years ago

asdf288 commented 3 years ago

Hardware: armhf Amlogic S905x
GPU: ARM Mali 450
OS: Lakka
Video driver: libmali OpenGL driver, wayland variant (Retroarch in KMS mode)

This commit from October 2019 makes the core crash with a segmentation fault when using the ARM Mali video drivers on my system. Running Retroarch in debug mode reveals a null pointer dereference. Reverting this specific commit and recompiling the core from git master makes it work again.

Other libretro cores that require OpenGL support (such as Parallel-N64) work fine. Using the Mesa3D OpenGL driver instead of libmali resolves the issue, but this driver is much slower.

It would be great if anyone could assist me in debugging this issue. I assume that the new way of retrieving the GL functions (glsm_get_proc_address instead of eglGetProcAddress) is incompatible with the libmali driver, but I lack the knowledge to find out what exactly is happening. I'll attach the debug output of Retroarch here, even though it does not seem very helpful to me in this particular case.

asdf288 commented 3 years ago

Here is the error message generated by Retroarch in debug mode:

[INFO] [Playlist]: Loading favorites file: [/storage/.config/retroarch/content_favorites.lpl].
[INFO] [GL]: VSync => on
[INFO] [Environ]: SYSTEM_DIRECTORY: "/tmp/system".
[libretro INFO] mupen64plus: Game controller 0 (Standard controller) has a Memory pak plugged in
[libretro INFO] mupen64plus: Game controller 1 (Standard controller) has nothing plugged in
[libretro INFO] mupen64plus: Game controller 2 (Standard controller) has nothing plugged in
[libretro INFO] mupen64plus: Game controller 3 (Standard controller) has nothing plugged in
[libretro INFO] mupen64plus: Using CIC type X102
[INFO] [Environ]: SYSTEM_DIRECTORY: "/tmp/system".
[INFO] [Environ]: SYSTEM_DIRECTORY: "/tmp/system".
[INFO] [Environ]: SYSTEM_DIRECTORY: "/tmp/system".
[INFO] [Environ]: SYSTEM_DIRECTORY: "/tmp/system".
AddressSanitizer:DEADLYSIGNAL
=================================================================
==908==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 (pc 0x00000000 bp 0x00000000 sp 0xc72915a0 T0)
==908==Hint: pc points to the zero page.
==908==The signal is caused by a READ memory access.
==908==Hint: address points to the zero page.
    #0 0x0  (<unknown module>)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
==908==ABORTING

m4xw commented 3 years ago

Can you check if it takes the early return or if the frontend cb returns null?

This is most likely a frontend ctx driver bug

m4xw commented 3 years ago

If the actual function call returns NULL, then I need to know which symbol. Ideally I will need a trace if it's a NULL GL function pointer that's being called, so I can identify it.
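To make "which symbol" concrete, here is a minimal sketch of how one could list every GL entry point that a proc-address callback fails to resolve. It is an illustration only, not code from the core: `ProcFn`, `GetProcAddress`, `find_missing_symbols` and `fake_frontend_resolver` are hypothetical names, and the fake resolver simply pretends that glGetString is the one missing symbol.

```cpp
#include <string>
#include <vector>

// Hypothetical callback shape: takes a symbol name and returns a generic
// function pointer, or nullptr if the symbol is unknown.
using ProcFn = void (*)();
using GetProcAddress = ProcFn (*)(const char* name);

// Resolve every name through `resolver` and collect the ones that come
// back as nullptr. Calling such a pointer later jumps to address 0.
std::vector<std::string> find_missing_symbols(
    GetProcAddress resolver, const std::vector<std::string>& names)
{
    if (resolver == nullptr) {
        // No callback at all: every symbol is unresolved.
        return names;
    }
    std::vector<std::string> missing;
    for (const std::string& name : names) {
        if (resolver(name.c_str()) == nullptr)
            missing.push_back(name);
    }
    return missing;
}

// Stand-in for a frontend callback (assumption for this example only):
// it knows every symbol except glGetString.
static void dummy_gl_function() {}

ProcFn fake_frontend_resolver(const char* name)
{
    if (std::string(name) == "glGetString")
        return nullptr;
    return &dummy_gl_function;
}
```

Passing the real callback and the list of required GL names to a check like this right after loading would report the unresolved symbols by name instead of crashing later at pc 0x0.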

asdf288 commented 3 years ago

Thank you for the instructions. I have never debugged a Linux program before. Are the steps that I followed correct?

The problem I'm having is that the system freezes as soon as I start the core in Retroarch, so the breakpoint is never reached. I tried without the breakpoint, but it's the same problem. I will try some more and post my results here.

m4xw commented 3 years ago

just:

    gdb retroarch
    r --verbose
    <run as normal, wait till crash>
    info file
    info registers
    bt full

asdf288 commented 3 years ago

Sorry, it took a long time because I had to recompile Retroarch, the debugging symbols were missing. Here is the output of the gdb commands:

https://pastebin.com/7ASeAUre

By stepping through the code, I could see that the glsm_get_proc_address function went through all the symbols without errors. It then went to retroarch.c and there, in line 34612, the crash occurred. At least that's when gdb stopped and I could not issue any further step commands.

I will also try this setup on another single board computer I have that also uses the libmali driver. It might be helpful to see if it happens there, too.

m4xw commented 3 years ago

The glGetString ptr is null, that's bad. Try adding -DGL_USE_DLSYM to COREFLAGS for the platform, then run make clean and rebuild.

asdf288 commented 3 years ago

This led to a compile error saying that RTLD_DEFAULT could not be found. I searched for this identifier and found it defined in dlfcn.h. I added the include directive and recompiled, but the runtime error is exactly the same as before, and the backtrace is identical as far as I can see.

asdf288 commented 3 years ago

Out of curiosity, where in the gdb output did you see that the pointer to glGetString is null? Is it just that symbol or are all pointers null?

I have tried the following:

    strings /usr/lib/libmali.so | grep glGetString

This resulted in two matches, so at least this symbol exists somewhere in the library.

https://www.khronos.org/registry/OpenGL-Refpages/es2.0/xhtml/glGetString.xml

Could it be that some of these string values are simply not set in the driver library?

m4xw commented 3 years ago

#1 0xe4701450 in opengl::FunctionWrapper::wrGetString

Without threading (thr) it will call ptrGetString directly, which is NULL for you

asdf288 commented 3 years ago

I suspected that the driver returns a null value for GL_VERSION. I edited the source code here: https://git.libretro.com/libretro/mupen64plus-libretro-nx/-/blob/6ee49fd23804e15d7fa6d01e6e5b7f15da760056/GLideN64/src/Graphics/OpenGLContext/ThreadedOpenGl/opengl_Wrapper.cpp#L324 to make it return the fixed string "OpenGL ES 2.0". This seemed to work and the other GL strings were there, but then another error appeared in wrGetIntegerv; this time it was GL_MAJOR_VERSION that was missing. I will not follow this any further until I have tested another version of the libmali driver to see if that fixes it. Maybe something is wrong with this version and it does not correctly report its GL features when queried.

Thanks for your help so far m4xw, the instructions for the gdb commands were very useful.
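A side note on the GL_MAJOR_VERSION observation, separate from the null function pointers discussed in this thread: GL_MAJOR_VERSION and GL_MINOR_VERSION only exist from OpenGL 3.0 and OpenGL ES 3.0 onward, so an ES 2.0 context has to take its version from the GL_VERSION string. The sketch below shows one way to parse that string. `GlVersion` and `parse_gl_version` are hypothetical names, not GLideN64 code, and the parser assumes the "OpenGL ES <major>.<minor> ..." layout used by ES 2.0 and later (or a bare "<major>.<minor> ..." on desktop GL).

```cpp
#include <cstdio>

// Hypothetical result type for this example.
struct GlVersion {
    int major;
    int minor;
    bool ok;  // false if the string was null or not in a known layout
};

// Parse a GL_VERSION string such as "OpenGL ES 2.0 <vendor info>" or
// "4.6.0 <vendor info>". Tolerates a null pointer, since a broken
// loader or driver may hand one back.
GlVersion parse_gl_version(const char* version_string)
{
    GlVersion v{0, 0, false};
    if (version_string == nullptr)
        return v;
    if (std::sscanf(version_string, "OpenGL ES %d.%d", &v.major, &v.minor) == 2 ||
        std::sscanf(version_string, "%d.%d", &v.major, &v.minor) == 2) {
        v.ok = true;
    }
    return v;
}
```

With a parser like this, a renderer could decide between ES 2.0 and ES 3.x code paths without ever issuing the GL_MAJOR_VERSION query.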

asdf288 commented 3 years ago

Update: Recompiling the core with "-DGL_USE_DLSYM" did indeed work! Thank you. It just did not compile correctly on my first try. So to fix this issue, here are the steps needed:

Add the following include directive to GLFunctions.cpp: #include <dlfcn.h>

Set the following option before issuing make: CXXFLAGS += " -DGL_USE_DLSYM"

Note: Setting COREFLAGS would have been a nicer way to do it, but for some reason, this makes the compilation fail.

m4xw commented 3 years ago

I wonder why raspberry doesn't need the dlfcn include there, it's probably pulled in by some header hmmmmmm. Regarding COREFLAGS, you need to set it in your platform target in the makefile. Where did you make the change before? While passing it via the command line can work, I highly prefer having all those things explicitly stated in a segment of the makefile.

asdf288 commented 3 years ago

So far, I tried make COREFLAGS="-DGL_USE_DLSYM" but this seems to break something in the Makefile, a lot of the usual flags are suddenly missing and compilation fails. I did not try to add it in the makefile directly yet. I'll do that now. (edit: this works!)

Also, I wasn't sure if the Makefile is the best place to put it, because this problem seems to be driver-specific, not platform specific, and the Makefile does not know which driver is actually being used. Many platforms have at least two possible OpenGL drivers (Mesa and another platform-specific driver).

A possible way could be to query pkg-config: there's a file "glesv2.pc" that contains information about the OpenGL driver that is currently being used on the system. For example, the "Description" field is "Mali OpenGL ES 2.0" with libmali and "Mesa OpenGL ES 2.0" with Mesa.

m4xw commented 3 years ago

> I tried make COREFLAGS="-DGL_USE_DLSYM" but this seems to break something in the Makefile, a lot of the usual flags are suddenly missing and compilation fails.

I never tried that.. so not supported :P

> Also, I wasn't sure if the Makefile is the best place to put it, because this problem seems to be driver-specific, not platform specific, and the Makefile does not know which driver is actually being used. Many platforms have at least two possible OpenGL drivers (Mesa and another platform-specific driver).

Fwiw this could be a frontend issue in the context driver like I said, that's kinda what the hack works around. It should work as long as any shared library that exposes the GL funcs is loaded. I used to call it YOLO_DLSYM :P
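The idea behind the hack can be sketched as follows: if the frontend callback cannot resolve a name, look it up with dlsym(RTLD_DEFAULT, ...), which searches every shared object already loaded into the process. This is an illustration under assumptions, not the macro actually used in GLFunctions.cpp (which may replace the lookup entirely rather than fall back); `ProcFn`, `GetProcAddress` and `resolve_with_dlsym_fallback` are hypothetical names.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // glibc exposes RTLD_DEFAULT only with this defined
#endif
#include <dlfcn.h>   // dlsym, RTLD_DEFAULT

// Hypothetical callback shape, as a frontend might provide it.
using ProcFn = void (*)();
using GetProcAddress = ProcFn (*)(const char* name);

// Try the primary resolver first; if it is absent or returns nullptr,
// search all libraries currently loaded in the process instead.
ProcFn resolve_with_dlsym_fallback(GetProcAddress primary, const char* name)
{
    if (primary != nullptr) {
        if (ProcFn fn = primary(name))
            return fn;
    }
    // Returns nullptr if no loaded library exports `name`.
    void* sym = dlsym(RTLD_DEFAULT, name);
    return reinterpret_cast<ProcFn>(sym);
}
```

The lookup only succeeds if a library exporting the GL functions (libmali in this thread) is already loaded, which matches the condition stated above. On older glibc versions, linking may additionally require -ldl.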