segmentation violation on unix port

uraich commented 3 years ago

When trying to use the SDL driver on the unix port I get a segmentation violation: running import SDL followed by SDL.init() crashes.

embeddedt commented 3 years ago

Do you have an Nvidia graphics card in your computer? The Unix port has issues with Nvidia cards which we haven't been able to track down. See #46.

amirgon commented 3 years ago

Hi @uraich ! Since we are unable to reproduce your problem on our side, we would need your help debugging this.
Could you provide the stack trace of the crash? You can obtain it by running Micropython under gdb. Something like gdb --args micropython ...

uraich commented 3 years ago

I think embeddedt gave the answer: I do have an NVidia graphics card

amirgon commented 3 years ago

I think embeddedt gave the answer: I do have an NVidia graphics card

@embeddedt Do you have an NVidia graphics card? Would you consider diving into this once again?

I can suggest the following:

Run it with valgrind, perhaps there is some memory corruption
Try to obtain the sources or at least the debug symbols of libnvidia-glcore and get a more meaningful stack trace than this one
Try to ask on nvidia forums, or contact nvidia support
Open a ticket on nvidia issue tracker
Just as a test, we can try changing the SDL driver back to use an SDL thread instead of a Micropython thread. I believe the original problem was related to callbacks (we want to run Micropython callbacks on a Micropython thread), so if this is the issue we can still use the SDL thread but be careful to trigger callbacks from the Micropython thread.

uraich commented 3 years ago

This is what I see when I run lv_micropython in gdb

embeddedt commented 3 years ago

Is this the same sequence of steps you ran to get it to segfault? It doesn't appear to have crashed yet.

uraich commented 3 years ago

Yes, the same sequence. Without gdb I see this:

amirgon commented 3 years ago

@uraich SIGUSR1 is used internally in lv_micropython and should be ignored. Please run in gdb (before run):

handle SIGUSR1 nostop noprint pass

uraich commented 3 years ago

Correct! So it is in the nvidia-glcore library.

embeddedt commented 3 years ago

@amirgon Yes; I have an Nvidia card.

I've been reading some documents about SDL, and it appears that in order to be compliant with its requirements, we need to ensure that all SDL rendering is handled on our initial main thread. It appears that calling SDL functions from other threads is known to cause issues.

Is SDL always invoked from a specific thread, or can it be invoked by any thread depending on what MicroPython is doing?

amirgon commented 3 years ago

we need to ensure that all SDL rendering is handled on our initial main thread

@embeddedt Do you mean, from the same thread that initialized SDL?

Is SDL always invoked from a specific thread, or can it be invoked by any thread depending on what MicroPython is doing?

I think that SDL is initialized and rendered from the same thread all the time.

Here is how it works:

mp_init_SDL is called from Micropython main thread when we call SDL.init() from Micropython. It calls monitor_init and initializes SDL.

https://github.com/lvgl/lv_binding_micropython/blob/6e2af53e9dc3042dfa6f9cd03e0b5c4ca7042d51/driver/SDL/modSDL.c#L73

mp_init_SDL creates a new thread tick_thread, but this thread does not do the rendering directly. It only schedules a call to Micropython:

https://github.com/lvgl/lv_binding_micropython/blob/6e2af53e9dc3042dfa6f9cd03e0b5c4ca7042d51/driver/SDL/modSDL.c#L33-L45

When Micropython is ready it performs scheduled tasks and calls mp_lv_task_handler which performs LVGL and SDL rendering:

https://github.com/lvgl/lv_binding_micropython/blob/6e2af53e9dc3042dfa6f9cd03e0b5c4ca7042d51/driver/SDL/modSDL.c#L23-L28

There is an open question here.

When Micropython performs scheduled tasks, is it doing it always from the same thread? I think it is... but just to make sure it's worth adding some printing of Thread-ID to mp_lv_task_handler.

Looking at Micropython code, it's not entirely clear.
mp_handle_pending is the function in Micropython that runs scheduled tasks, but it is called in different places, specifically by MP_HAL_RETRY_SYSCALL which itself is also called in different places
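
The thread-ID check suggested above can be modelled outside the driver first. This is a minimal CPython sketch, not the actual modSDL.c code. A stand-in tick thread only enqueues work, the main thread drains the queue (playing the role of mp_handle_pending), and the handler records which thread it ran on. The names pending, run_pending and handler_thread_ids are hypothetical.

```python
import queue
import threading
import time

init_thread_id = threading.get_ident()   # thread that would call SDL.init()
pending = queue.Queue()                  # stand-in for MicroPython's scheduler queue
handler_thread_ids = []                  # thread IDs observed inside the handler


def task_handler():
    """Stand-in for mp_lv_task_handler: record the calling thread."""
    handler_thread_ids.append(threading.get_ident())


def tick_thread(ticks, period_s):
    """Stand-in for tick_thread: never renders, only schedules the handler."""
    for _ in range(ticks):
        pending.put(task_handler)
        time.sleep(period_s)


def run_pending():
    """Stand-in for mp_handle_pending: run scheduled tasks on the caller's thread."""
    while True:
        try:
            task = pending.get_nowait()
        except queue.Empty:
            return
        task()


worker = threading.Thread(target=tick_thread, args=(5, 0.001))
worker.start()
while worker.is_alive():
    run_pending()
worker.join()
run_pending()  # drain anything scheduled just before the thread exited

same_thread = all(tid == init_thread_id for tid in handler_thread_ids)
print("handler always ran on the init thread:", same_thread)
```

In the real driver the equivalent check would be to print the thread ID inside mp_lv_task_handler and compare it with the ID of the thread that called mp_init_SDL.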

embeddedt commented 3 years ago

Do you mean, from the same thread that initialized SDL?

Unfortunately, it's even stricter than that. It looks like SDL operations always need to be done on the initial main thread (i.e. the one which main(argc, argv) runs in at the start of the program). Doing them on a single thread consistently isn't enough.

Is the "Micropython main thread" the same thread as main(argc, argv), or does MicroPython spawn its own thread initially and use that for the rest of the program's lifetime?

amirgon commented 3 years ago

Is the "Micropython main thread" the same thread as main(argc, argv), or does MicroPython spawn its own thread initially and use that for the rest of the program's lifetime?

Looking at main.c I don't see any explicit creation of a new thread. Also, in the stack trace above it's clear that the call to SDL refresh comes from the same thread in which main was invoked.

But to make sure, I suggest printing thread-id and checking if it's the same even when the problem happens.

amirgon commented 3 years ago

Another idea -
@uraich - Could you try running it with gdb again until it crashes, and show the stack trace of all threads? We would be able to tell if there are other threads in the process and what they are doing.

gdb command:

thread apply all bt
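
To make the capture repeatable, the two gdb commands mentioned in this thread can be combined into one non-interactive run. The sketch below only builds the command line. It relies on gdb's -batch, -ex and --args options, and the binary path and script name are placeholders, not files from this issue.

```python
import shlex


def gdb_batch_command(binary, *program_args):
    """Build a gdb command that ignores SIGUSR1, runs the program,
    and dumps the stack of every thread once the program stops."""
    return [
        "gdb", "-batch",
        "-ex", "handle SIGUSR1 nostop noprint pass",
        "-ex", "run",
        "-ex", "thread apply all bt",
        "--args", binary, *program_args,
    ]


# Placeholder binary and script names, for illustration only.
cmd = gdb_batch_command("./micropython", "crash_test.py")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed line pasted into a shell.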

embeddedt commented 3 years ago

No time to debug this right now, but assuming that Thread 1 is the main thread, it looks like we aren't violating any SDL requirements.

Thread 2 (Thread 0x7fffeffed700 (LWP 8085)):
#0  0x00007ffff7bc7c70 in __GI___nanosleep (
    requested_time=requested_time@entry=0x7fffeffece60, 
    remaining=remaining@entry=0x7fffeffece50)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007ffff71afad5 in SDL_Delay_REAL (ms=<optimized out>)
    at /tmp/SDL2-2.0.10/src/timer/unix/SDL_systimer.c:211
#2  0x0000555555653129 in ?? ()
#3  0x00007ffff71134ac in SDL_RunThread (data=0x555555d1d7e0)
    at /tmp/SDL2-2.0.10/src/thread/SDL_thread.c:283
#4  0x00007ffff71aa0a9 in RunThread (data=<optimized out>)
    at /tmp/SDL2-2.0.10/src/thread/pthread/SDL_systhread.c:79
#5  0x00007ffff7bbd6db in start_thread (arg=0x7fffeffed700)
    at pthread_create.c:463
#6  0x00007ffff6dc1a3f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7faf740 (LWP 8081)):
#0  0x00007ffff1fe1447 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#1  0x00007ffff1fe1ac3 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
--Type <RET> for more, q to quit, c to continue without paging--
#2  0x00007ffff1fb9c9e in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#3  0x00007ffff1fc7e9b in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#4  0x00007ffff1fd10dc in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#5  0x00007ffff70f1d40 in GL_RunCommandQueue (renderer=0x555555c80d40, cmd=0x555555d1a550, vertices=0x555555d1a590, vertsize=<optimized out>) at /tmp/SDL2-2.0.10/src/render/opengl/SDL_render_gl.c:1270
#6  0x00007ffff70e9e11 in FlushRenderCommands (renderer=0x555555c80d40) at /tmp/SDL2-2.0.10/src/render/SDL_render.c:218
#7  SDL_RenderPresent_REAL (renderer=0x555555c80d40) at /tmp/SDL2-2.0.10/src/render/SDL_render.c:3130
#8  0x0000555555652f37 in ?? ()
#9  0x0000555555653169 in ?? ()
#10 0x00005555555b7954 in ?? ()
#11 0x00005555555b8d7a in ?? ()
#12 0x00005555555b8e90 in ?? ()
#13 0x00005555555d4d51 in ?? ()
#14 0x000055555565399a in ?? ()
#15 0x00005555555d4935 in ?? ()
#16 0x00007ffff6cc1b97 in __libc_start_main (main=0x5555555a537b <main>, argc=1, argv=0x7fffffffdbe8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdbd8)
    at ../csu/libc-start.c:310
#17 0x00005555555a53da in ?? ()

uraich commented 3 years ago

Stupid question: If it was a problem from which thread SDL_init is called, should we then not have the same problem independently of the display driver? When I run my nvidia graphics cards with the nouveau driver, which works quite ok now, then the problem is gone.

embeddedt commented 3 years ago

@uraich Thanks for testing that. This suggests that the problem is likely somewhere in the Nvidia proprietary drivers.

If it was a problem from which thread SDL_init is called, should we then not have the same problem independently of the display driver?

Not necessarily, because SDL is a thin layer over driver-specific implementations, each of which may have their own threading constraints.

amirgon commented 3 years ago

I have an idea.

By default, the SDL driver creates a "tick" thread that calls lv_tick_inc and schedules a call to lv_task_handler. Let's check if this crash is related to this thread or not. We discussed this above but I don't think we tried to completely disable this thread.

On the latest version of the SDL driver you can pass an optional parameter auto_refresh. It's True by default, but if you set it to False the driver will not create the "tick" thread and the user is responsible for calling lv_tick_inc and lv_task_handler. These can simply be called in a loop, or more sensibly as part of a uasyncio event loop.

Something like this:

import uasyncio
from async_utils import lv_async
import lvgl as lv
import SDL
lv.init()

# Register SDL display driver, without the event loop (auto_refresh set to False)

SDL.init(auto_refresh=False)

disp_buf1 = lv.disp_buf_t()
buf1_1 = bytes(480 * 10)
disp_buf1.init(buf1_1, None, len(buf1_1)//4)
disp_drv = lv.disp_drv_t()
disp_drv.init()
disp_drv.buffer = disp_buf1
disp_drv.flush_cb = SDL.monitor_flush
disp_drv.hor_res = 480
disp_drv.ver_res = 320
disp_drv.register()

# Register SDL mouse driver

indev_drv = lv.indev_drv_t()
indev_drv.init()
indev_drv.type = lv.INDEV_TYPE.POINTER
indev_drv.read_cb = SDL.mouse_read
indev_drv.register()

# Create a screen

scr = lv.obj()
btn = lv.btn(scr)
btn.align(scr, lv.ALIGN.CENTER, 0, 0)
label = lv.label(btn)
label.set_text('Hello World!')
lv.scr_load(scr)

# Event loop

lva = lv_async(refresh_func = SDL.refresh)
uasyncio.Loop.run_forever()

Could someone with Nvidia graphics card check if the problem happens with the code above? If it doesn't - it would mean that the tick thread is very probably related to the problem.
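
For comparison, the simpler "plain loop" option mentioned above could look roughly like this. It is a sketch that runs under CPython with stand-in callables. On lv_micropython the assumption is that lv.tick_inc and lv.task_handler would be passed in instead. The function name run_manual_loop and its parameters are hypothetical.

```python
import time


def run_manual_loop(tick_inc, task_handler, period_ms=5, iterations=None):
    """Drive the tick and the task handler from the calling thread.

    tick_inc     -- callable taking the elapsed milliseconds
    task_handler -- callable taking no arguments
    iterations   -- number of passes, or None to loop forever
    """
    count = 0
    while iterations is None or count < iterations:
        tick_inc(period_ms)
        task_handler()
        time.sleep(period_ms / 1000)
        count += 1
    return count


# Stand-ins that only record the calls, so the loop can be tried without LVGL.
ticks = []
handled = []
passes = run_manual_loop(ticks.append, lambda: handled.append(True),
                         period_ms=1, iterations=3)
print(passes, ticks, len(handled))
```

Because everything runs on the calling thread, no second thread exists at all in this variant, which is the point of the experiment.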

embeddedt commented 3 years ago

@amirgon I've tried this and unfortunately the crash still happens.

amirgon commented 3 years ago

@amirgon I've tried this and unfortunately the crash still happens.

What about the regular (non-micropython) SDL driver? Does it crash sometimes with Nvidia? Or is this problem limited to the Micropython version of the SDL driver?

embeddedt commented 3 years ago

The normal SDL driver used by the PC simulator has never crashed for me.

amirgon commented 3 years ago

The normal SDL driver used by the PC simulator has never crashed for me.

That's interesting because originally (a long time ago) the Micropython SDL driver was derived from the "normal" SDL driver.
So either there is some difference between them, or something in Micropython itself is causing the problem.

Here are possible ways to check this:

Diff the Micropython SDL driver vs. the "normal" SDL driver, maybe something would pop up
Run the "normal" SDL with Micropython instead of the Micropython SDL driver. Eventually Micropython is a C program so it's possible to "hard-wire" it to the "normal" SDL driver and see what happens.

embeddedt commented 3 years ago

I swapped out the current MicroPython SDL driver for a copy of the PC one (with minor modifications) and the issue is still happening. This suggests that MicroPython is interfering with SDL's operation. The two drivers look quite similar still so I doubt the issue is in the driver.

amirgon commented 3 years ago

This suggests that MicroPython is interfering with SDL's operation

This is very strange. Micropython core does not do anything related to SDL or any kind of graphics. It's just a console application.

Does it happen on SDL.init()? Or later?
If it happens on SDL initialization, I suggest verifying this again by taking a fresh upstream Micropython without any LVGL or Bindings code and adding only the few lines of SDL initialization code directly to the unix port "main" function, just to prove that Micropython is the culprit and that nothing related to LVGL, the Bindings, or the SDL driver is involved.

Another thing worth trying (with fresh Micropython + SDL init, or with the uasyncio script above) is turning off multi-threading in Micropython. I remember there were some limitations to the SDL driver related to threads, so it's worth checking whether it's related. To turn off multi-threading, set MICROPY_PY_THREAD to 0 in mpconfigport.mk (it will probably also work to simply build the unix port with make -C ports/unix MICROPY_PY_THREAD=0).

embeddedt commented 3 years ago

Results:

It does not happen on upstream MicroPython when added to main.
It also does not happen on lv_micropython when added to main in the same spot.

I will try the threading suggestion and see what happens.

amirgon commented 3 years ago

It also does not happen on lv_micropython when added to main in the same spot.

Does it happen when adding it in lv_init() C code instead?

embeddedt commented 3 years ago

@amirgon I was able to reproduce it on upstream MicroPython by adding in the initial calls to render the first frame (gray background).

The following test case works in a standalone C file, but fails in MicroPython's main function:

#include <stdint.h>
#include <string.h>
#include <SDL2/SDL.h>

#define MONITOR_HOR_RES 480
#define MONITOR_VER_RES 272
#define MONITOR_ZOOM 1

/* Standalone form of the test. In MicroPython the same body was placed in main(). */
int main(int argc, char ** argv)
{
    (void)argc;
    (void)argv;

    SDL_Init(SDL_INIT_VIDEO);

    SDL_Window * window = SDL_CreateWindow("TFT Simulator",
                              SDL_WINDOWPOS_UNDEFINED, SDL_WINDOWPOS_UNDEFINED,
                              MONITOR_HOR_RES * MONITOR_ZOOM, MONITOR_VER_RES * MONITOR_ZOOM, 0);       /*last param. SDL_WINDOW_BORDERLESS to hide borders*/

    SDL_Renderer * renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_SOFTWARE);

    SDL_Texture * texture = SDL_CreateTexture(renderer,
                                SDL_PIXELFORMAT_ARGB8888, SDL_TEXTUREACCESS_STATIC, MONITOR_HOR_RES, MONITOR_VER_RES);
    SDL_SetTextureBlendMode(texture, SDL_BLENDMODE_BLEND);

    /*Fill the frame buffer with a uniform gray and upload it to the texture*/
    static uint32_t tft_fb[MONITOR_HOR_RES * MONITOR_VER_RES];
    memset(tft_fb, 0x44, MONITOR_HOR_RES * MONITOR_VER_RES * sizeof(uint32_t));
    SDL_UpdateTexture(texture, NULL, tft_fb, MONITOR_HOR_RES * sizeof(uint32_t));
    SDL_RenderClear(renderer);

    /*Update the renderer with the texture containing the rendered image*/
    SDL_RenderCopy(renderer, texture, NULL, NULL);
    SDL_RenderPresent(renderer);

    return 0;
}

We are getting somewhere, finally!

amirgon commented 3 years ago

The following test case works in a standalone C file, but fails in MicroPython's main function

Very interesting! Did disabling threading make any difference?

embeddedt commented 3 years ago

Didn't think to try that on upstream. Let me see.

embeddedt commented 3 years ago

It does not, from what I can see. I compiled with make -j4 MICROPY_PY_THREAD=0 (clean build).

embeddedt commented 3 years ago

This is very strange because with threading disabled, nothing gets called before the SDL functions. I wonder what happens if I insert a while(1) loop to block the rest of MicroPython from executing.

EDIT: That changes nothing.

amirgon commented 3 years ago

It does not, from what I can see. I compiled with make -j4 MICROPY_PY_THREAD=0 (clean build).

Could you try commenting out the call to mp_thread_init? It should already be disabled by setting MICROPY_PY_THREAD to 0, but I just want to make sure the setting propagates from the Makefile all the way there.

embeddedt commented 3 years ago

Yes; I put an #error statement inside there and it did not fire. I will try commenting it too to be sure.

embeddedt commented 3 years ago

It behaves the same way.

amirgon commented 3 years ago

So let me try to understand what's happening here:

SDL test code on a standalone application runs well.
The same SDL test code causes SIGSEGV immediately when running in Micropython as the first thing that runs from "main"

If this is true, then the problem must be related to compilation flags/linking flags of Micropython, or possibly a linker script although I don't think there is any on the unix port.
I don't see any other difference between a test application and the first line on main in Micropython.

If this is the situation then the next step is to build in verbose mode and see all the compilation/linking flags, then try them on the standalone application and see if the problem is reproduced there.

To build in verbose mode: make -C ports/unix V=1

embeddedt commented 3 years ago

The only custom flags I see are these: -lffi -ldl -Wl,-Map=micropython.map,--cref -Wl,--gc-sections -lm -lSDL2 -O0 -fdata-sections -ffunction-sections, but the test application still works when they are applied.

I am trying to think of what else could be different. My theory now is that the larger binary size is the problem, but that doesn't make sense on a PC.

amirgon commented 3 years ago

Another difference - shared libraries.

Maybe Micropython loads a shared library that causes the problem somehow? You can use ldd to print shared library dependencies.
Try to run ldd ports/unix/micropython and compare the result with ldd your-test-app. Probably there are a lot of differences but maybe something would catch your eye.

You can also try to reproduce the problem on your test application by loading all of Micropython's libraries explicitly.
Maybe the simplest way would be to dlopen each of them before running the SDL test code.
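
The ldd comparison can be scripted so the difference is computed rather than spotted by eye. This is a rough sketch: parse_ldd, ldd and only_in_first are hypothetical helper names, and the two sample texts are made-up ldd-style output (with placeholder addresses) used only to show the parsing, not real output from either binary.

```python
import subprocess


def parse_ldd(output):
    """Return the set of library names (first token of each ldd line)."""
    libs = set()
    for line in output.splitlines():
        line = line.strip()
        if line:
            libs.add(line.split()[0])
    return libs


def ldd(path):
    """Run ldd on a binary and return its set of library names."""
    result = subprocess.run(["ldd", path], capture_output=True,
                            text=True, check=True)
    return parse_ldd(result.stdout)


def only_in_first(first, second):
    """Libraries that the first binary loads and the second does not."""
    return sorted(first - second)


# Made-up sample text, for illustration only.
SAMPLE_MICROPYTHON = """
    libffi.so.6 => /usr/lib/x86_64-linux-gnu/libffi.so.6 (0x0)
    libSDL2-2.0.so.0 => /usr/lib/x86_64-linux-gnu/libSDL2-2.0.so.0 (0x0)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0)
"""
SAMPLE_TEST_APP = """
    libSDL2-2.0.so.0 => /usr/lib/x86_64-linux-gnu/libSDL2-2.0.so.0 (0x0)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0)
"""

extra = only_in_first(parse_ldd(SAMPLE_MICROPYTHON), parse_ldd(SAMPLE_TEST_APP))
print(extra)
```

With real binaries, the comparison would be only_in_first(ldd("ports/unix/micropython"), ldd("your-test-app")).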

embeddedt commented 3 years ago

Interestingly the only difference is libffi. I tried to dlopen it at the start but that does not change anything.

amirgon commented 3 years ago

I tried to dlopen it at the start but that does not change anything.

I think you also need to provide the RTLD_NOW flag; otherwise it's a lazy load. But I doubt libffi is the problem.
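
For reference, this is what an eager load looks like when tried from a Python prompt. It is a CPython ctypes sketch, not the C test application discussed here, and preload_now is a hypothetical helper name. os.RTLD_NOW and os.RTLD_GLOBAL are the standard dlopen flags exposed by the os module on Unix.

```python
import ctypes
import ctypes.util
import os


def preload_now(name):
    """Load a shared library with all symbols resolved immediately.

    Returns the loaded library handle, or None if the library
    cannot be located on this system.
    """
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    return ctypes.CDLL(path, mode=os.RTLD_NOW | os.RTLD_GLOBAL)


lib = preload_now("ffi")
print("libffi loaded eagerly:", lib is not None)
```

In C the corresponding call would be dlopen with the same RTLD_NOW flag, made before any SDL function runs.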

embeddedt commented 3 years ago

Unfortunately that was the flag I used! I will have to keep thinking about it - I have never seen a problem like this on a host system, only on embedded systems where things get corrupted easily.

amirgon commented 3 years ago

That's a true mystery!
What could make two applications consistently behave so differently, assuming they run the same code, are compiled with the same flags, and load the same libraries...

Sounds like a good question for Stack Overflow.

embeddedt commented 3 years ago

Indeed. I even tried dumping the ELF files to see whether the sections are any different, but they both have the same sections.

Merry Christmas!

amirgon commented 3 years ago

One more thing you could try is to run it with a debugger from the beginning and trace everything that's being called. There could be some code that automatically runs before main, such as a C constructor, a signal handler, or some library initialization code. Then you could compare the traces between your standalone application and Micropython.

Merry Christmas to you too!

X-Ryl669 commented 2 years ago

There's a bunch of code that's executed in functions marked as __attribute__((constructor)) in shared libraries. So it's very difficult to figure out the reason for the difference by looking only at main's file.

Typically, I had an issue like this once and it was due to nvidia's OpenGL library taking the address of all the underlying system OpenGL library functions it didn't implement. Some higher level code (GLFW IIRC) was swapping some OpenGL functions with its own, so when NVidia's code called the system's function, the wrong function was called. At that time, I solved the issue by changing the library loading order so that the higher level code was loaded first.

A good test would be to remove libraries one by one until you find the culprit. You can try LD_BIND_NOW=1 ./test to force all symbols to be resolved at load time rather than lazily. Or you can list all libraries with ldd and then objdump them all to find all the symbols in the DL_INIT sections. Then place a GDB breakpoint on each of them (warning: there are many), and launch your crashing application. It will hit each function in some specific order. You can exit a constructor function without executing it by setting the $pc register to the ret instruction (or by unwinding the stack if you stopped after the stack setup). You can use gdb's return command here to skip executing the function, since most constructor functions return void. You may then be able to pinpoint which function is causing the crash: if the crash does not happen after you've skipped function XXX, then look at what function XXX does, if you have the source code for it.

embeddedt commented 2 years ago

Thanks, @X-Ryl669, for this information. It was a very helpful explanation and makes a lot of sense. I can see why the proprietary Nvidia driver is frowned upon by Linux users, as this function-swapping approach sounds quite fragile.

In the meantime, while playing around with various environment variables to try and get to the root of the issue, I have just found a workaround that is probably good enough for the time being: launching MicroPython with __GLX_VENDOR_LIBRARY_NAME=mesa LIBGL_ALWAYS_SOFTWARE=1 will skip the Nvidia implementation entirely and use software rendering, thus avoiding the crash. On my i3-4150, I don't see any noticeable performance loss in advanced_demo.py compared to when I tested with the Nouveau driver.
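
If the workaround is needed from a test script rather than typed at a shell, the two variables can be set on the child process only. This is a small sketch: software_gl_env and run_with_software_gl are hypothetical helper names, and only the two environment variables come from the workaround described above.

```python
import os
import subprocess


def software_gl_env(base=None):
    """Copy an environment and add the two variables that select Mesa
    software rendering instead of the Nvidia GLX implementation."""
    env = dict(os.environ if base is None else base)
    env["__GLX_VENDOR_LIBRARY_NAME"] = "mesa"
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    return env


def run_with_software_gl(argv):
    """Run a command (for example the micropython binary) with the workaround applied."""
    return subprocess.run(argv, env=software_gl_env())


env = software_gl_env({"PATH": "/usr/bin"})
print(env)
```

Setting the variables per process leaves other applications on the machine using the hardware driver as before.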

X-Ryl669 commented 2 years ago

For the few things that SDL is doing with OpenGL, it's clear there is no benefit from an advanced Linux driver; a basic software driver will work too.

embeddedt commented 2 years ago

I've reopened this for now since #215 didn't actually fix it, but does it need to stay open or should we just close it since there haven't been any further reports/issues?

amirgon commented 2 years ago

I vote for keeping this open, as a reminder that this is not resolved yet.
It's easier to forget about closed issues.

I'm actually not sure why it was automatically closed with #215, I probably did something wrong.

embeddedt commented 2 years ago

The PR description had the phrase "fix #97" in it. That will make GitHub close the issue if it's merged. Unfortunately it's not smart enough to check the wording around that to see if it's a question or a statement. :laughing:

lvgl / lv_binding_micropython

segmentation violation on unix port #97