leezer3 / OpenBVE

OpenBVE- A free train simulator
http://www.openbve-project.net
277 stars 52 forks source link

[Bug] OpenBVE freezes at the loading screen #1050

Open JakubVanek opened 3 months ago

JakubVanek commented 3 months ago

Description

On some Linux OpenGL drivers, OpenBVE will freeze during the "Loading track" screen. The CPU is just idle in that state.

Reproduction

Cause

I have debugged this and it turns out that OpenTK deadlocks the Mesa driver. The game freezes because the SwapBuffers() call never returns: https://github.com/leezer3/OpenBVE/blob/2539167a5b454e059291c7779e5152bd93c98cbe/source/OpenBVE/System/GameWindow.cs#L1125 .

The cause is that OpenTK calls XLockDisplay() before calling Glx.SwapBuffers() (link). The Mesa driver in glthread mode also wants to acquire the lock. The situation can be described like this:

I have found https://github.com/opentk/opentk/pull/691 and that looks somewhat similar, but that applies only to other GLX calls.

When the glthread mode is disabled, I assume that this deadlock will not happen because the GL commands will be processed from within the OpenBVE thread.

Workarounds

DRI configuration file contents ```xml ]> ```

Related issues

It seems to me that https://github.com/leezer3/OpenBVE/issues/944 might be the same issue.

Route

Any route is likely sufficient, but http://md.archive.ubuntu.com/ubuntu/ubuntu/pool/universe/b/bve-route-cross-city-south/bve-route-cross-city-south_1.31.08-0ubuntu2_all.deb was used.

Train

Any train is likely sufficient , but https://packages.ubuntu.com/jammy/bve-train-br-class-323 was used.

Logs

These are from the Intel machine. If needed, I can also provide logs from the AMD machine, but these logs were fairly similar.

OpenBVE log.txt ``` OpenBVE Log: 2024-07-22 12:22:09 Program Version: v1.10.1.1 Attached Joysticks: -------------------- -------------------- 12:22:09 Using openGL 3.0 (new) renderer 12:22:09 Initialising game window of size 960 x 600 12:22:09 Creating game window with standard context. 12:22:09 Game window initialised successfully. 12:22:09 Renderer initialised successfully. 12:22:09 /usr/share/games/bve/Railway : Railway folder found. 12:22:09 INFO: 2 Route loading plugins available. 12:22:09 INFO: 6 Object loading plugins available. 12:22:09 INFO: 2 Sound loading plugins available. 12:22:09 Load in Advance is disabled 12:22:09 Loading route file: /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:09 INFO: Route file hash 2A19D371A0F6AB76613603C1B883A17A39A68773123D67FE7CDFB77DAC9C5FAD 12:22:09 Route file format is: CSV 12:22:09 INFO: Using the Japanese compatibility signal set. 12:22:13 RailIndex 3 does not reference an existing dike in Track.DikeEnd at line 1721, column 13 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 3 does not reference an existing dike in Track.DikeEnd at line 1729, column 13 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 2 does not reference an existing dike in Track.DikeEnd at line 1930, column 17 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 0 does not reference an existing dike in Track.DikeEnd at line 2062, column 4 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 4 does not reference an existing wall in Track.WallEnd at line 2166, column 5 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 1 does not reference an existing dike in Track.DikeEnd at line 2255, column 22 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 0 does not reference an existing dike in Track.DikeEnd at line 2583, column 3 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 2 does not reference an existing dike in Track.DikeEnd at line 2768, column 15 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 0 does not reference an existing wall in Track.WallEnd at line 2926, column 10 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 3 does not reference an existing wall in Track.WallEnd at line 2926, column 12 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 0 does not reference an existing dike in Track.DikeEnd at line 2926, column 14 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 1 does not reference an existing dike in Track.DikeEnd at line 2926, column 15 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 2 does not reference an existing dike in Track.DikeEnd at line 2926, column 16 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 3 does not reference an existing dike in Track.DikeEnd at line 2926, column 17 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 1 does not reference an existing wall in Track.WallEnd at line 2934, column 4 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 9 does not reference an existing dike in Track.DikeEnd at line 2980, column 2 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 9 does not reference an existing dike in Track.DikeEnd at line 2990, column 3 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 4 does not reference an existing dike in Track.DikeEnd at line 2990, column 4 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:13 RailIndex 1 does not reference an existing dike in Track.DikeEnd at line 3008, column 36 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv 12:22:14 Route file loaded successfully. 12:22:14 Loading player train: /usr/share/games/bve/Train/BR_Class_323 12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 The #ACCELERATION section defines 8 curves, but the #HANDLE section defines 4 power notches in /usr/share/games/bve/Train/BR_Class_323/train.dat 12:22:14 Loading train panel: /usr/share/games/bve/Train/BR_Class_323/panel.animated 12:22:14 INFO: This train contains both a 2D and a 3D panel. The 3D panel will always take precedence 12:22:14 Train panel loaded sucessfully. 12:22:14 Loading sound.cfg file: /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 The SoundFile airzero.wav was not found at line 55 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 99 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 99 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 99 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 100 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 100 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 100 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 101 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 101 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 101 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 102 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 102 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 102 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 FileName contains illegal characters or is empty at line 117 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg 12:22:14 The #ACCELERATION section defines 0 curves, but the #HANDLE section defines 8 power notches in /usr/lib/openbve/Data/Compatibility/PreTrain/train.dat ```
GDB stacktrace of the Mesa GL thread when the deadlock happens ``` (gdb) bt #0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x57fb3b781088) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x57fb3b781088) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x57fb3b781088, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139 #3 0x00007091d2093a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x57fb3ba117b0, cond=0x57fb3b781060) at ./nptl/pthread_cond_wait.c:503 #4 ___pthread_cond_wait (cond=0x57fb3b781060, mutex=0x57fb3ba117b0) at ./nptl/pthread_cond_wait.c:627 --- JV NOTE: This is the Mesa GL thread wanting to lock the X11 lock --- #5 0x00007091cdae71a5 in _XDisplayLockWait (dpy=0x57fb3ba1a8f0) at ../../src/locking.c:451 #6 0x00007091cdb0180c in return_socket (closure=0x57fb3ba1a8f0) at ../../src/xcb_io.c:56 #7 0x00007091cebb3b86 in get_socket_back (c=0x57fb3bb24510) at ../../src/xcb_out.c:96 #8 get_socket_back (c=c@entry=0x57fb3bb24510) at ../../src/xcb_out.c:87 #9 0x00007091cebbb5a6 in prepare_socket_request (c=0x57fb3bb24510) at ../../src/xcb_out.c:126 #10 send_fds (num_fds=0, fds=0x0, c=0x57fb3bb24510) at ../../src/xcb_out.c:196 #11 xcb_send_request_with_fds64 (c=0x57fb3bb24510, flags=flags@entry=1, vector=vector@entry=0x7091497f9600, req=req@entry=0x7091c5204c60 , num_fds=num_fds@entry=0, fds=fds@entry=0x0) at ../../src/xcb_out.c:316 #12 0x00007091cebbbbcf in xcb_send_request64 (c=, flags=flags@entry=1, vector=vector@entry=0x7091497f9600, req=req@entry=0x7091c5204c60 ) at ../../src/xcb_out.c:349 #13 0x00007091cebbbbdd in xcb_send_request (c=, flags=flags@entry=1, vector=vector@entry=0x7091497f9600, req=req@entry=0x7091c5204c60 ) at ../../src/xcb_out.c:355 #14 0x00007091c5201986 in xcb_dri3_get_supported_modifiers (c=, window=, depth=depth@entry=24 '\030', bpp=bpp@entry=32 ' ') at /build/libxcb-9z6Jre/libxcb-1.14/build/src/dri3.c:522 #15 0x00007091a73576a8 in dri3_alloc_render_buffer (draw=draw@entry=0x57fb3b5d11b8, format=format@entry=4099, width=960, height=600, depth=24) at ../src/loader/loader_dri3_helper.c:1479 #16 0x00007091a7359033 in dri3_get_buffer (format=format@entry=4099, buffer_type=buffer_type@entry=loader_dri3_buffer_back, draw=draw@entry=0x57fb3b5d11b8, driDrawable=) at ../src/loader/loader_dri3_helper.c:2080 #17 0x00007091a7359515 in loader_dri3_get_buffers (driDrawable=, format=4099, stamp=0x57fb3be0a880, loaderPrivate=0x57fb3b5d11b8, buffer_mask=, buffers=0x7091497f9880) at ../src/loader/loader_dri3_helper.c:2283 #18 0x0000709185eb9475 in dri_image_drawable_get_buffers (drawable=drawable@entry=0x57fb3be0a880, images=images@entry=0x7091497f9880, statts=statts@entry=0x57fb3b5f87b0, statts_count=statts_count@entry=2) at ../src/gallium/frontends/dri/dri2.c:316 #19 0x0000709185eb95f9 in dri2_allocate_textures (ctx=0x57fb3bcea270, drawable=0x57fb3be0a880, statts=0x57fb3b5f87b0, statts_count=2) at ../src/gallium/frontends/dri/dri2.c:488 #20 0x0000709185ebc4b1 in dri_st_framebuffer_validate (st=, pdrawable=0x57fb3be0a880, statts=0x57fb3b5f87b0, count=, out=0x7091497f9a70, resolve=0x7091497f9a38) at ../src/gallium/frontends/dri/dri_drawable.c:79 #21 0x0000709185f7f654 in st_framebuffer_validate (stfb=stfb@entry=0x57fb3b5f8350, st=st@entry=0x57fb3b42bf60) at ../src/mesa/state_tracker/st_manager.c:239 #22 0x0000709185f804b8 in st_manager_validate_framebuffers (st=st@entry=0x57fb3b42bf60) at ../src/mesa/state_tracker/st_manager.c:1239 #23 0x00007091861d05f4 in st_update_framebuffer_state (st=0x57fb3b42bf60) at ../src/mesa/state_tracker/st_atom_framebuffer.c:120 #24 0x00007091861d517b in st_validate_state (pipeline_state_mask=1107296512, st=0x57fb3b42bf60) at ../src/util/bitscan.h:117 #25 st_Clear (ctx=0x57fb3cc25240, mask=2) at ../src/mesa/state_tracker/st_cb_clear.c:400 #26 0x00007091860334b8 in _mesa_unmarshal_Clear (ctx=, cmd=) at src/mapi/glapi/gen/marshal_generated1.c:216 #27 0x0000709185f2b302 in glthread_unmarshal_batch (job=job@entry=0x57fb3cc2d408, gdata=gdata@entry=0x0, thread_index=thread_index@entry=0) at ../src/mesa/main/glthread.c:139 #28 0x0000709185ec9978 in util_queue_thread_func (input=input@entry=0x57fb3b7c6180) at ../src/util/u_queue.c:309 #29 0x0000709185f1945b in impl_thrd_routine (p=) at ../src/c11/impl/threads_posix.c:67 #30 0x00007091d2094ac3 in start_thread (arg=) at ./nptl/pthread_create.c:442 #31 0x00007091d2126850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 ```
GDB stacktrace of the OpenBVE renderer thread when the deadlock happens ``` (gdb) bt --- JV NOTE: This is the OpenBVE thread waiting for the Mesa GL thread --- #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 #1 0x0000709185ec3446 in sys_futex (val3=-1, addr2=0x0, timeout=0x0, val1=2, op=9, addr1=0x57fb3cc2d408) at ../src/util/futex.c:43 #2 futex_wait (addr=addr@entry=0x57fb3cc2d408, value=value@entry=2, timeout=timeout@entry=0x0) at ../src/util/futex.c:55 #3 0x0000709185ec957f in do_futex_fence_wait (fence=0x57fb3cc2d408, timeout=timeout@entry=false, abs_timeout=abs_timeout@entry=0) at ../src/util/u_queue.c:130 #4 0x0000709185ec9bcd in _util_queue_fence_wait (fence=) at ../src/util/u_queue.c:145 #5 0x0000709185f2b967 in util_queue_fence_wait (fence=) at ../src/util/u_queue.h:163 #6 _mesa_glthread_finish (ctx=0x57fb3cc25240) at ../src/mesa/main/glthread.c:383 #7 _mesa_glthread_finish (ctx=0x57fb3cc25240) at ../src/mesa/main/glthread.c:364 #8 0x0000709185ebccff in dri_flush (cPriv=0x57fb3bcea270, dPriv=0x57fb3be0a880, flags=3, reason=__DRI2_THROTTLE_SWAPBUFFER) at ../src/gallium/frontends/dri/dri_drawable.c:494 #9 0x00007091a7358292 in loader_dri3_swap_buffers_msc (draw=0x57fb3b5d11b8, target_msc=0, divisor=0, remainder=0, flush_flags=, rects=rects@entry=0x0, n_rects=0, force_copy=false) at ../src/loader/loader_dri3_helper.c:1046 #10 0x00007091a7349df5 in dri3_swap_buffers (pdraw=, target_msc=, divisor=, remainder=, flush=) at ../src/glx/dri3_glx.c:571 #11 0x00007091a73393ff in glXSwapBuffers (dpy=0x57fb3ba1a8f0, drawable=123731970) at ../src/glx/glxcmds.c:677 --- JV NOTE: The frame below is the OpenTK.Platform.X11.Glx:SwapBuffers() call --- #12 0x000000004128fb46 in () #13 0x00007fffdf789a88 in () #14 0x00007091c796e700 in () #15 0x00007091c796e700 in () #16 0x0000000000000006 in () #17 0x00007091c66e9990 in () #18 0x000057fb3a410aa0 in () #19 0x00007fffdf789a20 in () #20 0x00007fffdf789970 in () #21 0x0000000000000000 in () (gdb) p (char*)mono_pmip(0x000000004128fb46) $1 = 0x57fb3d439d30 " (wrapper managed-to-native) OpenTK.Platform.X11.Glx:SwapBuffers (intptr,intptr) [{0x57fb3d2e0db8} + 0xa6] (0x4128faa0 0x4128fbda) [0x57fb3a3f09c0 - OpenBve.exe]" (gdb) ```

If needed, I can also post the full C#/Mono backtrace. However, I remember the following:

Related information

leezer3 commented 3 months ago

OK, that's interesting....

I've only got a pretty elderly Intel integrated graphics (HD3000), which normally is Windows / Hackintosh only. Might be able to get that to reproduce in this case.

Looking into the code, the default timeout for texture loading is 1000ms in the main game ( https://github.com/leezer3/OpenBVE/blob/master/source/OpenBVE/System/GameWindow.cs#L1180 ; you're linking to the Route Viewer method. I'm not sure off the top of my head why there's no timeout here, possibly an oversight) Panel textures have a 20000ms timeout (this is because we've had cases of people using ~20mb JPG files for panels), as they need the texture sizes to calculate element positioning.

I'd suspect there's probably a strong element of machine / disk based timing in this as well.

I'll see what I can do about reproducing with that.

leezer3 commented 3 months ago

Good, that reproduces on the VM with mesa_glthread=true set.

Going to do a little excavating....

leezer3 commented 3 months ago

OK, I think we had a livelock problem there. I've tuned it to use ConcurrentQueues (as opposed to the original lock + 2 queues, which I think was causing this) and simpified it slightly.

I think more people haven't reported this simply due to the fact that timing here was absolutely critical.

@ginga81 The build from today will hopefully resolve your freeze when not running under WINE. If not, it'd be very interesting to see if setting the environment variable above helps you.

(I could probably just set this environment variable process wide, but I'd quite like to actually fix the issue rather than mask it)

JakubVanek commented 3 months ago

If I understand the fix correctly, I think that it does not eliminate the deadlock, it just reduces the likelihood of it.

It might this might be fixed by just removing the using (new XLock(Display)) from OpenTK's X11GLContext.SwapBuffers() (truth to be told, I do not know if that is safe to do). It seemed to me that the bug was not in OpenBVE directly - rather it was a deadlock between OpenTK and Mesa. The infinite loadscreen may still happen if the deadlock occurs -- the LoadingScreenLoop() would not return in that case because it would be stuck in SwapBuffers().

JakubVanek commented 3 months ago

However, I wouldn't rule out that we are seeing two different bugs. The bug that I was seeing could be distinguished by that the SwapBuffers() call would initially succeed, but once the deadlock occurs, it will not return (-> LoadingScreenLoop() stops querying the queue and will also not return). My formulation in the initial bugreport was unfortunate (it implies that any call to SwapBuffers() triggers the deadlock, but that is not the case).

JakubVanek commented 3 months ago

I have tested the commit https://github.com/leezer3/OpenBVE/commit/71689f64a84d8d65b8cb4f0258f407a5bb2cbd04 on the Intel machine and the game is still getting stuck when I use mesa_glthread=true. The GDB backtraces seem to be the same.

The deadlock seems to be reliable (the game never successfully loads with the mesa_glthread=true environment variable and preferNativeBackend set to true in the OpenBVE config file)

leezer3 commented 3 months ago

What I've changed is an issue with queues.

It's a little complex, but basically, if we add an item to a queue from one thread (length > 0), but at the same time are returning the pulse signal from another, one of the two can be silently discarded, despite the fact a lock was in use. e.g. https://stackoverflow.com/questions/3956127/threading-problem-with-monitor-wait-and-monitor-pulse

This then leaves the loop stuck waiting for a return.


It's a pity that this doesn't solve it for you. The lock inside OpenTK is something I'm not too happy about messing with; There are a lot of pointers and magic numbers being thrown around, and if something changes mid render call (screensaver is the obvious answer I can think of), I suspect it'll crash out.

Let me think on it a bit.

JakubVanek commented 3 months ago

I finally got the Mono SDB debugger running on the Intel machine. Here are the C# stacktraces when the deadlock happens:

Output of "thread bt" in Mono sdb debugger when the deadlock happens ``` Thread #6 '' #0 [0x00000000] Interop.Sys.Read #1 [0x00000014] Interop.Sys.Read (no source) call Int32 Sys:Read (IntPtr, Byte*, Int32) #2 [0x00000037] System.IO.CoreFX.FileSystemWatcher.RunningInstance.TryReadEvent (no source) call Int32 Sys:Read (SafeFileHandle, Byte*, Int32) #3 [0x000002D4] System.IO.CoreFX.FileSystemWatcher.RunningInstance.ProcessEvents (no source) call Boolean RunningInstance:TryReadEvent (NotifyEvent) #4 [0x00000006] System.IO.CoreFX.FileSystemWatcher.RunningInstance.<>c.b__14_0 (no source) callvirt Void RunningInstance:ProcessEvents () #5 [0x0000002C] System.Threading.Tasks.Task.InnerInvoke (no source) callvirt Void System.Action`1:Invoke (Object) #6 [0x00000001] System.Threading.Tasks.Task.Execute (no source) callvirt Void System.Threading.Tasks.Task:InnerInvoke () #7 [0x00000006] System.Threading.Tasks.Task.ExecutionContextCallback (no source) callvirt Void System.Threading.Tasks.Task:Execute () #8 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #9 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #10 [0x00000038] System.Threading.Tasks.Task.ExecuteWithThreadLocal (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #11 [0x00000050] System.Threading.Tasks.Task.ExecuteEntry (no source) call Void System.Threading.Tasks.Task:ExecuteWithThreadLocal (Task) #12 [0x00000007] System.Threading.Tasks.ThreadPoolTaskScheduler.<>c.<.cctor>b__10_0 (no source) callvirt Boolean System.Threading.Tasks.Task:ExecuteEntry (Boolean) #13 [0x00000036] System.Threading.ThreadHelper.ThreadStart_Context (no source) callvirt Void System.Threading.ParameterizedThreadStart:Invoke (Object) #14 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #15 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #16 [0x0000002F] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #17 [0x0000001B] System.Threading.ThreadHelper.ThreadStart (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object) Thread #13 '' #0 [0x00000030] System.Threading.Thread.SleepInternal #1 [0x0000001A] System.Threading.Thread.Sleep (no source) call Void System.Threading.Thread:SleepInternal (Int32) #2 [0x0000002A] SoundManager.SoundsBase.SoundThread at /home/kuba/tmp/bve-src/source/SoundManager/Sounds.cs:226 Thread.Sleep(100); #3 [0x0000001F] System.Threading.ThreadHelper.ThreadStart_Context (no source) callvirt Void System.Threading.ThreadStart:Invoke () #4 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #5 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #6 [0x0000002F] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #7 [0x00000014] System.Threading.ThreadHelper.ThreadStart (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object) Thread #2 '' #0 [0x00000030] System.Threading.Thread.SleepInternal #1 [0x0000001A] System.Threading.Thread.Sleep (no source) call Void System.Threading.Thread:SleepInternal (Int32) #2 [0x00000052] LibRender2.BaseRenderer.vt at /home/kuba/tmp/bve-src/source/LibRender2/BaseRenderer.cs:715 Thread.Sleep(100); #3 [0x0000001F] System.Threading.ThreadHelper.ThreadStart_Context (no source) callvirt Void System.Threading.ThreadStart:Invoke () #4 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #5 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #6 [0x0000002F] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #7 [0x00000014] System.Threading.ThreadHelper.ThreadStart (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object) Thread #1 '' #0 [0x0000000B] OpenTK.Platform.X11.Glx.SwapBuffers #1 [0x00000072] OpenTK.Platform.X11.X11GLContext.SwapBuffers (no source) call Void OpenTK.Platform.X11.Glx:SwapBuffers (IntPtr, IntPtr) #2 [0x00000006] OpenTK.Graphics.GraphicsContext.SwapBuffers (no source) callvirt Void OpenTK.Graphics.IGraphicsContext:SwapBuffers () #3 [0x0000000C] OpenTK.GameWindow.SwapBuffers (no source) callvirt Void OpenTK.Graphics.IGraphicsContext:SwapBuffers () #4 [0x0000020C] OpenBve.OpenBVEGame.LoadingScreenLoop at /home/kuba/tmp/bve-src/source/OpenBVE/System/GameWindow.cs:1125 Program.currentGameWindow.SwapBuffers(); #5 [0x000001EB] OpenBve.OpenBVEGame.OnLoad at /home/kuba/tmp/bve-src/source/OpenBVE/System/GameWindow.cs:416 LoadingScreenLoop(); #6 [0x00000002] OpenTK.GameWindow.OnLoadInternal (no source) callvirt Void OpenTK.GameWindow:OnLoad (EventArgs) #7 [0x00000095] OpenTK.GameWindow.Run (no source) call Void OpenTK.GameWindow:OnLoadInternal (EventArgs) #8 [0x00000013] OpenTK.GameWindow.Run (no source) call Void OpenTK.GameWindow:Run (Double, Double) #9 [0x000002EC] OpenBve.MainLoop.StartLoopEx at /home/kuba/tmp/bve-src/source/OpenBVE/System/MainLoop.cs:115 Program.currentGameWindow.Run(); #10 [0x000007C1] OpenBve.Program.Main at /home/kuba/tmp/bve-src/source/OpenBVE/System/Program.cs:294 MainLoop.StartLoopEx(result); Thread #4 '' #0 [0x00000030] System.Threading.Thread.SleepInternal #1 [0x0000001A] System.Threading.Thread.Sleep (no source) call Void System.Threading.Thread:SleepInternal (Int32) #2 [0x00000015] OpenTK.Platform.Linux.LinuxJoystick.ProcessEvents (no source) call Void System.Threading.Thread:Sleep (Int32) #3 [0x0000001F] System.Threading.ThreadHelper.ThreadStart_Context (no source) callvirt Void System.Threading.ThreadStart:Invoke () #4 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #5 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #6 [0x0000002F] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #7 [0x00000014] System.Threading.ThreadHelper.ThreadStart (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object) Thread #5 '' #0 [0x00000000] Interop.Sys.Read #1 [0x00000014] Interop.Sys.Read (no source) call Int32 Sys:Read (IntPtr, Byte*, Int32) #2 [0x00000037] System.IO.CoreFX.FileSystemWatcher.RunningInstance.TryReadEvent (no source) call Int32 Sys:Read (SafeFileHandle, Byte*, Int32) #3 [0x000002D4] System.IO.CoreFX.FileSystemWatcher.RunningInstance.ProcessEvents (no source) call Boolean RunningInstance:TryReadEvent (NotifyEvent) #4 [0x00000006] System.IO.CoreFX.FileSystemWatcher.RunningInstance.<>c.b__14_0 (no source) callvirt Void RunningInstance:ProcessEvents () #5 [0x0000002C] System.Threading.Tasks.Task.InnerInvoke (no source) callvirt Void System.Action`1:Invoke (Object) #6 [0x00000001] System.Threading.Tasks.Task.Execute (no source) callvirt Void System.Threading.Tasks.Task:InnerInvoke () #7 [0x00000006] System.Threading.Tasks.Task.ExecutionContextCallback (no source) callvirt Void System.Threading.Tasks.Task:Execute () #8 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #9 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #10 [0x00000038] System.Threading.Tasks.Task.ExecuteWithThreadLocal (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #11 [0x00000050] System.Threading.Tasks.Task.ExecuteEntry (no source) call Void System.Threading.Tasks.Task:ExecuteWithThreadLocal (Task) #12 [0x00000007] System.Threading.Tasks.ThreadPoolTaskScheduler.<>c.<.cctor>b__10_0 (no source) callvirt Boolean System.Threading.Tasks.Task:ExecuteEntry (Boolean) #13 [0x00000036] System.Threading.ThreadHelper.ThreadStart_Context (no source) callvirt Void System.Threading.ParameterizedThreadStart:Invoke (Object) #14 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #15 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #16 [0x0000002F] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #17 [0x0000001B] System.Threading.ThreadHelper.ThreadStart (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object) Thread #7 '' Backtrace for this thread is unavailable Thread #9 '' Backtrace for this thread is unavailable Thread #3 '' #0 [0x00000000] OpenTK.Platform.X11.XI2MouseKeyboard.IsEventValid (no source) ldc.i4.0 #1 [0x00000016] OpenTK.Platform.X11.Functions.XIfEvent #2 [0x0000003C] OpenTK.Platform.X11.XI2MouseKeyboard.ProcessEvents (no source) call Boolean OpenTK.Platform.X11.Functions:XIfEvent (IntPtr, XEvent, IntPtr, IntPtr) #3 [0x0000001F] System.Threading.ThreadHelper.ThreadStart_Context (no source) callvirt Void System.Threading.ThreadStart:Invoke () #4 [0x00000073] System.Threading.ExecutionContext.RunInternal (no source) callvirt Void System.Threading.ContextCallback:Invoke (Object) #5 [0x00000004] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:RunInternal (ExecutionContext, ContextCallback, Object, Boolean) #6 [0x0000002F] System.Threading.ExecutionContext.Run (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object, Boolean) #7 [0x00000014] System.Threading.ThreadHelper.ThreadStart (no source) call Void System.Threading.ExecutionContext:Run (ExecutionContext, ContextCallback, Object) ```
JakubVanek commented 3 months ago

It's a little complex, but basically, if we add an item to a queue from one thread (length > 0), but at the same time are returning the pulse signal from another, one of the two can be silently discarded, despite the fact a lock was in use.

I've missed this, I agree that this could also be causing problems on the loading screen.

ginga81 commented 3 months ago

I immediately tried the daily build, but unfortunately it was at 0%. What confuses me is that on my current Ubuntu 22.04, 1.7.2.0, 1.8.0.2, and 1.9.0.3 are all stuck at 0%. From this, I suspect that it's not an OpenBVE problem, but rather that it's due to an update to mono or OpenGL or something, but what do you think? In any case, I'm sure that it has stopped working, so the fact remains that we need to find the cause and fix it...

leezer3 commented 3 months ago

Have you tried setting the environment variable mentioned in the first message?

If not, please try the build which will generate in a few minutes, which will automatically set this variable. That's a sub-par solution, but possibly the best we can get if it works.


For what it's worth, after looking into the OpenTK code, I can't see this being anything other than a MESA bug. Literally all that's inside the freezing call is an external call to glXSwapBuffers That really shouldn't be blocking on MESA, assuming XInitThreads has been called correctly. I'll do some more digging, but I suspect it'll need someone more competent than me to actually figure out what MESA is doing wrong.

ginga81 commented 3 months ago

Sorry for not setting the environment variables. I just tried it with the build, and when I started it from the command line, 'ATTENTION: default value of option mesa_glthread overridden by environment.' was displayed on the terminal, and it loaded normally! The long-running problem has finally been solved. I can't believe it was caused by a MESA bug... Anyway, thank you so much. I think that's resolved for now. I'll keep an eye on it for a while.

JakubVanek commented 3 months ago

I'm now swayed towards this being a Mesa bug as well. I can reproduce this on the generic Zink OpenGL-on-Vulkan driver (MESA_LOADER_DRIVER_OVERRIDE=zink environment variable) and the backtraces in https://gitlab.freedesktop.org/mesa/mesa/-/issues/8994 look similar to the OpenBVE backtraces.

leezer3 commented 3 months ago

Only thing I do wonder is if OpenTK is mis-handling a failing call somewhere, as nested locking within a thread is absolutely possible with X. If it hasn't called to release the lock exactly the same number of times as it took it, then this might actually be expected. (if IMHO atrocious design on the part of X) https://linux.die.net/man/3/xinitthreads

I'll try and see about looking into that, but OpenTK build isn't liking the main Windows machine at the minute....

JakubVanek commented 3 months ago

I have now tested the latest nightly OpenBVE build on the AMD machine (where the mesa_glthread setting was implicitly enabled before) and can confirm that the issue does not appear there anymore.

JakubVanek commented 3 months ago

I have reproduced the Mesa deadlock using a tiny C app and I have created a Mesa bugreport for that - https://gitlab.freedesktop.org/mesa/mesa/-/issues/11558.