JakubVanek opened this issue 4 months ago
OK, that's interesting...
The only Intel hardware I have is a pretty elderly integrated GPU (HD 3000), which normally runs only Windows / Hackintosh. I might be able to get this to reproduce on it.
Looking into the code, the default timeout for texture loading in the main game is 1000ms (https://github.com/leezer3/OpenBVE/blob/master/source/OpenBVE/System/GameWindow.cs#L1180). You're linking to the Route Viewer method; I'm not sure offhand why there's no timeout there, possibly an oversight. Panel textures have a 20000ms timeout because they need the texture sizes to calculate element positioning, and we've had cases of people using ~20MB JPG files for panels.
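A texture-loading timeout like the 1000ms / 20000ms ones above is a bounded wait: the caller waits for a loader thread to finish, but gives up after a deadline. OpenBVE itself is C#. The following is only a minimal C/pthreads sketch of that general pattern. It assumes a loader thread that sets a `done` flag. `texture_job` and its functions are hypothetical names, not OpenBVE's API. The 1000 or 20000 values would be passed as `timeout_ms`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical job record: a loader thread sets `done` once the texture is ready. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool done;
} texture_job;

void texture_job_init(texture_job *job) {
    pthread_mutex_init(&job->mutex, NULL);
    pthread_cond_init(&job->cond, NULL);
    job->done = false;
}

/* Called by the loader thread when the texture data is available. */
void texture_job_complete(texture_job *job) {
    pthread_mutex_lock(&job->mutex);
    job->done = true;
    pthread_cond_broadcast(&job->cond);
    pthread_mutex_unlock(&job->mutex);
}

/* Waits at most timeout_ms for the job.
 * Returns true if it finished in time, false if the wait timed out. */
bool texture_job_wait(texture_job *job, long timeout_ms) {
    assert(timeout_ms >= 0);

    /* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline by default. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&job->mutex);
    int rc = 0;
    /* Re-check the flag after every wake-up (spurious wake-ups are allowed);
     * stop on timeout (ETIMEDOUT) or any other error. */
    while (!job->done && rc == 0)
        rc = pthread_cond_timedwait(&job->cond, &job->mutex, &deadline);
    bool finished = job->done;
    pthread_mutex_unlock(&job->mutex);
    return finished;
}
```

If nobody calls `texture_job_complete`, `texture_job_wait` returns `false` after roughly `timeout_ms`. If the job already finished, it returns `true` without waiting.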
I suspect machine- and disk-dependent timing plays a strong part in this as well.
I'll see what I can do about reproducing with that.
Good, that reproduces on the VM with mesa_glthread=true set.
Going to do a little excavating...
OK, I think we had a livelock problem there. I've changed it to use ConcurrentQueues (instead of the original lock + two queues, which I think was causing this) and simplified it slightly.
I think the reason more people haven't reported this is simply that the timing here was absolutely critical.
@ginga81 The build from today will hopefully resolve your freeze when not running under WINE. If not, it'd be very interesting to see if setting the environment variable above helps you.
(I could probably just set this environment variable process-wide, but I'd quite like to actually fix the issue rather than mask it.)
If I understand the fix correctly, it does not eliminate the deadlock; it just makes it less likely.
This might be fixed by just removing the `using (new XLock(Display))` from OpenTK's `X11GLContext.SwapBuffers()` (truth be told, I do not know if that is safe to do). It seemed to me that the bug was not in OpenBVE directly; rather, it was a deadlock between OpenTK and Mesa. The infinite loading screen may still happen if the deadlock occurs: `LoadingScreenLoop()` would not return in that case, because it would be stuck in `SwapBuffers()`.
However, I wouldn't rule out that we are seeing two different bugs. The bug I was seeing can be distinguished by the fact that `SwapBuffers()` calls initially succeed, but once the deadlock occurs, the call never returns (so `LoadingScreenLoop()` stops polling the queue and never returns either). My wording in the initial bug report was unfortunate: it implies that any call to `SwapBuffers()` triggers the deadlock, which is not the case.
I have tested the commit https://github.com/leezer3/OpenBVE/commit/71689f64a84d8d65b8cb4f0258f407a5bb2cbd04 on the Intel machine, and the game still gets stuck when I use `mesa_glthread=true`. The GDB backtraces seem to be the same.
The deadlock seems to be fully reproducible: the game never loads successfully with the `mesa_glthread=true` environment variable and `preferNativeBackend` set to `true` in the OpenBVE config file.
What I've changed fixes an issue with the queues.
It's a little complex, but basically: if we add an item to a queue from one thread (so length > 0) while another thread is returning the pulse signal at the same time, one of the two can be silently discarded, even though a lock was in use (see e.g. https://stackoverflow.com/questions/3956127/threading-problem-with-monitor-wait-and-monitor-pulse).
This then leaves the loop stuck waiting for a return.
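The usual remedy for this kind of lost wake-up is to have the waiter check the shared state (the queue length) under the lock before it waits, instead of relying on the signal alone. OpenBVE's actual fix uses .NET's ConcurrentQueue. The following is only a minimal C/pthreads sketch of the underlying principle. The `work_queue` type and its functions are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAPACITY 16

/* Hypothetical bounded queue shared by a producer thread and the render loop. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t not_empty;
    int items[QUEUE_CAPACITY];
    int head;
    int count;
} work_queue;

void queue_init(work_queue *q) {
    pthread_mutex_init(&q->mutex, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    q->head = 0;
    q->count = 0;
}

/* Producer: update the state and signal while holding the mutex.
 * Returns false if the queue is full. */
bool queue_push(work_queue *q, int item) {
    pthread_mutex_lock(&q->mutex);
    bool ok = q->count < QUEUE_CAPACITY;
    if (ok) {
        q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
        q->count++;
        pthread_cond_signal(&q->not_empty);
    }
    pthread_mutex_unlock(&q->mutex);
    return ok;
}

/* Consumer: blocks until an item is available.
 * The predicate (count == 0) is tested under the same mutex *before* waiting.
 * If the producer signalled while nobody was waiting, the signal itself is
 * dropped, but the item is still counted, so the consumer never waits for a
 * wake-up that will not come. Waiting unconditionally for the signal is the
 * pattern that can hang. */
int queue_pop(work_queue *q) {
    pthread_mutex_lock(&q->mutex);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->mutex);
    assert(q->count > 0);
    int item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_mutex_unlock(&q->mutex);
    return item;
}
```

Pushing two items before anyone calls `queue_pop` sends both signals to no waiter. The later `queue_pop` calls still return the items immediately, in order.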
It's a pity that this doesn't solve it for you. The lock inside OpenTK is something I'm not too happy about messing with; there are a lot of pointers and magic numbers being thrown around, and if something changes mid-render-call (a screensaver kicking in is the obvious example I can think of), I suspect it'll crash out.
Let me think on it a bit.
I finally got the Mono SDB debugger running on the Intel machine. Here are the C# stacktraces when the deadlock happens:
> It's a little complex, but basically, if we add an item to a queue from one thread (length > 0), but at the same time are returning the pulse signal from another, one of the two can be silently discarded, despite the fact a lock was in use.
I missed this; I agree that this could also be causing problems on the loading screen.
I immediately tried the daily build, but unfortunately it was still stuck at 0%. What confuses me is that on my current Ubuntu 22.04, versions 1.7.2.0, 1.8.0.2 and 1.9.0.3 are all stuck at 0% too. From this, I suspect it's not an OpenBVE problem but is caused by an update to Mono, OpenGL or something similar. What do you think? In any case, it has definitely stopped working, so we still need to find the cause and fix it...
Have you tried setting the environment variable mentioned in the first message?
If not, please try the build that will be generated in a few minutes; it sets this variable automatically. That's a sub-par solution, but if it works, it's possibly the best we can get.
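Setting the variable from inside the program only helps if it happens before the GL driver is initialised, and it should not override a value the user chose explicitly. OpenBVE does this in C#, and how it does so is not shown here. The following is only a minimal C sketch of the idea using POSIX `setenv` with `overwrite = 0`. The function name `disable_mesa_glthread` is hypothetical. `setenv` is not thread-safe, so a call like this belongs at the very start of `main`, before other threads exist.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sets mesa_glthread=false for this process unless the user already set it.
 * Must run before any GL context is created (assumption: the driver reads
 * its options when it initialises).
 * Returns 1 if glthread ends up disabled, 0 if a user-supplied value wins. */
int disable_mesa_glthread(void) {
    /* overwrite = 0 keeps an explicit user choice, e.g. mesa_glthread=true
     * when someone wants to reproduce the freeze. */
    int rc = setenv("mesa_glthread", "false", 0);
    assert(rc == 0);
    (void)rc; /* avoids an unused-variable warning when NDEBUG is defined */
    const char *value = getenv("mesa_glthread");
    return value != NULL && strcmp(value, "false") == 0;
}
```

When Mesa picks up such an override, it prints a notice like the "ATTENTION: default value of option mesa_glthread overridden by environment." message reported later in this thread.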
For what it's worth, after looking into the OpenTK code, I can't see this being anything other than a MESA bug.
Literally all that's inside the freezing call is an external call to `glXSwapBuffers`. That really shouldn't be blocking on MESA, assuming `XInitThreads` has been called correctly.
I'll do some more digging, but I suspect it'll need someone more competent than me to actually figure out what MESA is doing wrong.
Sorry for not setting the environment variable. I just tried the new build: when I started it from the command line, 'ATTENTION: default value of option mesa_glthread overridden by environment.' was displayed in the terminal, and it loaded normally! This long-standing problem has finally been solved. I can't believe it was caused by a MESA bug... Anyway, thank you so much. I think that's resolved for now; I'll keep an eye on it for a while.
I'm now swayed towards this being a Mesa bug as well. I can reproduce this on the generic Zink OpenGL-on-Vulkan driver (`MESA_LOADER_DRIVER_OVERRIDE=zink` environment variable), and the backtraces in https://gitlab.freedesktop.org/mesa/mesa/-/issues/8994 look similar to the OpenBVE backtraces.
The only thing I do wonder is whether OpenTK is mishandling a failing call somewhere, as nested locking within a thread is absolutely possible with X. If it hasn't released the lock exactly as many times as it took it, then this might actually be expected behaviour (if, IMHO, an atrocious design on the part of X): https://linux.die.net/man/3/xinitthreads
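The nesting rule on that man page says the display is not actually unlocked until `XUnlockDisplay` has been called as many times as `XLockDisplay`. It can be modelled with a recursive mutex. This is only an illustrative C/pthreads sketch. Xlib implements its display lock differently, and `display_lock_init` and `other_thread_can_lock` are hypothetical helpers. The sketch shows that one missing unlock leaves the lock held, which blocks every other thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* A recursive mutex stands in for the X display lock (illustration only). */
void display_lock_init(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Thread body: try to take the lock without blocking, report the result. */
static void *try_lock_from_other_thread(void *arg) {
    pthread_mutex_t *m = arg;
    bool acquired = pthread_mutex_trylock(m) == 0;
    if (acquired)
        pthread_mutex_unlock(m);
    return acquired ? (void *)1 : NULL;
}

/* Returns true if a *different* thread could take the lock right now. */
bool other_thread_can_lock(pthread_mutex_t *m) {
    pthread_t t;
    void *result = NULL;
    int rc = pthread_create(&t, NULL, try_lock_from_other_thread, m);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, &result);
    return result != NULL;
}
```

Suppose a thread locks twice and unlocks once. `other_thread_can_lock` then reports `false`, and only after the second unlock does it report `true`.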
I'll try to look into that, but the OpenTK build isn't liking the main Windows machine at the minute...
I have now tested the latest nightly OpenBVE build on the AMD machine (where the `mesa_glthread` setting was implicitly enabled before) and can confirm that the issue no longer appears there.
I have reproduced the Mesa deadlock using a tiny C app and have filed a Mesa bug report for it: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11558.
Description
On some Linux OpenGL drivers, OpenBVE will freeze during the "Loading track" screen. The CPU is just idle in that state.
Reproduction
On machines with AMD GPUs, the `mesa_glthread=true` part is not even necessary. This is because newer Mesa releases enable OpenGL threaded dispatch by default on these GPUs: https://www.phoronix.com/news/Mesa-22.3-RadeonSI-glthread-On
Cause
I have debugged this and it turns out that OpenTK deadlocks the Mesa driver. The game freezes because the `SwapBuffers()` call never returns: https://github.com/leezer3/OpenBVE/blob/2539167a5b454e059291c7779e5152bd93c98cbe/source/OpenBVE/System/GameWindow.cs#L1125. The cause is that OpenTK calls `XLockDisplay()` before calling `Glx.SwapBuffers()`. The Mesa driver in glthread mode also wants to acquire the lock. The situation can be described like this:

1. OpenBVE thread (inside OpenTK): `XLockDisplay()`
2. OpenBVE thread: `Glx.SwapBuffers()`, which waits for the Mesa GL thread
3. Mesa GL thread: an `XLockDisplay()` equivalent, likely to be able to access the X11 socket

I have found https://github.com/opentk/opentk/pull/691, which looks somewhat similar, but it applies only to other GLX calls.
When the glthread mode is disabled, I assume that this deadlock will not happen because the GL commands will be processed from within the OpenBVE thread.
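The three steps above form a lock-versus-fence cycle. The application thread holds the display lock while it waits for a fence, but the worker thread needs that same lock before it can signal the fence. The following is a simplified C/pthreads model of that cycle, not Mesa or OpenTK code. `display_lock` stands in for the X11 display lock, `fence` stands in for the glthread queue fence, and `simulate_swap` is hypothetical. The wait has a timeout only so the model can report the hang instead of freezing. Passing `hold_lock_while_waiting = false` corresponds to the idea discussed in the thread of not holding the lock across the swap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

typedef struct {
    pthread_mutex_t display_lock; /* stands in for the X11 display lock */
    pthread_mutex_t fence_mutex;  /* protects fence_signalled */
    pthread_cond_t fence_cond;
    bool fence_signalled;         /* stands in for the glthread fence */
} swap_model;

/* "glthread" worker: must take the display lock (to use the X11 socket)
 * before it can signal the fence the application thread is waiting on. */
static void *gl_worker(void *arg) {
    swap_model *m = arg;
    pthread_mutex_lock(&m->display_lock);
    /* ... would send the swap request to the X server here ... */
    pthread_mutex_unlock(&m->display_lock);

    pthread_mutex_lock(&m->fence_mutex);
    m->fence_signalled = true;
    pthread_cond_signal(&m->fence_cond);
    pthread_mutex_unlock(&m->fence_mutex);
    return NULL;
}

/* Simulates one SwapBuffers call on the application thread.
 * Returns true if the fence was signalled within timeout_ms,
 * false if the two threads deadlocked. */
bool simulate_swap(bool hold_lock_while_waiting, long timeout_ms) {
    swap_model m = { .fence_signalled = false };
    pthread_mutex_init(&m.display_lock, NULL);
    pthread_mutex_init(&m.fence_mutex, NULL);
    pthread_cond_init(&m.fence_cond, NULL);

    if (hold_lock_while_waiting)
        pthread_mutex_lock(&m.display_lock); /* like XLockDisplay() before the swap */

    pthread_t worker;
    int rc = pthread_create(&worker, NULL, gl_worker, &m);
    assert(rc == 0);
    (void)rc;

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    /* Wait for the fence, as the swap call does. */
    pthread_mutex_lock(&m.fence_mutex);
    int wait_rc = 0;
    while (!m.fence_signalled && wait_rc == 0)
        wait_rc = pthread_cond_timedwait(&m.fence_cond, &m.fence_mutex, &deadline);
    bool completed = m.fence_signalled;
    pthread_mutex_unlock(&m.fence_mutex);

    if (hold_lock_while_waiting)
        pthread_mutex_unlock(&m.display_lock); /* lets the stuck worker finish */
    pthread_join(worker, NULL);

    pthread_cond_destroy(&m.fence_cond);
    pthread_mutex_destroy(&m.fence_mutex);
    pthread_mutex_destroy(&m.display_lock);
    return completed;
}
```

With `hold_lock_while_waiting = true`, the worker can never signal the fence before the timeout, so the model reports a deadlock. Without the lock held, the swap completes. This mirrors why the freeze needs both the held lock and a separate GL thread.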
Workarounds
- Set the `mesa_glthread=false` environment variable. This switches the Mesa driver into a mode where this bug does not happen.
- Create `/usr/share/drirc.d/01-openbve.conf` with the following DRI configuration file contents:
  ```xml
  …
  ```
- Set the `preferNativeBackend` OpenBVE option in `~/.config/OpenBve/Settings/1.5.0/options.cfg` to `false`. This makes OpenTK use SDL2 for SwapBuffers, and SDL2 does not have this issue.

Related issues
It seems to me that https://github.com/leezer3/OpenBVE/issues/944 might be the same issue.
Route
Any route is likely sufficient, but http://md.archive.ubuntu.com/ubuntu/ubuntu/pool/universe/b/bve-route-cross-city-south/bve-route-cross-city-south_1.31.08-0ubuntu2_all.deb was used.
Train
Any train is likely sufficient, but https://packages.ubuntu.com/jammy/bve-train-br-class-323 was used.
Logs
These are from the Intel machine. If needed, I can also provide logs from the AMD machine, but these logs were fairly similar.
OpenBVE log.txt
```
OpenBVE Log: 2024-07-22 12:22:09
Program Version: v1.10.1.1
Attached Joysticks:
--------------------
--------------------
12:22:09 Using openGL 3.0 (new) renderer
12:22:09 Initialising game window of size 960 x 600
12:22:09 Creating game window with standard context.
12:22:09 Game window initialised successfully.
12:22:09 Renderer initialised successfully.
12:22:09 /usr/share/games/bve/Railway : Railway folder found.
12:22:09 INFO: 2 Route loading plugins available.
12:22:09 INFO: 6 Object loading plugins available.
12:22:09 INFO: 2 Sound loading plugins available.
12:22:09 Load in Advance is disabled
12:22:09 Loading route file: /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:09 INFO: Route file hash 2A19D371A0F6AB76613603C1B883A17A39A68773123D67FE7CDFB77DAC9C5FAD
12:22:09 Route file format is: CSV
12:22:09 INFO: Using the Japanese compatibility signal set.
12:22:13 RailIndex 3 does not reference an existing dike in Track.DikeEnd at line 1721, column 13 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 3 does not reference an existing dike in Track.DikeEnd at line 1729, column 13 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 2 does not reference an existing dike in Track.DikeEnd at line 1930, column 17 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 0 does not reference an existing dike in Track.DikeEnd at line 2062, column 4 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 4 does not reference an existing wall in Track.WallEnd at line 2166, column 5 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 1 does not reference an existing dike in Track.DikeEnd at line 2255, column 22 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 0 does not reference an existing dike in Track.DikeEnd at line 2583, column 3 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 2 does not reference an existing dike in Track.DikeEnd at line 2768, column 15 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 0 does not reference an existing wall in Track.WallEnd at line 2926, column 10 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 3 does not reference an existing wall in Track.WallEnd at line 2926, column 12 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 0 does not reference an existing dike in Track.DikeEnd at line 2926, column 14 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 1 does not reference an existing dike in Track.DikeEnd at line 2926, column 15 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 2 does not reference an existing dike in Track.DikeEnd at line 2926, column 16 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 3 does not reference an existing dike in Track.DikeEnd at line 2926, column 17 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 1 does not reference an existing wall in Track.WallEnd at line 2934, column 4 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 9 does not reference an existing dike in Track.DikeEnd at line 2980, column 2 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 9 does not reference an existing dike in Track.DikeEnd at line 2990, column 3 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 4 does not reference an existing dike in Track.DikeEnd at line 2990, column 4 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:13 RailIndex 1 does not reference an existing dike in Track.DikeEnd at line 3008, column 36 in file /usr/share/games/bve/Railway/Route/Birmingham_Cross-City_South/Day/323_Summer_2002_0931_Dry_Clear.csv
12:22:14 Route file loaded successfully.
12:22:14 Loading player train: /usr/share/games/bve/Train/BR_Class_323
12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 7 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 8 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 9 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a0 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 a1 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v1 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 v2 in section #ACCELERATION is expected to be greater than zero at line 10 in file /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 The #ACCELERATION section defines 8 curves, but the #HANDLE section defines 4 power notches in /usr/share/games/bve/Train/BR_Class_323/train.dat
12:22:14 Loading train panel: /usr/share/games/bve/Train/BR_Class_323/panel.animated
12:22:14 INFO: This train contains both a 2D and a 3D panel. The 3D panel will always take precedence
12:22:14 Train panel loaded sucessfully.
12:22:14 Loading sound.cfg file: /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 The SoundFile airzero.wav was not found at line 55 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 99 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 99 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 99 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 100 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 100 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 100 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 101 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 101 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 101 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 102 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 102 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 102 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 FileName contains illegal characters or is empty at line 117 in file /usr/share/games/bve/Train/BR_Class_323/sound.cfg
12:22:14 The #ACCELERATION section defines 0 curves, but the #HANDLE section defines 8 power notches in /usr/lib/openbve/Data/Compatibility/PreTrain/train.dat
```
GDB stacktrace of the Mesa GL thread when the deadlock happens
```
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x57fb3b781088) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x57fb3b781088) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x57fb3b781088, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007091d2093a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x57fb3ba117b0, cond=0x57fb3b781060) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x57fb3b781060, mutex=0x57fb3ba117b0) at ./nptl/pthread_cond_wait.c:627
--- JV NOTE: This is the Mesa GL thread wanting to lock the X11 lock ---
#5 0x00007091cdae71a5 in _XDisplayLockWait (dpy=0x57fb3ba1a8f0) at ../../src/locking.c:451
#6 0x00007091cdb0180c in return_socket (closure=0x57fb3ba1a8f0) at ../../src/xcb_io.c:56
#7 0x00007091cebb3b86 in get_socket_back (c=0x57fb3bb24510) at ../../src/xcb_out.c:96
#8 get_socket_back (c=c@entry=0x57fb3bb24510) at ../../src/xcb_out.c:87
#9 0x00007091cebbb5a6 in prepare_socket_request (c=0x57fb3bb24510) at ../../src/xcb_out.c:126
#10 send_fds (num_fds=0, fds=0x0, c=0x57fb3bb24510) at ../../src/xcb_out.c:196
#11 xcb_send_request_with_fds64 (c=0x57fb3bb24510, flags=flags@entry=1, vector=vector@entry=0x7091497f9600, req=req@entry=0x7091c5204c60
```
GDB stacktrace of the OpenBVE renderer thread when the deadlock happens
```
(gdb) bt
--- JV NOTE: This is the OpenBVE thread waiting for the Mesa GL thread ---
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000709185ec3446 in sys_futex (val3=-1, addr2=0x0, timeout=0x0, val1=2, op=9, addr1=0x57fb3cc2d408) at ../src/util/futex.c:43
#2 futex_wait (addr=addr@entry=0x57fb3cc2d408, value=value@entry=2, timeout=timeout@entry=0x0) at ../src/util/futex.c:55
#3 0x0000709185ec957f in do_futex_fence_wait (fence=0x57fb3cc2d408, timeout=timeout@entry=false, abs_timeout=abs_timeout@entry=0) at ../src/util/u_queue.c:130
#4 0x0000709185ec9bcd in _util_queue_fence_wait (fence=
```
If needed, I can also post the full C#/Mono backtrace. However, I remember the following:
Related information