google-deepmind / mujoco

Multi-Joint dynamics with Contact. A general purpose physics simulator.
https://mujoco.org
Apache License 2.0

GLFW crashes on aarch64 wayland #1693

Closed m8dotpie closed 1 month ago

m8dotpie commented 1 month ago

To reproduce, it is enough to run python -m mujoco.viewer. Strangely, the crash is not deterministic: after I restart the user session, the viewer works fine for a while, but then it freezes and crashes with no way to recover. I tried specifying all 3 possible backends; all lead to the same behavior.

Crash log:

/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/glfw/__init__.py:914: GLFWError: (65548) b'Wayland: The platform does not provide the window position'
Fatal Python error: AbortedFWError)

Thread 0x0000ffff13fff1a0 (most recent call first):
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/viewer.py", line 248 in _physics_loop
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/threading.py", line 953 in run
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x0000ffff741f4920 (most recent call first):
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/viewer.py", line 401 in _launch_internal
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/viewer.py", line 416 in launch
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/viewer.py", line 494 in main
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/absl/app.py", line 254 in _run_main
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/absl/app.py", line 308 in run
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/viewer.py", line 496 in <module>
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/runpy.py", line 86 in _run_code
  File "/home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 13)
[1]    8625 IOT instruction (core dumped)  python -m mujoco.viewer

Full backtrace from GDB:

#0  0x0000aaaaaac82570 in faulthandler_fatal_error ()
#1  0x0000fffff7ff4800 in <signal handler called> ()
#2  __pthread_kill_implementation (threadid=281474842183968, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = 9469
        ret = 0
        pd = 0xfffff7fb4920
        old_mask = {__val = {187650006555312}}
        ret = <optimized out>
#3  0x0000fffff7ca8650 [PAC] in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#4  0x0000fffff7c55a00 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#5  0x0000fffff7c40288 [PAC] in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {281474842185824, 281474840291056, 32, 11, 281474176007090, 281474176261680, 616416566748138496, 281474976670640, 13229322959394464, 187650006555312, 281474976671176, 281474176118496, 1, 1, 281474176007090, 281474976670672}}, sa_flags = -971298024, sa_restorer = 0xaaaaac135180}
#6  0x0000ffffd004c234 [PAC] in wl_abort (fmt=<optimized out>) at ../src/wayland-util.c:462
        argp = {__stack = 0xffffffff6520, __gr_top = 0xffffffff6520, __vr_top = 0xffffffff64e0, __gr_offs = -56, __vr_offs = -128}
#7  0x0000ffffd004acd4 in wl_closure_invoke (closure=0xaaaaac135180, flags=1, target=<optimized out>, opcode=1, data=<optimized out>) at ../src/connection.c:1022
        count = 1
        cif = {abi = FFI_SYSV, nargs = 3, arg_types = 0xffffffff65e8, rtype = 0xfffff7b97370 <ffi_type_void>, bytes = 32, flags = 0}
        ffi_types = {0xfffff7b97298 <ffi_type_pointer>, 0xfffff7b97298 <ffi_type_pointer>, 0xfffff7b972f8 <ffi_type_uint32>, 0xfffff7b972f8 <ffi_type_uint32>, 0xfffff7b972f8 <ffi_type_uint32>, 0xfffff7b972f8 <ffi_type_uint32>, 0xfffff7b972f8 <ffi_type_uint32>, 0xffffffff6780, 0xffffd00485f0 <wl_display_flush+64>, 0xaaaaab516c20, 0xfffff7d05f20 <__GI___poll>, 0x0, 0xaaaaab516d38, 0xffffffff6660, 0x45fffff7cb4840, 0xffffffff6690, 0xdfffff7cb5e4c, 0xaaaaac0ea6a0, 0xfffff7de0a50 <main_arena>, 0xf0, 0xaaaaac0ea790, 0xffffffff66d0}
        ffi_args = {0xffffffff65b0, 0xffffffff65b8, 0xaaaaac135198, 0xfffff7de6000 <__pthread_keys+14928>, 0xfffff7de0a50 <main_arena>, 0x0, 0xaaaaac14a998, 0xffffffff6720, 0xffffff7cb8bd0, 0xaaaaac0ea6a0, 0xfffff7fb5060, 0xfffff7de66f0 <global_max_fast>, 0x20, 0xb, 0xaaaaac0ea6b0, 0x0, 0xffffd004ba80 <wl_list_empty>, 0xffffffff6760, 0x1dffffd00465b4, 0xffffffff6750, 0xfffffd0046884, 0xaaaaac0ea7a0}
        implementation = 0xffffd047eee0 <dataOfferListener>
#8  0x0000ffffd0046920 in dispatch_event (display=0xaaaaab516c20, queue=0xaaaaab516d10) at ../src/wayland-client.c:1631
        closure = 0xaaaaac135180
        proxy = 0xaaaaac0ea7a0
        opcode = 1
        proxy_destroyed = <optimized out>
#9  0x0000ffffd0048468 in dispatch_queue (queue=0xaaaaab516d10, display=0xaaaaab516c20) at ../src/wayland-client.c:1777
        count = 6
        count = <optimized out>
        err = <optimized out>
#10 wl_display_dispatch_queue_pending (display=0xaaaaab516c20, queue=0xaaaaab516d10) at ../src/wayland-client.c:2019
        ret = <optimized out>
#11 0x0000ffffd043d070 in handleEvents () at /home/m8dotpie/miniforge3/envs/sber-croc/lib/libglfw.so.3.4
#12 0x0000ffffd0440728 in _glfwPollEventsWayland () at /home/m8dotpie/miniforge3/envs/sber-croc/lib/libglfw.so.3.4
#13 0x0000ffffd03c9098 in ??? () at /home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/_simulate.cpython-310-aarch64-linux-gnu.so
#14 0x0000ffffd03b8560 in ??? () at /home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/_simulate.cpython-310-aarch64-linux-gnu.so
#15 0x0000ffffd03abd3c in ??? () at /home/m8dotpie/miniforge3/envs/sber-croc/lib/python3.10/site-packages/mujoco/_simulate.cpython-310-aarch64-linux-gnu.so
#16 0x0000aaaaaacb8edc in cfunction_call ()
#17 0x0000aaaaaab1c3c0 in _PyObject_MakeTpCall ()
#18 0x0000aaaaaaca082c in method_vectorcall ()
#19 0x0000aaaaaab0c1c0 in _PyEval_EvalFrameDefault ()
#20 0x0000aaaaaabb7460 in _PyEval_Vector ()
#21 0x0000aaaaaab0ae90 in _PyEval_EvalFrameDefault ()
#22 0x0000aaaaaabb7460 in _PyEval_Vector ()
#23 0x0000aaaaaab0b0e0 in _PyEval_EvalFrameDefault ()
#24 0x0000aaaaaabb7460 in _PyEval_Vector ()
#25 0x0000aaaaaab0b0e0 in _PyEval_EvalFrameDefault ()
#26 0x0000aaaaaabb7460 in _PyEval_Vector ()
#27 0x0000aaaaaab0b0e0 in _PyEval_EvalFrameDefault ()
#28 0x0000aaaaaabb7460 in _PyEval_Vector ()
#29 0x0000aaaaaab0c1c0 in _PyEval_EvalFrameDefault ()
#30 0x0000aaaaaabb7460 in _PyEval_Vector ()
#31 0x0000aaaaaabb7654 in PyEval_EvalCode ()
#32 0x0000aaaaaace6704 in builtin_exec ()
#33 0x0000aaaaaacb9878 in cfunction_vectorcall_FASTCALL ()
#34 0x0000aaaaaab0b0e0 in _PyEval_EvalFrameDefault ()
#35 0x0000aaaaaabb7460 in _PyEval_Vector ()
#36 0x0000aaaaaab0b0e0 in _PyEval_EvalFrameDefault ()
#37 0x0000aaaaaabb7460 in _PyEval_Vector ()
#38 0x0000aaaaaab0ecc0 in pymain_run_module ()
#39 0x0000aaaaaab0f390 in Py_RunMain ()
#40 0x0000aaaaaab0fec4 in Py_BytesMain ()
#41 0x0000fffff7c40a1c in __libc_start_call_main (main=main@entry=0xaaaaaab04f90 <main>, argc=argc@entry=4, argv=argv@entry=0xffffffffe678) at ../sysdeps/nptl/libc_start_call_main.h:58
        self = <optimized out>
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281474976704120, 4, 187649987031996, 187649984843664, 281474976704160, 281474842475312, 0, 281474842476544, 0, 0, 281474976704000, 4040905266174833222, 281470681743369, 4040905266309422750, 0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x4, 0xaaaaaad1b3bc <__libc_csu_init>}, data = {prev = 0x0, cleanup = 0x0, canceltype = 4}}}
        not_first_call = <optimized out>
#42 0x0000fffff7c40afc [PAC] in __libc_start_main_impl (main=0xaaaaaab04f90 <main>, argc=4, argv=0xffffffffe678, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>)
    at ../csu/libc-start.c:360
#43 0x0000aaaaaab0e6b8 [PAC] in _start ()

Let me know if you need anything else or if there is something I should try to resolve this.

Context:

m8dotpie commented 1 month ago

I tried running the precompiled binaries outside the conda environment and they seem to work fine. So I assume the error is on the side of the glfw in the env?

UPD 1. However, what I ran was ./simulate, and I am not sure whether that should reproduce the behaviour.

UPD 2. The Python bindings outside the conda env do not work either.

m8dotpie commented 1 month ago

Yes, the problem is in the PyGLFW bindings provided by the env. To resolve the issue I set the environment variable PYGLFW_LIBRARY=/usr/lib/libglfw.so and reinstalled the bindings. This fixed the crash, although the solution is cumbersome to repeat for each env.

Moreover, I was probably confusing the glfw from conda-forge with the one from pip; they are obviously two separate things. So setting the environment variable to point at the desired glfw library should be enough.
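
For anyone who wants to avoid exporting the variable by hand in every env, here is a minimal sketch of the same workaround done from Python. It assumes that PyGLFW reads PYGLFW_LIBRARY when glfw is first imported, so the variable has to be set before mujoco.viewer (or glfw) is imported. The helper name prefer_system_glfw and the candidate list are hypothetical, and /usr/lib/libglfw.so is just the path from the comment above; it may differ on other distributions.

```python
import os
from pathlib import Path


def prefer_system_glfw(candidates=("/usr/lib/libglfw.so",), environ=os.environ):
    """Set PYGLFW_LIBRARY to the first candidate that exists on disk.

    Hypothetical helper, not part of mujoco or PyGLFW. An already-set
    PYGLFW_LIBRARY is respected and returned unchanged. Returns the chosen
    path, or None if no candidate file was found.
    """
    existing = environ.get("PYGLFW_LIBRARY")
    if existing:
        return existing
    for candidate in candidates:
        if Path(candidate).is_file():
            environ["PYGLFW_LIBRARY"] = candidate
            return candidate
    return None


if __name__ == "__main__":
    # Call this before the first `import glfw` / `import mujoco.viewer`,
    # because the library path is assumed to be resolved at import time.
    chosen = prefer_system_glfw()
    print("PYGLFW_LIBRARY:", chosen)
```

After calling the helper, importing mujoco.viewer in the same process should pick up the system library instead of the one shipped in the env, provided the assumption about import-time lookup holds for the installed PyGLFW version.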

yuvaltassa commented 1 month ago

Well done for finding a workaround.